So, I'm Natalie Stickel and I work at UHN in the Bioinformatics and HPC Core there, and for HPC for Health as well. I started out working on the research side of genomics but have since become primarily responsible for moving our pipelines into the clinical space. We process all of the clinical data for Toronto General Hospital's labs and for the Princess Margaret Cancer Centre as well.

The learning objectives of the module are to gain some insight into the complexity of clinical accreditation, to help you understand how the goals differ for research versus clinical testing, and to appreciate the importance of validation, which has already been mentioned, but validation of your infrastructure as well as your software and your results. We also want to understand some of the problems and pitfalls of a panel-based genetic test being used as a clinical test, and to gain some insight into the complexities of incorporating next-generation sequencing data into an e-health record.

This slide is meant to demonstrate that complexity. When you're dealing with clinical bioinformatics, it's really more than just processing the data from a panel, which is primarily what you do in research: you process it, do some annotation perhaps, and pass it off. On the clinical side, at least in our experience, it's that, but also the standardization of that process in all its aspects. Your pipeline, how you label things, and how you interact with the lab all have to be standard, and then where you store and archive the data, how it is archived, and how you can pull it back out of the archive all have to be very standard and documented. Privacy and security are of course at the forefront for any clinical data, and the same goes for next-generation sequencing results. And then there's a communication aspect: you're working with a clinical lab, there are a lot of time constraints, so you need to streamline and document the way you're going to communicate with them so that you don't waste time or hold them up.

For research, the primary goal is usually discovery. You're often looking for things that are unknown; it's exploratory. You're still interested, of course, in reproducibility, sample quality, and cost, but it's different from clinical testing, and clinical standard of care is even more different, in that you're really looking only for things that are actionable. So when you do the annotation, they want to focus on things that either have a clinical trial or a drug known to be actionable on that variant, or that are known to have prognostic value.

There are lots of standards involved in both clinical trial labs and clinical standard-of-care labs. CAP, CLIA, and OLA are the main requirements; we'll talk about those on the next slide. They cover things like validation, again of your infrastructure, your software, and your results. They also dictate how you have to version and track your pipeline, its pieces of software, and your SOPs. There has to be an audit trail on all actions and interactions with the data, and quality control of the samples: how do you detect a sample swap, and are you following those samples through the process from beginning to end so that you know you have the same sample and the same data? It's also really different from research in the way of thinking.
When you're interacting with the lab and with the clinicians, they're used to dealing with fairly standard, black-and-white lab results. Most other tests that are incorporated into clinical labs and used in standard of care have much more defined boundaries and much less fuzzy results; you've seen today how fuzzy next-generation sequencing data can be. That can be difficult, and communicating it back to the lab and the clinicians is part of the process of using one of these tests, particularly in standard of care.

They also often choose smaller panels of genes rather than whole-exome or whole-genome sequencing. Partly that's cost related: a lot of these tests are funded by the government and only funded for certain genes. They may be running a 50-gene panel but only reporting on 20 of those genes, because those are what the Ministry of Health will cover, and you wouldn't want to run an exome or a whole genome and then only report on five or 10 or 20 genes. But they also don't want incidental findings. If you're trying to report on a certain set of genes but you run a whole exome and find some other variant, outside your scope, that is possibly causative of some other disease, do you report that to the patient or not? There's a lot of guidance out there about that, but it's still a difficult situation, so they would rather not see those findings at all.

For accreditation, the main regulatory bodies are CAP and CLIA in the States. If you want to process samples that originate in the United States, even if you're in Canada, you need to have the CAP and CLIA standards, accreditation, and requirements in place. In Ontario we also have OLA, Ontario Laboratory Accreditation, which is related to the Institute for Quality Management in Healthcare. A lot of the requirements and standards for all of these are very similar, so if you're meeting the CAP and CLIA requirements you're most likely meeting the OLA and Institute for Quality Management requirements too, but you may have to report or document things in a slightly different way.

The basics of accreditation, in terms of what we as bioinformaticians have to worry about, are really the documentation. Everything has to be documented in clear, standard formats, and it then gets signed and dated by the laboratory director. Both the validation and any other documents related to the data processing have to be in place before the panel can be used and before they can sign off and sign out results as part of a clinical test.

Question from the audience: does it still have to be paper, or are electronically signed documents also legal and valid? Yeah, in our labs anyway, they do have the documents signed on paper. We have more than one lab location; they keep a paper copy at one location and decided that electronic copies are acceptable for the other locations, because really the documents are supposed to be present in all of the labs, even though they're under the same director. So they have not gone fully electronic, but I think that could happen in the future. It's probably acceptable; I'm not sure on that, honestly. Right now they very much stay with paper, with one signed copy at the hospital, because these documents have to be available for an inspector when they come.
The inspectors usually come within a prescribed time frame. It's not a totally random, show-up-one-day, completely unexpected visit; there's a set frequency, so you'll get a window when they'll show up, and the documentation needs to be there to show them. Generally they like to sit there and flip through the binders. That's how it has been in the past, but I think it will probably move to something more electronic going forward.

We also need to spot-test the pipelines using data sent by these accreditation bodies. They will send FASTQ files, we run them through the pipeline, and we send the results back in a blinded way: we don't know what their expected results are. You get a report back later, and this is done across many labs with the same set of data, so the report says, well, most people found these variants, and you get to see where you stand in the process.

Question from the audience: if you have a custom panel, how are they supposed to test it? They do it for a particular panel; they can't test every panel exactly. We have not gone through this process for all of the panels we have validated. It has been done for one standard, commercially available panel, not a custom gene panel, so if you're using that commercially available panel then you participate in this analysis. That does leave a lot of caveats. For example, so far, in the time I've been here, the panels they have used for these tests have all been amplicon-based, so you're not testing a different method like hybrid capture, and as you know our pipeline is quite different for those two. So it's only a litmus test; it's not the final answer.

The documentation for bioinformatics alone includes validation documentation for all the panels that you have. Again, if you're using a hybrid capture panel and you have multiple gene sets but they're all based on the same technology, it's actually acceptable to use a combination of data from all of them, since it's the same technology with different sets of genes. You can combine all of that together in one validation to increase the power: if you have a small set of genes in one panel and another panel with another set of genes, using all of that data and all of those variants together is more powerful, and the accreditors understand that, given the same technology, your results are going to be consistent, so it gives more power to the validation.

You need to document the pipeline details: which software you're using, what versions, what commands exactly, and where you downloaded the files used for the reference genome or dbSNP or anything else. All of those things have to be locked down. As was stated before, you can't just use the latest version without doing a whole lot of work again, so you choose one, you lock it down, and you stick with exactly that until you revalidate. The procedure for updating the panel or the pipeline, as I'm alluding to, also has to be documented. So yes, you have to lock it down and use only one version, but of course you will want to upgrade at some point; maybe there's a bug in one of the versions of software you're using, so you need a procedure in place to monitor for that and to implement the updates if necessary.
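As a rough illustration of what locking things down can look like in practice, here is a minimal sketch, not our actual pipeline code, of a pinned manifest of tool versions and reference-file checksums that gets verified before a run. The tool versions, version-query commands, file paths, and checksums below are placeholders invented for illustration.

```python
import hashlib
import subprocess

# Illustrative placeholders only, not a validated configuration.
PINNED_TOOLS = {
    # tool -> (expected version string, command that prints the version)
    "samtools": ("1.2", ["samtools", "--version"]),
}
PINNED_REFERENCES = {
    # reference file -> MD5 of the exact file that was used in the validation
    "/refs/hg19.fa": "9f2c8e4d0000000000000000deadbeef",
    "/refs/dbsnp_138.vcf": "3b7a1c900000000000000000cafef00d",
}

def md5sum(path, chunk=1 << 20):
    """MD5 of a file, read in chunks so large references don't fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()

def check_locked_down():
    """Refuse to run if any tool or reference differs from the signed-off manifest."""
    problems = []
    for tool, (version, command) in PINNED_TOOLS.items():
        out = subprocess.run(command, capture_output=True, text=True)
        if version not in (out.stdout + out.stderr):
            problems.append(f"{tool}: expected version {version}")
    for path, expected in PINNED_REFERENCES.items():
        if md5sum(path) != expected:
            problems.append(f"{path}: checksum differs from the validated file")
    if problems:
        raise RuntimeError("Pipeline is not in its validated state:\n" + "\n".join(problems))

if __name__ == "__main__":
    check_locked_down()
```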
You also have standard operating procedures: for someone to run the pipeline and process the results, they have to follow that document exactly. And we have an exception log: if something happens, as it occasionally does, where something is wrong with the files or something goes wrong during processing, you have to document what happened, what the change was, and what you did about it. We also have a data security policy, which stipulates how we're protecting the data and how we audit our storage, archival, and retrieval processes.

Part of our pipeline documentation, since ours is a custom-built analysis pipeline, is a full manual that details all of the steps, all of the commands involved, the expected results, what kinds of files go in and what kinds of files come out, and where variants might fall out of the process because of restrictions we have in the variant caller, mapping quality, base quality, and so on. All of that is documented, and the manual is at about 50 pages at the moment, so it's quite extensive.

We also have each step in the pipeline write to a log file with a timestamp, and we can use those files for error checking. We're using a cluster, submitting multiple jobs across it, and sometimes a node will fail and some of the jobs won't complete or won't be processed properly. When you're launching thousands of jobs at a time, how can you tell? Because each step writes its exit status back to these files, at the end we can check whether the number of jobs that finished successfully matches the number of jobs submitted, or whether some jobs failed.

This is a little excerpt of the standard operating procedure. Again, it has a purpose, a scope, the required components you would need to run the pipeline, definitions, and then a very detailed outline of the procedure and all of the steps you would have to follow in order to execute the pipeline. For our pipeline, and I would say for most pipelines you'll run if you're doing a standard thing in a clinical lab, you will have maybe one or two commands that you launch, and they then proceed to run a whole series of subsequent steps. So the document is really defining how we get the data: how we are alerted to pick up the data from the lab in the first place, what information they send to us, and how we process it. The little excerpt there is a pipeline requisition form, where they have to tell us what panel they ran, whether the sample is a normal sample or a tumour sample, and which version of the chemistry they used, all the information you need, because they run multiple different panels, to decide which steps in the pipeline to use; some commands will be slightly different depending on the panel.
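To make the requisition idea concrete, here is a small hypothetical sketch of how fields from such a form might select pipeline settings. The panel names, field names, and settings are invented for illustration and are not the lab's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class Requisition:
    """Fields a lab might supply on a pipeline requisition form (illustrative)."""
    sample_id: str
    panel: str             # e.g. "amplicon_48" or "hybrid_capture_555" (made-up names)
    sample_type: str       # "tumour" or "normal"
    chemistry_version: str

# Hypothetical per-panel settings; a validated pipeline would lock these down
# in its signed documentation rather than in an ad hoc dictionary.
PANEL_SETTINGS = {
    "amplicon_48": {"aligner": "bwa", "caller": "somatic_caller_A", "trim_primers": True},
    "hybrid_capture_555": {"aligner": "bwa", "caller": "somatic_caller_B", "trim_primers": False},
}

def pipeline_config(req: Requisition) -> dict:
    """Map a requisition onto the concrete steps and arguments the pipeline should use."""
    if req.panel not in PANEL_SETTINGS:
        raise ValueError(f"{req.panel} is not a validated panel")
    config = dict(PANEL_SETTINGS[req.panel])
    # Germline-only processing for normals, somatic calling for tumour samples.
    config["mode"] = "somatic" if req.sample_type == "tumour" else "germline"
    config["chemistry"] = req.chemistry_version
    return config

print(pipeline_config(Requisition("S001", "amplicon_48", "tumour", "v2")))
```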
Another thing to consider when you're setting out to build and validate a pipeline is how to lay out your workflow. As an example, if you have a tumour sample and a normal sample and you're interested in somatic variants, you might also be interested in germline variants in the same patient.

One way to look for somatic variants, shown on the left, is to run the tumour and the normal together: they are processed together, which gives more power to the indel realignment and base recalibration steps, so you treat the samples as if they were one sample at that point. Then you use a somatic variant caller, which again takes information from the tumour and the normal together and outputs only calls that it considers to be somatic. That's on the left. On the right, you could alternatively process your tumour and your normal more separately, end up with variant calls for each individually, and then do some kind of subtraction of the blood calls from the tumour calls. That's mostly what the clinical lab was doing, as depicted on the right side, before we took over the pipeline and worked on it with them. There are caveats to both, I would say. You do get more power from processing the samples together as one, it's true, but in each case you can potentially miss some variants in either the germline or the somatic category.

This slide tries to show that. When you filter the variant calls from a blood sample, looking for germline variants, you'll have some kind of filter on frequency; in our lab it's 20 percent. It's generally considered that if a variant is above 20 percent, it's likely to be germline. Now, if you process the blood and tumour together for somatic calling, and say you had a variant present in your blood sample at 5 or 10 or 15 percent, the somatic caller would not output that variant in the tumour as a somatic call, because it sees it in both samples and says, no, that's a germline call. But in your germline data you would have filtered it out as not being a germline call, based on the frequency cutoff you're using for calling something germline. So if you didn't consider that and do something else, you would potentially lose it; maybe it's a variant that is actionable, whether it's germline or somatic, and you'd miss that call. So in our case we also process the tumour separately, and you have to go back and ask, did I see anything in the tumour if I look at it alone? There are a lot of steps involved, and mostly now we've moved away from running the normal and tumour together, because of these caveats and also for cost.
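To illustrate that caveat, here is a toy sketch of the two filters, using the 20 percent germline threshold mentioned above. The variant records are made up, and the simplistic "present in the normal at all" rule stands in for what a real paired somatic caller does.

```python
# Toy variant records: (position, VAF in blood/normal, VAF in tumour). Placeholders only.
variants = [
    ("chr1:1000", 0.48, 0.45),   # typical heterozygous germline variant
    ("chr2:2000", 0.15, 0.40),   # present in blood, but below the 20% germline cutoff
    ("chr3:3000", 0.00, 0.30),   # clean somatic variant
]

GERMLINE_VAF_CUTOFF = 0.20  # the lab's germline frequency threshold from the talk

# Germline filter: keep only blood variants at or above the cutoff.
germline_calls = [v for v in variants if v[1] >= GERMLINE_VAF_CUTOFF]

# Stand-in for a paired somatic caller: anything seen in the normal at all is
# treated as germline and suppressed from the somatic output.
somatic_calls = [v for v in variants if v[1] == 0.0 and v[2] > 0.0]

missed = [v for v in variants if v not in germline_calls and v not in somatic_calls]
print("germline:", [v[0] for v in germline_calls])
print("somatic: ", [v[0] for v in somatic_calls])
print("missed by both filters:", [v[0] for v in missed])   # chr2:2000 falls through
```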
There's also documentation of the workflow and the data flow. This is an image from one of our documents showing how the data moves through our systems: from the sequencer it gets written directly to one of our clusters, and then it gets moved and processed in various locations. All of that is documented and described in more detail in the documents on file with the clinical lab. It also depicts the way the data is archived and backed up, and at what frequency, depending on what state it's at. Your raw data files are really small files and you need to sync them nightly; we do that at first, but you wouldn't want to keep doing that forever just to back up files that aren't changing, so once we're finished with the processing we bundle them up and move them to a more permanent archive, where we don't have to move all those files back and forth all the time. Things like that make perfect sense, but you need to state that this is the standard way we're going to do it, and why, and all of that is written down.

When you go to validate the panel, there are lots of different places you can get data in order to calculate your sensitivity and specificity and to determine whether the methods you're using for variant calling are the best ones, as was described earlier with false positives and false negatives. This picture shows most of the different ways you can get data for these calculations. You can use data generated inside the lab: if they have another test they've been using, we'll pull in all of the variants they've found with a PCR-based test or with Sanger sequencing in the past, pull out the samples with known variants from those techniques, run them on the new panel, and see how many of those we call. That's a good way to work out true positives, false positives, and false negatives. We also use data from the Coriell cell line that was shown before, NA12878; that's a very standard cell line with a lot of well-defined variants in it. One problem we sometimes have is that if you're using a smaller gene panel, there's a very limited number of high-quality variants in that data set within that small number of genes, but as I said before, you can use a larger panel of the same technology, and that can increase the numbers in your validation. There's another cell line with a lot of data as well, NA19240, similar to the other one but with a lot less high-quality data. With both of those cell lines, though, it's still a problem that they have a certain set of variants you can be confident in as high quality: if you call those, you can say yes, I did a good job calling those, but you will certainly call other variants outside of that set. Some of them might match up with the low-quality data set they provide, but some won't; you'll call some that aren't in the set, and some in the set you won't call, and how do you know which are false positives and which are false negatives? It becomes very difficult very quickly to do a full calculation, but true positives are not so hard to find and to call.
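For the calculation itself, here is a minimal sketch of comparing pipeline calls against a known truth set, for example variants confirmed by Sanger or the NA12878 high-confidence set restricted to the panel regions. The variant identifiers and counts are placeholders.

```python
# Variants confirmed by an orthogonal method within the panel regions (placeholders).
truth_positives = {"chr1:1000 A>T", "chr2:2000 C>G", "chr3:3000 del", "chr4:4000 G>A"}

# Variants the new NGS pipeline called in the same samples (placeholders).
pipeline_calls = {"chr1:1000 A>T", "chr2:2000 C>G", "chr5:5000 T>C"}

# Sites assayed where the orthogonal method found no variant; needed for
# specificity, and in practice much harder to enumerate than true positives.
confirmed_negative_sites = 10_000

tp = len(truth_positives & pipeline_calls)   # found and expected
fn = len(truth_positives - pipeline_calls)   # expected but missed
fp = len(pipeline_calls - truth_positives)   # called but not expected
tn = confirmed_negative_sites - fp           # assayed sites correctly left uncalled

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"sensitivity = {sensitivity:.3f}, specificity = {specificity:.5f}")
```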
Another thing you can do is use in silico data sets. We've often taken data we generated from a particular panel and spiked variants or copy-number changes into the BAM file, and then used that to test the caller. There are caveats to that as well, because errors are introduced by the software itself, so it's not a perfect system either. But if you use all of these tools together, all of the different data sets you can gather, all the pieces of information end up being a pretty solid set of data, and you can come up with a pretty good sensitivity and specificity result for your documentation. If you tried to use just one or two of these, you would find there are a lot of holes.

Things to consider for clinical NGS testing start with the design of the panel, and really, what is the goal in the first place? Are they going to be running blood samples or solid tumours, solid tumours versus myeloid malignancies? Those need different gene sets, but potentially also a different panel. You could use a different panel for a lot of the leukemias: they have a very defined set of variants that they're interested in and that are actionable, so traditionally they could use a smaller amplicon-based panel and look specifically for only those variants, because that's all they're interested in. If they're really only going to report on five or ten genes and only about twenty variants within those, then maybe that's a good way to go: the cost is lower and you have a lot less work to do on your validation. Those are good things. But if they're interested in profiling solid tumours or other things that are much more unknown, or if there could be more clinical trials in the future adding more actionable genes as you go, then a larger panel will take longer to validate but might actually be more useful in the end. Also: are they interested in germline variants, somatic variants, or both? And at what cost, meaning is it something covered by the Ministry of Health? There's definitely a cost aspect.

Then, with validation, what is the gold standard? I talked about using data from other tests in the lab as the true positives and true negatives for validating the NGS test, but those other methods have problems of their own. The sensitivity of Sanger is much lower than the sensitivity of NGS, so if we're looking at variants in a tumour that we want to call down to five percent or even less, there's not really a good way to check those with Sanger.

We also need to understand the different parameters of the algorithms we're using, how changing them might change our results, and why we made those changes. Our method has always been to start with the caller's defaults, then change certain things in order to improve our sensitivity and specificity, and document why we made the change. Many of these variant callers we've been talking about, or even GATK when you're processing the BAM, have a million different arguments you could use; you could tweak a million different things, and how do you know which ones to change? You really can't change them all, and you can't effectively test them all; we don't have a good enough known data set to run through all the possibilities and figure out programmatically which is the right one, which would be ideal. So all you can do is start with the recommended, generally used settings and then see what happens to your data if you manipulate a few that make sense when you read what they're about.
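As a sketch of that "start from defaults, change a few things, document why" approach, here is a toy comparison of caller settings against a known variant set. The parameter names, the per-setting call sets, and the truth set are all invented; in reality each call set would come from rerunning the caller on the validation samples, which is the expensive part.

```python
# Placeholder truth set from the validation data.
truth = {"chr1:100 A>T", "chr2:200 C>G", "chr3:300 del"}

# Each candidate records what changed from the defaults and, importantly, why.
candidates = [
    {"name": "defaults",
     "settings": {"min_base_quality": 20, "min_vaf": 0.05},     # invented parameter names
     "rationale": "recommended starting point from the caller documentation",
     "calls": {"chr1:100 A>T", "chr2:200 C>G"}},
    {"name": "lower VAF cutoff",
     "settings": {"min_base_quality": 20, "min_vaf": 0.02},
     "rationale": "recover low-frequency somatic variants missed at the default cutoff",
     "calls": {"chr1:100 A>T", "chr2:200 C>G", "chr3:300 del", "chr4:400 G>A"}},
]

for c in candidates:
    tp = len(truth & c["calls"])              # true positives against the known set
    fp = len(c["calls"] - truth)              # extra calls not in the known set
    sensitivity = tp / len(truth)
    print(f'{c["name"]}: sensitivity={sensitivity:.2f}, false positives={fp} '
          f'({c["rationale"]})')
```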
I would also just add that large panels are really complex to validate; I think I said that before. The largest one we've validated so far is a 555-gene panel, and that validation document is almost 300 pages long, so it gets to be a little unwieldy. For that panel I'll talk a little about the validation and some of the issues that came up. These are genes related to cancer, and the panel is generally intended for screening cancer patients for clinical trials.

To validate this panel we used a smaller panel that had previously been used in the lab, the TruSeq Amplicon Cancer Panel. That's an amplicon-based panel with only 48 genes, which is not a great comparator for a hybrid capture panel of 555 genes, so we had to think of another way; there would be far too many holes in that data for us to be confident in our validation using only that known set, even if we added in the Coriell cell line NA12878. I believe there are about 600 variants in the high-quality data set within the regions of this panel; that's a pretty good number, but when you're talking about that many bases you want to call variants on, it's actually kind of small. So as a way to add more data we used another panel, the Comprehensive Cancer Panel, which is 409 genes run on the Ion Torrent technology, a totally different technology from the Illumina system. That was done on purpose, so we could be confident in anything called by the two methods together, but it added a lot of complexity in deciding what was correct when they disagreed. We did also try some synthetic data, but as I mentioned before, the tools you can use to manipulate the files have a lot of errors and problems themselves. It's something you can try, and I think these tools will probably continue to improve, but at the time, several years ago, it was a frustrating process to get them to work; they just added errors of their own.

Another thing to think about for this validation is how you compare variants, particularly between the two technologies, an amplicon-based panel versus a hybrid capture panel; there are a lot of issues. We talked about using MiSeq Reporter versus the GATK methods: they don't all even call the same variants the same way. For example, you can have different justifications of a variant. If the reference is CAG and the A is the deletion, you can report that in three different ways; they all have the same meaning, but you can't easily compare them, especially when you're trying to do it over many samples and that many genes. You have to build a programmatic way to compare the VCF files, and if you don't first correct for this kind of issue, you won't be able to do that effectively. Even if you do, you can get stuck in the weird areas, lost in the weeds as they say, of the aligners of these two different systems. Here the same sample was run on the Illumina panel and with the Ion Torrent method: Illumina clearly calls it as a deletion of all of those bases, whereas on the Ion Torrent it probably still is a deletion, but when you see it in the VCF it comes out as a series of SNVs. You can't really normalize that away, so it's difficult to compare these VCFs automatically by a program, which is what you would want to do.
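For the simpler of these representation problems, differing justification of the same indel, a normalization step helps before comparing VCFs. Below is a minimal sketch of the trim-and-left-shift idea against a toy reference string; tools such as bcftools norm or vt normalize do this properly against the genome, and the reference and coordinates here are only for illustration (the variant is assumed not to sit at the very start of the sequence).

```python
def normalize(seq, pos, ref, alt):
    """Left-align and trim a variant (the bcftools/vt 'normalize' idea).

    seq is the reference sequence as a plain string, pos is a 1-based position,
    ref/alt are the REF and ALT alleles as they appear in a VCF record.
    """
    ref, alt = list(ref), list(alt)
    changed = True
    while changed:
        changed = False
        # Rule 1: if both alleles end with the same base, drop that base.
        if ref and alt and ref[-1] == alt[-1]:
            ref.pop()
            alt.pop()
            changed = True
        # Rule 2: if either allele is now empty, extend both to the left.
        if not ref or not alt:
            left = seq[pos - 2]          # base immediately before pos (1-based)
            ref.insert(0, left)
            alt.insert(0, left)
            pos -= 1
            changed = True
    # Rule 3: trim identical leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref.pop(0)
        alt.pop(0)
        pos += 1
    return pos, "".join(ref), "".join(alt)

reference = "GGCAGTT"                        # toy reference; the A at position 4 is deleted
print(normalize(reference, 3, "CA", "C"))    # -> (3, 'CA', 'C')
print(normalize(reference, 4, "AG", "G"))    # -> (3, 'CA', 'C'), the same variant
```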
Another issue is that there are lots of artifacts in sequencing generally, especially when you get into larger panels. One thing we did was to plot them all to see which ones recur. This is all of the variants called on the Comprehensive Cancer Panel across over a hundred samples for this validation. We know that the Ion Torrent in general calls more artifacts, especially around homopolymers and so on, so we asked how we could find these and pull them out without having to sift through everything by hand. In the plot, the number of times a variant was called is on the x-axis and its frequency is on the y-axis, so anything at the bottom right is called in almost every sample at a really low frequency, and anything at the top right is called in every sample at close to 100 percent. You can also see the band that comes right through the middle, which is what you would expect: there are going to be a lot of real variants at 50 percent and at 100 percent. We're not so interested in those; they're variants that are common in the population, and we want to filter those out as well.

Another thing is to look at indels versus SNVs, so they're coloured differently here: the SNVs are in pink and the indels are in green. How does that help us determine which of these pockets on the plot are artifacts? Our goal was to pull variants out of this data and make a list of them that we would later filter out of all of our data, and the question is where exactly to draw the cut-off lines; whether something is an indel or an SNV doesn't fully answer that on its own. So the next step was to remove things that are very common in the population, and you can see that clears out the band at 50 percent and the ones right across the top at 100 percent, which helps a lot. In the end it's really the bottom right-hand corner, and perhaps moving upwards from there, where the variants present in almost every sample at a low frequency are most likely artifacts. Where to put the frequency cut-off is the question, and we don't have an answer for that; it's something we're still working on, actually: how do you determine a real artifact list programmatically? In our lab the variant annotators make their own list as they go, so we're trying to help them with that, because it's very time consuming when you're talking about a large panel.

I'll also talk a bit about some other pitfalls of the variant callers that matter for validating your clinical panel. No variant caller is perfect. Some of them have more filters and are more stringent, but then you're more likely to miss a real variant; if they have more relaxed criteria, you might call something that isn't real, and how do you go back and figure out whether it was real and whether a clinician can report it out? One of the major problems we've come across is that many callers will only call one variant per genomic location. That's good in a way, because it helps remove some false positives: if there's a spot with a few reads of one change and many of another, it won't call two variants at that position, which would otherwise cause problems. But it is also problematic, as we'll see here. If there are two variants at the same position, there are two possibilities: one, that both variants are real, which could be clonal variation in a tumour; or two, that one variant is real and the other is an artifact due to some kind of homologous region or a sequencing error or something else.
This is an example, and I don't know which is the answer here. There's a SNP and then an indel right after it. It's not a particularly repetitive region, though there is a little bit of repetition there, so I actually don't know the answer in this case, but the SNP is at about 16 percent and the indel was at a higher percentage, I forget exactly. All of our variant callers reported the insertion; I believe one of the four that we use called the SNP as well. The SNP actually has a high population frequency, so it's probably a real variant, or it could be that in this case there really is also an insertion there; I don't know. This was actually something one of the lab technologists called me about and wanted me to look at, because the VCF we provided only had the insertion, but when she reviewed it in IGV, as they do, she asked why the SNP wasn't there. In this case, she said, it doesn't matter, because it's a population variant and we're okay with that, but they're very concerned about what would happen in a similar case at some other location.

Another issue is complex variants, which can sometimes be reported by variant callers as multiple variants on separate lines. When they're close together, are they really one more complex variant? Here there are two deletions separated by only three bases; most likely it's one much larger deletion of the whole region with a little bit of an insertion put back. There's not really any way to fix this programmatically; some variant callers do a better job of making a combined call over a region like this and some don't, so it's something else to look for when you're evaluating variant callers.

And then there's a third case we've come across, again with a complex variant, where we found it can matter how the aligner aligns it, and it's kind of random and by chance. In this case there's a deletion of the sequence in the box plus a change to an A. This is the same sample run on the same panel, but the top was run on the MiSeq and the bottom was run on the NextSeq. Is that the reason they're aligned differently? I'm not entirely sure; it could have been entirely random chance, although I think it's more likely to happen on the NextSeq, because when you run things on the NextSeq or other sequencers with multiple lanes, the sample is split across the lanes. We process each individual lane, and only one lane has the alignment shown at the bottom, but it has been processed multiple times and it never changes; it's always the same, and somehow it overwhelms the other and you end up with this. So why does it matter? They both end up describing the same sequence, the same change, if you work it out. But in the case of the bottom panel, the caller we primarily use can't call two variants at one position, and the deletion would be called at the same position as the SNP, so we only end up with the SNP.

I'll briefly talk about electronic medical records and how this relates to NGS data. Clinical labs have a standard format for sharing and integrating data across clinical systems called HL7, Health Level Seven. It's been used for 20 years; it's not encrypted, it's really just plain text, and the little panel I have at the bottom is literally what a message looks like. It has a really complex structure, so I've colour coded it by sections. Each one starts with three letters: MSH, and you'll also see PID, PV1, ORC, OBR, OBX. Each section then has multiple fields where you can put different pieces of information, and those fields are very precisely defined. For example, PID-5 is the full name, Mr. John Doe; if you count across, that's PID-5, and so on for all of the fields.
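Because HL7 v2 is just delimited plain text, pulling a field like PID-5 out is mostly string splitting. Here is a toy sketch with a made-up message; real interfaces use proper HL7 libraries and handle escaping, repeated fields, and the MSH numbering quirks that this ignores.

```python
# A made-up HL7 v2 fragment: segments on separate lines, fields separated by '|',
# components within a field separated by '^'. Real messages use carriage returns
# as segment separators and need proper escape handling.
message = (
    "MSH|^~\\&|LAB|HOSPITAL|EMR|HOSPITAL|202001011200||ORU^R01|12345|P|2.3\n"
    "PID|1||123456^^^HOSP^MR||Doe^John^^^Mr.||19700101|M\n"
    "OBX|1|ST|TEST123^NGS variant^LOCAL||Detected||||||F\n"
)

def get_field(message, segment_name, field_number):
    """Return field N of the first matching segment (PID-5, OBX-5, and so on)."""
    for line in message.strip().splitlines():
        fields = line.split("|")
        if fields[0] == segment_name:
            return fields[field_number]
    return None

patient_name = get_field(message, "PID", 5)        # 'Doe^John^^^Mr.'
family, given, *_ = patient_name.split("^")        # components within the field
print(given, family)                               # John Doe
print(get_field(message, "OBX", 5))                # the (very limited) result value field
```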
This is a snippet of the manual for this standard, showing what information you can put into the OBX segment, which is essentially the result section. If you look it over you'll notice things like PDF and image/JPEG, so you can put files of that nature into the system and it will transmit them back into the clinical EMR, but there's not a whole lot else you can put in there. There's a specific field for a test result: if you get a blood test back with a level of whatever it is they measured, you can put that result into the system, but there's not really much space for NGS results here, and even one individual's genetic data is really large and complex and needs a lot of curation.

So there's a new standard emerging called FHIR, Fast Healthcare Interoperability Resources, a specification that came out in 2014, and it's supposed to help with this problem. In this example you can see there's a sequence, and it can be broken down into parts; basically there are about four different ways you can reference back to that sequence, for example by the RefSeq identifier or by the change itself. They have different methods to take snippets of the genetic information from one of these tests and encapsulate them into a message that can be put into an EMR. It's still missing quite a lot, though: you can't capture the whole VCF file and you can't capture a whole BAM. If you've done any reading or worked with this data already, you'll know that you've probably already had to reprocess files; things change very quickly, and you don't necessarily want to permanently record one snippet from somebody's NGS result as the permanent record. What if we later have a new method, a new way to filter noise out of the sequencing data, and we want to go back and reassess that file? How would we do it if this snippet were all we had in the electronic record? The idea is that eventually you probably want to attach the entire NGS result to somebody's EMR, and how to do that is an unanswered question.

Here's what really demonstrates the problem. This is showing the growth of DNA sequencing from 2000, when things first started, with a projection of what will be happening by 2025. I think we've all seen similar figures before, and although if you plotted the use of NGS in clinical labs and clinical tests the numbers on the axes wouldn't be the same, it would probably follow a very similar trajectory, and that's what we have to think about in the clinic. The projected growth of genomics data is up to one zettabase per year. What is a zettabase?
It's 10 to the 21 bases, so we're talking about a huge, huge number, and in terms of the amount of storage that will be required, it's 20 to 40 exabytes per year; an exabyte is 10 to the 18 bytes, again an outrageous number. So we're going to have a problem, and this is what we have to think about for the future: how are we going to capture this data and link it to EMRs, but even more than that, how are we going to manage the data in the first place? Already, when we talk about clinical labs, we have to store the data and archive it, for how many years is an unknown, and how can you go and retrieve a particular patient's record if they want to go back and look at something again?

One method being considered, which partially solves the problem of just dealing with the data itself, is something called object storage. Instead of a traditional file system, where you would open the explorer and see the hierarchy of where your files are organized, object storage is more like a database for the metadata of your files. You put a file in, all the files live together, and they have tags that you can search like a database: I want to find the files with this identifier, this patient, whatever, and it tells you the ID of your file. It's not a hierarchical structure for the data, and it's much more manageable for retrieving a file easily and in a timely manner.

So I'll finish there by saying that it's time consuming to translate research tools into the clinical setting, but if you put in the effort you get a good payoff in quality and reproducible results. These things are being used in clinical care more and more, and that's only going to continue, so we all have to work hard to put that effort in up front so that we can have a good result later on. Thank you.