Hi, my name is David Masica and I'm a postdoctoral fellow in Rachel Karchin's lab at Johns Hopkins University, in the Department of Biomedical Engineering and the Institute for Computational Medicine. In this video I'll be presenting, on behalf of my collaborators and myself, a study that was recently published in Human Mutation. So let's get started. The paper is entitled "Phenotype-Optimized Sequence Ensembles Substantially Improve Prediction of Disease-Causing Mutation in Cystic Fibrosis." In this study we developed a novel computational method to predict cystic fibrosis disease liability from genetic mutation. Virtually all patients with cystic fibrosis have mutations in both copies of their CFTR gene, which encodes the cystic fibrosis transmembrane conductance regulator protein. So specifically, we are predicting cystic fibrosis from CFTR missense mutations. Now, a common computational approach to this sort of problem is a sequence-based approach, where you take a group of genes including your gene of interest, align their sequences, and then from the multiple sequence alignment infer things about substitution tolerance at different positions in your gene of interest. Sometimes these methods work reasonably well, but a common challenge is determining in advance which sequences to include in your multiple sequence alignment. That is, what set of sequences will allow your method to best predict the phenotype you are trying to predict? So we tried to address that challenge in this work by developing what we call Phenotype-Optimized Sequence Ensembles, or POSEs. We begin by getting a multiple sequence alignment for CFTR orthologs and paralogs from the UCSC 46-way genome-wide vertebrate alignments, and this gives us a total of 547 CFTR orthologs and paralogs. The POSE score function considers three properties when scoring an amino acid substitution.
The first property just describes the amino acid conservation at a particular column in the alignment. We also consider the amino acid chemistry conservation, and we also consider the molecular weight. The POSE algorithm begins with an initial sequence pool, here those 547 CFTR orthologs and paralogs. It takes a random ensemble of those sequences, then, using those sequences and the score function, it scores a set of CFTR mutations of known cystic fibrosis disease liability. Now, since we know the disease liability of the mutations we are scoring, we can calculate the predictive value of our predictions. And in this case we are calculating the sensitivity and specificity. We repeat this entire process 25,000 times, but, and here is the magic, every 100 iterations we repopulate the initial sequence pool with the top 1% of ensembles based on their predictive value. So you can imagine that over time the sequence pool from which we are drawing our random ensembles becomes enriched for the sequences that best allow us to predict our phenotype, i.e. cystic fibrosis. So it's from this process that we derive POSEs, which we can subsequently use to predict the disease liability of other CFTR missense mutations. To develop our CFTR POSEs we needed a training set of known CF-causing mutations and mutations thought to be disease neutral. To get the disease-causing mutations we went to the CFTR2 website, and we curated the disease-neutral mutations from the literature. So we used this training set of mutations to develop the POSEs, but then we needed a second set of disease-neutral and disease-causing CFTR mutations to test how well our method worked. We found the test set of mutations in a 2010 Clinical Genetics paper by Dorfman et al. And when we use the POSEs obtained during training, along with our score function, to predict blindly on these test set mutations, here's how well the algorithm discriminates between CF-causing and CF-neutral mutations.
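As a rough illustration of the optimization loop just described, here is a minimal Python sketch. It is not the published implementation: the score function is a toy that looks only at whether the mutant residue appears in the alignment column (the real one also uses chemistry and molecular weight), and the function names, the fixed ensemble size, the decision threshold, and the rule of repopulating by appending copies of the top ensembles' sequences to the pool are all assumptions made for the example.

```python
import random


def score_substitution(ensemble, position, mutant):
    """Toy score (assumption): fraction of ensemble sequences that do NOT
    carry the mutant residue at this alignment column."""
    column = [seq[position] for seq in ensemble]
    return 1.0 - column.count(mutant) / len(column)


def sens_plus_spec(ensemble, mutations, threshold=1.0):
    """Sum of sensitivity and specificity on mutations of known liability.

    Each mutation is (position, mutant_residue, is_disease_causing).
    A mutation is predicted disease causing when its score >= threshold.
    """
    tp = fn = tn = fp = 0
    for position, mutant, is_causing in mutations:
        predicted = score_substitution(ensemble, position, mutant) >= threshold
        if is_causing:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity


def optimize_ensembles(pool, mutations, ensemble_size, iterations=25000,
                       batch=100, top_fraction=0.01, seed=0):
    """Draw random ensembles; after every `batch` draws, enrich the pool with
    the sequences of the best-scoring ensembles from that batch."""
    rng = random.Random(seed)
    pool = list(pool)  # may hold duplicates; duplicates are drawn more often
    batch_results = []
    best_value, best_ensemble = -1.0, None
    for i in range(1, iterations + 1):
        ensemble = tuple(sorted(set(rng.sample(pool, ensemble_size))))
        value = sens_plus_spec(ensemble, mutations)
        batch_results.append((value, ensemble))
        if value > best_value:
            best_value, best_ensemble = value, ensemble
        if i % batch == 0:
            batch_results.sort(key=lambda r: r[0], reverse=True)
            n_top = max(1, int(len(batch_results) * top_fraction))
            for _, top in batch_results[:n_top]:
                pool.extend(top)  # assumed form of "repopulating" the pool
            batch_results = []
    return best_value, best_ensemble
```

Calling `optimize_ensembles(pool, training_mutations, ensemble_size=...)` on a list of aligned sequences returns the best sensitivity-plus-specificity value seen and the ensemble that achieved it; that ensemble plays the role of a POSE in this toy setting.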
Figure A is a ROC curve, and you can see we get a good AUC of 0.84, and the strip chart in figure B shows good separation between the CF-causing and CF-neutral predictions. This next table shows the predictive value obtained by our method on the test set mutations, namely the sensitivity, specificity, positive predictive value and negative predictive value. And taking these numbers together, you can see that we do a pretty good job of accurately calling the true positives and an excellent job of accurately classifying the true negatives. So next we wanted to compare our method to existing methods predicting on these same test set mutations. So here's the predictive value we got for SIFT: you see a really high sensitivity but a much lower specificity for this particular system. And we get a similar result for PolyPhen-2, a very high sensitivity and low specificity, and a slightly more balanced result using PANTHER. And I want to point out here that we're getting this really well-balanced result probably because we optimized on the sum of sensitivity and specificity. We could have chosen any one of these predictive values to optimize on. Next we wanted to see if the increased performance achieved by our method came from the sequence optimization or the new score function. So we used our score function, but on multiple sequence alignments that didn't arise from optimization. So using CFTR orthologs only, our scoring function got good sensitivity but pretty low specificity. We then scored each of the 12 paralogous groups independently, and the ABCC9 paralogous group gave us the best mix of sensitivity and specificity. Here you see decent sensitivity again and an increased specificity relative to just looking at the orthologs. And here is the result using all homologs. We take a sensitivity hit and get a big increase in specificity.
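For reference, the four predictive values and the AUC mentioned above can be computed from a list of known labels and predicted scores along the following lines. This is a generic sketch in plain Python, not the code used in the study; the function names are hypothetical, and the AUC is computed with the standard pairwise-ranking formulation (the fraction of causing/neutral pairs in which the causing mutation gets the higher score, with ties counted as one half).

```python
def predictive_values(labels, predictions):
    """Sensitivity, specificity, PPV and NPV from boolean labels/predictions.

    labels[i] is True for a disease-causing mutation; predictions[i] is True
    when the method calls the mutation disease causing.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    tn = sum(1 for y, p in pairs if not y and not p)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

Optimizing on the sum of sensitivity and specificity, as described above, would correspond to maximizing `values["sensitivity"] + values["specificity"]` from the dictionary returned by `predictive_values`; any of the other entries could be used as the objective instead.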
So if you look at these last three examples, you see that we most likely get more sensitivity out of the orthologs and a little bit more specificity out of the paralogs. And it would appear that our optimization method is finding the sequences within these groups that best balance the sensitivity and specificity. So last, we wanted to see if POSEs optimized to balance sensitivity and specificity could be used with other methods to also balance sensitivity and specificity in their predictions. So we did that for both SIFT and PolyPhen, and what you can see, indeed, in both cases is at least a slightly better balance of sensitivity and specificity relative to those methods using their native multiple sequence alignments for CFTR. So this suggests that POSEs could also be useful just to develop the multiple sequence alignment, which you can then go ahead and use with existing scoring functions. So we thank you for watching the video and hope you'll check out the manuscript in Human Mutation. Goodbye.