Hello, my name is Aaron Lee. It's a real privilege and an honor to give this Grand Rounds to you, and I have to apologize for giving it virtually. There was a misunderstanding on my part about when I needed to be there in Salt Lake City, and by the time I realized it, it was too difficult to get call coverage for Tuesday night. So again, I apologize for giving this virtually, and if all is going well I should be arriving in Salt Lake City as you listen to this Grand Rounds. These are my financial disclosures. I'm going to break this talk down into four sections. The first is an introduction to some of the concepts that I'll be talking about, and the second will be the beginnings of how we got started in this area of artificial intelligence and deep learning. The third will be about what I think is a very important study that we did around autonomous AI, and then finally I want to end by talking about future directions. So I hope all of you know that we are firmly living in this era of big data and machine learning. We as a human society are living in a very different time than we were even 20 years ago. What I mean by that is that the information flowing all around us every day is being captured, harnessed, and then used to train and manipulate very powerful machine learning algorithms. These algorithms are directly or indirectly influencing the things that we see and perceive in the world, and it is really the first time that human society has been so heavily influenced by computer algorithms. I always like to start these talks by disambiguating some of these terms. AI, machine learning, and deep learning are unfortunately used interchangeably in the news media, as if they were synonymous, but they actually have very formal definitions. The field of artificial intelligence is actually very, very old. 
It has been around since really the dawn of modern computing, and it is a field that describes the idea of intelligence being formed by something man-made. A field within artificial intelligence is machine learning, and there's a subfield within that called deep learning, which is relatively new. So what is deep learning? Well, to understand that you have to understand what an artificial neural network is. This is a class of machine learning algorithms that were very popular in the 60s and 70s, but at the time a traditional artificial neural network could not really be extended to more than three layers. In my opinion there are four advances that led to the birth of what we call deep learning today. The first was the realization that computer graphics cards could be used to vastly accelerate linear algebra operations. It turned out that the mathematics behind rotating and rendering polygons in three-dimensional space for computer games was very similar to the math used to train deep learning models, so all of a sudden people had in their computers a mathematical co-processor that could be harnessed to train these very large deep learning models. The second was the use of filters known as convolutional filters, which could exploit the local coherence between pixels. These filters had actually existed since about the 1980s, but they could never be implemented because they were too computationally complex, and so when GPUs became available to accelerate these operations, convolutional filters became feasible to implement. 
The third was the use of non-linear activation functions, and the final one was that our computer hardware and architecture had evolved to a point where the storage of large data sets became possible. So you had both the computational means and the capacity to store and process large data sets that could be used to train deep learning models. That all led to this very exciting time that we live in today, where as we increase the size of the deep learning models and the amount of data that they are fed, the performance just seems to keep increasing. This is one of the reasons why even today, every hour there's a new article about ChatGPT or the large language models doing something that had never been possible before. This gets to this time that we live in, which people are calling the fourth industrial revolution, where these AI methods are powering pretty much everything that we do. Behind every email that we write and every search engine query that we type in, there are very powerful machine learning algorithms driving the content that we read and see. On the biological side, we've reached a point where molecular biology and the omics world are able to generate huge data sets. So we have this convergence of oceans of data available to us, coupled with the AI methods to analyze them.
So I want to walk you through our journey into deep learning. This goes back to when I joined the University of Washington. One of the first things I did was extract all the OCT imaging that was available to us at our hospital, which came to about 5.5 million OCT B-scan images on about 16,000 patients over more than a decade. Then from the EHR I had access to all the clinical variables: I could match up each OCT to the visual acuity, the diagnosis, whether they had laser or intravitreal injection, and the expert OCT interpretation. For a while I was sitting on this
data set not really knowing what to do with it, because at the time there weren't really scalable algorithms that could be used to collect and sift information from these OCT images. One of my friends told me that I should try this thing called deep learning, and I said, "Oh, that's a great idea, but I really don't think it's going to work. I think it's just a fad and it will probably just pass." He said, "No, no, no, you should really try it. Let me put you in touch with some of my friends at NVIDIA." I told NVIDIA about my data set, and they said, "That sounds amazing. Why haven't you tried deep learning yet?" I said, "Well, I don't have one of your fancy graphics cards," and two days later they had shipped one of their top-end graphics cards to our lab from Taiwan. We constructed our first deep learning experiment, which in my mind was the easiest problem to get started with: whether deep learning could distinguish between normal OCT images and those from AMD. We constructed a data set of about 100,000 images, we trained a model that was state of the art at the time called VGG16, and we got these results. So about a week after getting that graphics card from Taiwan we were able to achieve these sorts of metrics, and I was blown away. To be perfectly honest with you, I really did not believe that deep learning was capable of doing this. I thought I had made some sort of embarrassing beginner mistake in our experimental setup, and I was very nervous about publishing this. This gets at the problem of these models being black boxes. Even today we struggle with trying to understand why or how these deep learning models work. I dug around the computer vision literature at the time to find some way to understand what the model was doing and how it was able to achieve these incredible AUROCs, sensitivities, and specificities, and I
found this visualization method that I still use even today because I think it's very intuitive. The way it works is this: if I give you this picture and ask you whether there is a ball in it, you would say with a hundred percent certainty that yes, there is a ball in this picture. If I then cover up a very small part of the picture and ask you again whether there is a ball, you would still say yes. But if I keep moving this box around, and at every possible pixel position in the picture I keep asking you whether there is a ball, eventually the box ends up here, and when I ask you whether there is a ball in this picture you'll say, "I'm not sure anymore." So that's what we did. We took OCT images that the model had not been trained on, we systematically occluded every possible position in the B-scan, and we watched what happened to the probability of the model calling it macular degeneration. If the probability dropped a lot, then we would highlight that pixel position with a high value. When we did that on these three B-scan images, we were able to generate these heat maps, which showed that the model was actually looking at areas that were clinically relevant. That made me feel a million times better about the model that we had trained: it wasn't some silly mistake that I had made, but the model had actually learned to distinguish the relevant features between normal and abnormal OCT B-scans. 
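The occlusion procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the names `occlusion_heatmap`, `predict_prob`, `patch`, `stride`, and `fill`, as well as the toy stand-in model, are assumptions introduced here, and a real experiment would call the trained network in place of `toy_model`.

```python
def occlusion_heatmap(image, predict_prob, patch=8, stride=8, fill=0.0):
    """Slide an occluding box over `image` (a list of rows of floats) and
    record, for every pixel, the largest drop in model probability seen
    while that pixel was covered."""
    height, width = len(image), len(image[0])
    baseline = predict_prob(image)  # probability with nothing covered
    heat = [[0.0] * width for _ in range(height)]
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            # Copy the image and blank out one box.
            occluded = [row[:] for row in image]
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    occluded[r][c] = fill
            drop = baseline - predict_prob(occluded)
            # A large drop means the covered region mattered to the model.
            # Starting from 0.0 and taking the max ignores negative drops.
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    heat[r][c] = max(heat[r][c], drop)
    return heat


def toy_model(img):
    """Hypothetical stand-in for a trained classifier: its 'probability'
    is the mean brightness of one fixed 8x8 region of the image."""
    region = [img[r][c] for r in range(8, 16) for c in range(8, 16)]
    return sum(region) / len(region)


# A 32x32 blank image with a bright square where the toy model looks.
image = [[0.0] * 32 for _ in range(32)]
for r in range(8, 16):
    for c in range(8, 16):
        image[r][c] = 1.0

heat = occlusion_heatmap(image, toy_model)
```

In this toy case the heat map is high only over the bright square, because covering any other box leaves the probability unchanged. A smaller `stride` gives a finer map at the cost of one model call per box position.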
From there we went on to our second study, which we published in Biomedical Optics Express and which has to do with the segmentation of intraretinal fluid on OCT scans. Here's an example of the deep learning model, where on the left you see the original B-scan images and on the right you see the model's confidence for the areas that it called intraretinal fluid. At first I thought these results were really exciting, so I went and showed some of the computer vision folks, and they said, "Oh Aaron, this is great, but this is actually a very simple problem. All you have to do is find the ILM, and there are tons of good methods for doing that, and then you find this bright band here, and everything in between that's dark is intraretinal fluid." But if you follow that heuristic, the shadow underneath vessels could of course be erroneously labeled as intraretinal fluid, whereas the deep learning models are able to not be fooled by these dark areas, which are artifacts of the shadows cast by the vessels, and instead correctly label the areas of intraretinal fluid. 
In the paper we showed that the model was as good as clinician-to-clinician variance, but more importantly there was this video that was honestly hard to put into the paper. What we did here is we really slowed down the rate at which the deep learning model learned to do this task. If you remember, this was the B-scan that I just showed you, which has this area of shadow underneath the vessel. If you watch the model learn to do this, it will at first call this area underneath the vessel intraretinal fluid, but as it is fed more and more examples and trained over a longer period of time, you can see it getting better and better, and over time it learns that even subretinal fluid is not intraretinal fluid. But there's something that happened at the beginning of this video that was sort of hair-raising. I'll play it again for you, and then it'll pause. So here, right when it's starting to go on this exponential growth in learning, it actually learned to do something different than what we were trying to teach it to do. If you look, the areas that it's highlighting are not the intraretinal fluid at all; it actually learned the organization of the retina. It realized that in order to learn where the intraretinal fluid was, it had to understand these anatomic boundaries first. What I find amazing about that is that the deep learning model had essentially identified, on its own, a subproblem that it had to solve in order to solve the bigger task of segmenting intraretinal fluid, and I view this as an example of emergent AI behavior. So what do we take away from these first two experiments? We learned that machine learning is always limited by the ground truth. What I mean by that is that if clinicians disagree on whether a certain OCT scan has CSR versus wet AMD, then the model will always be confused. It will never do better than
the average human confusion around the label for the ground truth. But it also seems to have this incredible potential. Like you saw in the video, it has this incredible flexibility to work with different imaging types and also to identify, all on its own, subproblems that it needs to solve. Our lab and others started to think about how you can push AI and deep learning to do more, and that led to the many different experiments and papers that we published afterwards, which showed everything from being able to predict an OCTA scan from regular structural OCT to predicting the outflow facility of the trabecular meshwork from the mires of a Goldmann tonometer. So it has really inspired the field to do quite a bit.
I would love to spend this next section talking about a very important topic in our field, and that is autonomous AI. What I mean by autonomous AI is this: if you think about the self-driving car industry, there are level zero or level one algorithms that help us drive our cars a little bit more safely while the driver is doing everything, all the way up to level five self-driving cars, where you get in the car and you can take a nap and it'll be fine because the car is doing everything completely autonomously. Those same levels of self-driving car automation exist in the medical domain, and they're broken out in this way: level one is data presentation and level two is clinical decision support, and those are the assistive AI algorithms. The autonomous AI algorithms are level three, level four, and level five, where full automation basically means AI algorithms that are being used in populations with no human clinician in the loop at all, and they're making fully autonomous medical decisions. So what's kind of amazing to me and to
many others in the medical space is that somehow in ophthalmology we went from level zero all the way to level five, and we have very few AI algorithms in between, whereas in the field of radiology it's the reverse: they have many AI algorithms that are up to level two but almost no algorithms that go beyond level three. That led Eric Topol, who's a famous medical technologist, to tweet about how most people think radiology is leading the AI movement, but it's really ophthalmology that is leading it. That's because of what's happened in the diabetic retinopathy screening space, where there are now, I believe, three or four FDA-approved algorithms that are fully autonomous and are making clinical decisions at the population level with no clinical oversight. There's something known as the Gartner hype curve, and it's really important to think about where we are on it. It's really unclear to me whether we have reached the peak of inflated expectations with regard to AI and medicine. If anything, this peak seems to be growing ever higher, so we may still be on this ascending limb here. But it's very important at this time, when these algorithms are being deployed, to understand where the real plateau of productivity is. So we embarked on a study a few years ago to compare the performance of seven diabetic retinopathy screening AI models in a real-world screening setting. The reason we did this is that many papers had been coming out from the commercial entities in this space claiming that they had amazing performance, and even if you trusted that they had done everything ethically and had no influence on the final test performance of their models, you still had no way of
comparing algorithm A versus algorithm B in a head-to-head fashion, so we wanted to be able to compare them on an equal playing field. What we did is we reached out at the time to all the companies that we could find that were working in this space, and after discussing with them, five ultimately agreed to be part of our study. We invited every company to submit up to two different AI models for our study. We then went to the VA teleretinal screening system, and from two different sites, one in Seattle and one in Atlanta, we extracted all the imaging that was available and combined it to create a full data set of about 311,000 images that had all been labeled by the original VA teleretinal grader during routine clinical care. In a subset of about 7,000 images, we went through an arbitrated grading by three retina specialists to arrive at a final clinical grade for each of these diabetic retinopathy images. This allowed us to compare the seven different AI models and the original teleretinal grader against this arbitration set. So these are the headline results of our study. We had these seven different algorithms, and we showed that all of the algorithms exhibited high overall negative predictive value. In a screening context this is absolutely what you want: you want to make sure that your negative predictive value is as high as possible. Unfortunately, most of the algorithms showed a low positive predictive value, meaning that there would be a lot of false positives being referred in if this was actually deployed in the clinic. In my opinion, this next one is one of the most important results from our project, where we again used the arbitrated set to compare the VA teleretinal grader directly, in a pairwise fashion, to each of the seven different algorithms, and what we were very pleased to
show was that the VA teleretinal graders are actually doing an amazing job. They had a hundred percent sensitivity for moderate NPDR or higher, and algorithms E, F, and G were statistically similar to the VA grader for moderate NPDR or higher, meaning that they behaved in a way that was almost statistically indistinguishable. You obviously cannot do better than the VA grader, since they had a hundred percent sensitivity, but being indistinguishable from the VA grader in a pairwise comparison on a large data set basically means that they made very few mistakes. So in my opinion, algorithms E, F, and G are safe for deployment in the VA teleretinal screening system. As for some of the limitations, this obviously only really applies to the context of the VA. We did notice that dilation was important for reducing the rate of ungradable images, and that the performance of the algorithms varied very widely despite their having regulatory approval or being clinically deployed somewhere in the world. This gets to the issue that I think it's very important to assess these models with an external, independent validation, especially if you're planning to deploy them and pull the human clinician out of the medical loop altogether. I do believe that these models are very powerful and are ready for deployment, but it makes sense to make sure, in a small subset, that they work in your clinical informatics system at your hospital. One of the criticisms of the diabetic retinopathy AI models is that they are doing an amazing job for diabetic retinopathy, but there's a lot more captured in a color fundus photograph of the retina than diabetic retinopathy. When you're using a teleretinal screening system for diabetic disease, the human readers are actually also screening for other conditions and incidental findings that would require referral. So there's a follow-up study that we're doing in partnership with
the CDC. There were two years of the NHANES study during which they collected retinal images, and all of them were graded by the Wisconsin Reading Center at that time. The goal of this study is to do a more comprehensive screening for not just diabetic retinopathy but also AMD and glaucoma. To do this, nine companies have agreed to send their models for evaluation, and that means shipping a workstation that can work without the internet, connected only to power, to the CDC RDC, where I'm not allowed to even bring a cell phone into the room to do this study. This study is ongoing, and we hope to be able to communicate some of the findings later this year.
I do want to spend a little bit of time talking about future directions, and in particular one study that we are starting up at the University of Washington that I hope will be revolutionary for the field. Take a step back and ask yourself what data sets are ideal for deep learning. Whenever I give these talks, invariably people come up to me afterwards, and they're excited because they have a data set that they've been collecting and working on for many years of their life, and they think deep learning might be the perfect solution to their problems. When I start talking to them about their data set, in my mind I'm constructing this graph where the number of measurements is on the y-axis and the number of subjects is on the x-axis. Almost always, the data sets I hear about are in this quadrant of the Cartesian coordinate plane: a couple hundred or even 500 patients on whom they have done a ton of different measurements. Unfortunately, in machine learning there's something known as the curse of dimensionality, and this makes it very hard to do deep learning research with such data. What you really want is almost the reverse of this:
you want a data set where there are a lot of independent samples and relatively few key measurements. Of course, if we had infinite time and infinite money we'd all be living in the upper right quadrant, but reality often precludes that. So the characteristics of an ideal data set are these: it has to be large; you want diverse pathology captured in it; you ideally want it to be balanced for sex, race, and ethnicity, to make sure that your AI models are not biased for or against any particular sex, race, or ethnicity; and, as I was explaining earlier, the vast majority of deep learning advances have occurred in fields where images, waveforms, or language are used, because in those three domains there's local spatial coherence that the models can exploit. One of the most successful data sets for doing machine learning and deep learning came from an effort in the United Kingdom called the UK Biobank. The UK Biobank was started in 2006 to 2010, when approximately half a million people were asked to come in, and they started to collect all sorts of data over the years. It has really generated one of the most amazing resources for doing data science today. It's been a tremendous success: 15 petabytes of data have been generated, there are 2,000 ongoing research projects around the world, 1,400 papers have been published, and 200 of them have been published in the Nature family of journals. So it has been really impactful, and the field of medicine took a substantial step forward because of the UK Biobank. 
One of the seminal pieces of work in our field came from the Google group, who used the UK Biobank and showed that from a color fundus photograph they could predict age, sex, smoking status, hemoglobin A1c, body mass index, systolic blood pressure, and diastolic blood pressure. Later on they even showed that they could predict the hemoglobin concentration, and therefore the anemia status, from a color fundus photograph. So really amazing work, and it spurred this idea that you can learn so much about the human body through the eye. The eye became a center point for doing deep learning in the field of medicine because the images were easy to obtain, they did not require radiation, and they required instruments that cost on the scale of $10,000 to $20,000. But there are a couple of really big problems with the UK Biobank. This is not really a criticism of the UK Biobank effort but an unfortunate reality of how it was set up. First, to reach the number of half a million they had to do convenience sampling, where they set up recruitment centers in the United Kingdom and took into the study anybody who was willing to walk through the doors. Unfortunately, that led to a healthy volunteer bias: the people who were willing to walk through the doors were often much healthier than the typical person in the UK population. It was also very heavily white British, in fact about 95% white British. To illustrate the healthy volunteer bias, this graph I think speaks volumes. It shows the mortality rate for people in the UK population versus those in the UK Biobank, and on average, a UK Biobank participant's life expectancy was about 10 to 15 years longer than that of the UK population that the participant came from. So clearly everything that we've learned from the
UK Biobank has essentially been on very healthy white British people. The other problem with the UK Biobank is that it has a fairly onerous path to getting access to the data set. When we tried to get access at the University of Washington, it took us about three years before the data actually arrived on our hard drives. So on the criteria known as FAIR, or findable, accessible, interoperable, and reusable, I would say the UK Biobank does a fairly good job at being findable, is decently interoperable, and is obviously very reusable, but is not the most accessible data set in the world. That leads me to the NIH. The NIH saw the success of the UK Biobank and similar efforts, and they realized that there was a real need for the US to set up something similar. They created the All of Us program, which is modeled in spirit very closely on the UK Biobank, and then they created the Bridge2AI program. Bridge2AI is a Common Fund program, and it's charged with generating flagship AI data sets for medicine. So Cecilia and I at UW, during the pandemic, marshaled our vision for this in about two months and submitted an application of about 800 pages, which we thought had close to a zero percent chance of being funded. Somehow, miraculously, we ended up being one of the four data generation projects that was ultimately funded, and the entire program has about 110 million dollars over four years to generate these flagship data sets for medicine. So I will tell you a little bit about our project. It is called AI Ready and Equitable Atlas for Diabetes Insights, and it targets type 2 diabetes as a model disease for generating a flagship AI-ready data set. The ultimate goal is to create a multi-dimensional, ethically sourced data set in diverse people for studying salutogenesis in type 2 diabetes. If you're not sure what the term
salutogenesis means, hopefully this will make it a little bit more clear. We spend a tremendous amount of time and money examining pathogenesis, or how you go from a healthy state to a disease state, and we try to create interventions that either slow or halt the progression from healthy to diseased. But we spend very little time thinking about the reverse of this, which is known as salutogenesis: the promotion of the human body back to a healthy state. To study this, we assembled a very large team spanning the United States. The NIH broke up Bridge2AI into six different modules. We have a teaming module led by Mike Snyder, who's the chair of the genetics department at Stanford, as well as Sara Singer, who is a team science professor in the School of Medicine. We have a skills and workforce module led by Linda Zangwill and Sally Baxter at UCSD. We have an ethics module led by Alvin Liu and Kadija Ferryman as well as Megan Collins from Hopkins, and a team at UCSD. We have a tools module led by Bhavesh Patel at CalMI2 as well as folks at OHSU, and a data module led by Cynthia Owsley and Jerry McGwin. You might notice that Joe Yracheta is in several different parts of this project, and that's because he is leading an effort with American Indians on our project. Finally, we have a standards module, led by myself as well as Christopher Chute at JHU, that is trying to find and define representational standards for encoding the data set. So what we're trying to do is borrow an idea from single-cell RNA-seq to study salutogenesis. What happens in single-cell RNA-seq is that a population of cells is sampled and the transcriptome of each cell is characterized individually. 
Then, if you apply machine learning and do a dimension reduction using something known as UMAP, you can often generate these graphs where one side of the graph might be cells in an embryonic stage, and then you have different branches as they differentiate into a terminally differentiated state. They follow along these nice little lines, where two dots close to each other are cells that are very similar in their transcriptomes and two dots that are very far apart are cells that are very different in their transcriptomes. What we're proposing to do is collect a data set that is diverse and wide enough, and that captures different aspects of the human body in different states, to allow the participants to hopefully fall along these manifolds, where one axis will be pseudotime and the other axis will be healthy versus diseased. These manifolds will then allow us to study both pathogenesis and salutogenesis. In order for this technique to work, it's very important that the data set is balanced with respect to the disease that you're interested in, so we broke up type 2 diabetes into four bins. The first is no diabetes at all, the second is lifestyle controlled or pre-diabetes, the third is oral medication controlled, and the fourth is insulin and medication controlled. One of the big goals for our data set is for it to be balanced with respect to race and gender, excuse me, sex. So what we hope to do is collect a data set where a thousand of the participants will be white, a thousand Black, a thousand Asian American, and a thousand Hispanic. Within that, a thousand of them will be normal, a thousand lifestyle controlled, a thousand oral medication controlled, and a thousand insulin dependent, and within that there'll be one-to-one balancing of males and females. 
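The balancing scheme just described implies a simple enrollment grid. The sketch below works through the arithmetic, assuming the four groups of a thousand are mutually exclusive (4,000 participants in total) and that every combination of race or ethnicity, diabetes status, and sex gets an equal share. The variable and function names are illustrative and not taken from the project.

```python
from itertools import product

# Strata as described in the talk; labels are paraphrased.
RACE_ETHNICITY = ["White", "Black", "Asian American", "Hispanic"]
DIABETES_STATUS = [
    "no diabetes",
    "lifestyle controlled or pre-diabetes",
    "oral medication controlled",
    "insulin controlled",
]
SEX = ["male", "female"]
PER_GROUP = 1000  # "a thousand" per race/ethnicity group


def enrollment_targets():
    """Return the target count for every (race, status, sex) cell,
    assuming a fully balanced design."""
    total = PER_GROUP * len(RACE_ETHNICITY)  # assumed total of 4,000
    cells = list(product(RACE_ETHNICITY, DIABETES_STATUS, SEX))
    per_cell, remainder = divmod(total, len(cells))
    if remainder:
        raise ValueError("total does not divide evenly across cells")
    return {cell: per_cell for cell in cells}


targets = enrollment_targets()

# Marginal totals recover the "thousand each" figures from the talk.
per_status = {
    status: sum(n for (_, s, _), n in targets.items() if s == status)
    for status in DIABETES_STATUS
}
per_sex = {
    sex: sum(n for (_, _, x), n in targets.items() if x == sex)
    for sex in SEX
}
```

Under these assumptions there are 32 cells of 125 participants each, which gives a thousand per race or ethnicity group, a thousand per diabetes bin, and an even split by sex.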
We'll have three data collection sites, one at the University of Washington, one at UAB, and one at UCSD, and as I mentioned, Joe is working with his team at Native BioData, and they'll be facilitating discussions with the Cheyenne River Sioux Tribe to collect a parallel data set that will be constructed in a similar fashion. As for the variables that we're actually collecting: of course we'll collect a medical history, we'll take vitals, and we'll get a whole host of different blood work and urine tests. We'll perform surveys, including social determinants of health and depression, we'll do a battery of cognitive function testing, and we'll obtain a 12-lead EKG. We'll test different elements of visual function, including visual acuity and low-contrast sensitivity, and we'll obtain many different forms of ocular imaging, including color fundus photography, OCT, OCTA, and hopefully even FLIO. Of note, for the first three of these imaging modalities we're going to use devices from three different manufacturers, in order to hopefully build a translational data set that will bridge these different devices. We're going to send participants home with an Apple Watch to collect activity and heart rate, and a continuous glucose monitor, for 10 days. We've also built a custom environmental sensor that can measure pollution levels, including PM1, PM2.5, PM4, and PM10, nitrogen dioxide, and volatile organic compounds. The environmental sensor also has a spectrometer to measure the distribution of wavelengths of light present inside people's homes, to hopefully understand how that might affect circadian rhythm. Then we will bank a whole bunch of blood, including plasma and serum, and we will store the blood in a way that will allow the generation of iPSCs and organoids in the future. 
We hope to perform whole genome sequencing if the cost of whole genome sequencing comes down, and most ambitiously we plan to open source our data set. We hope that we will be able to open source almost all the domains that you see here on the screen in a fashion where anybody with an internet connection anywhere in the world can click and download this data set and start using it for discovery. We have a lot of goals and challenges that we have to overcome in order for this project to be successful. We have a group of people who come from non-ophthalmology domains, who have never worked together before and who have a very tight timeline to deliver on this, and so we'll need help with team science. We want to promote diverse perspectives, consider all the ethical considerations and engage the community. We need to build standards where they don't exist, especially in some of the ophthalmic imaging domains. DICOM is ideally the format that all the imaging should be in. We are building a cloud-based platform for sharing and accessing the data, and we want to train the next generation of AI researchers so that they can use this data set. So our project is really everything that you see in this sort of green box, but what we hope that others will do as we release this data set is that they'll use it to do sort of hypothesis-generating type research or biomarker discovery, and then if the manifolds are constructed they can even do predictive AI modeling where they can study things like therapeutic targets and salutogenesis. We have this vision that any new data that's generated will be deposited back into the open repository. We have an ongoing engagement with American Indians, and we hope that many other people will use this data set for their research, and we really want to sort of accelerate the field of data science and medicine as a result. So this is our project. That's our website and that's our QR code.
You're welcome to sort of follow along, and if you're interested in being involved definitely let me know. I want to end on a very positive note: I believe that the future of data science in eye care, as well as in vision science research, is very bright, mainly because of Michael Chiang becoming the director of the National Eye Institute. Michael Chiang is known for many, many different things, but he is also an informaticist and is well known for his work in bringing deep learning to retinopathy of prematurity, and so I'm hopeful that his vision for where the National Eye Institute goes is towards harnessing data science. I want to end by thanking all my funding sources as well as the members of my lab; none of this research would have been possible without them. I would be happy to take questions as my plane lands in Salt Lake City. Thank you very much, and again I apologize for not being there in person. I will be on campus shortly if all goes well.