Thanks, everyone, for joining in. I have the pleasure of introducing Dr. Dennis to all of you. Dr. Dennis is no stranger to most of you at Agile India: she has been a keynote speaker before at Agile India, and also at the Open Data Science Conference, ODSC. So we've been fortunate to have her. I remember travelling to Sydney, meeting her in the lobby, and trying to convince her to come to India, and she gracefully accepted the invite. We've now known each other for a while, and she's been doing some amazing research and open-source work.

The way I generally like to introduce Dr. Dennis is this: in the data science field there's a term that some data scientists get the privilege of, and they're called unicorn data scientists. A unicorn data scientist is someone who has deep subject-matter expertise, who understands computer science and really knows a lot about programming, and who, as the third aspect, has the math and stats side of things. In my experience, she scores 10 out of 10 in every one of these areas. So it's always a pleasure to listen to Dr. Dennis and the real work she's doing to impact millions of lives. It's such an honor to kickstart this conference with her. So without much delay, I want to hand it over to you, Dr. Dennis.

Thank you so much for the very generous introduction. As always, I say I am not a unicorn, because the skill sets that I have, a majority of people in the audience probably have as well; and if not, they are worthwhile to learn along the way, and hopefully this talk inspires you to pick up the skill sets you might be missing. So on that, let's jump straight in and let me share my screen.

Let's get started with how digital disruption in the healthcare system really streamlines medical research. I'd like to start with a statement that is potentially divisive: technology has become as important, or almost as important, as clinicians themselves. By that I mean that it has seeped into every element of the healthcare system, and therefore a clinician who is underpinned or supported by technology performs far better than a clinician who is not. But in order to enable clinicians to perform at that high level, we need to build those technologies, and I think everyone here in the audience is capable, interested, and certainly needed to help build them.

So with that, allow me to introduce the research organization I work for, which is Australia's government research agency, CSIRO. At CSIRO we're really passionate about translating research into products that people can use in their everyday lives. The most famous product is of course Wi-Fi, which we co-invented together with Macquarie University and which is now used in more than 5 billion devices worldwide. On the more clinical or health side, the product I'd like to highlight here is Cardihab, the first clinically validated mobile app for cardiac rehabilitation, where mobile technology was included in clinical practice in order to streamline and improve clinical care.
And then of course, on a lighter note, together with Samri we developed the Total Wellbeing Diet, whose recipe book is now on the bestseller list alongside Harry Potter and The Da Vinci Code. From my perspective, we have a nice balance between products that people enjoy and products that people need. I'm clearly a bit more on the need side, working in genome analysis and the medical science around it, and so the stories I want to tell you today are threefold.

The first is about how we can create digital technologies that scale to populations, that is, handle the volumes of data needed to look after the whole population of a country. The second progresses from that: if you have this volume of data, you need to interact with it, and this is where the interoperability and data-exchange story comes in, and how we can bring this into clinical practice. The last one is around science accessibility, and how the technology we invent can empower innovation and create something that is larger than the sum of its parts.

So, to begin: everyone has mutations that should inform their clinical care. Remember the genome, which is basically the blueprint that defines how every cell in our body works, how our organs interact with each other, and how our whole system functions. It holds information about which treatments or drugs we respond to and which future disease risks we carry. With all that information encoded into the three billion letters of our genome, we should be using it more and more in clinical practice: to find the treatments that work rather than going on a diagnostic odyssey, and to do better prevention, because if we have a disease risk whose symptoms might appear in 10 to 20 years, we might want to make lifestyle choices today in order to prevent those symptoms from ever happening.

But unlocking the information in the genome is difficult. As I said, there are three billion letters in the genome, and it's like finding the needle in the haystack. Typically, when you want to find disease genes, genes that influence a future disease risk, you collect individuals, look at their genomes, and identify the differences between one person and the next; on average there are two million differences between any two individuals. You collect people who have a disease and a healthy control cohort, and then you identify what in the genome actually makes the cases, the people with the disease, different from the controls. Coming back to the three billion letters, and the fact that not every position contributes equally, this is a really complex task, and we're using machine learning to identify the complex interactions in the genome that inform disease.

Now, doing machine learning on three billion positions times 10,000 individuals, a trillion-data-point dataset, is very difficult, because all that information needs to be kept in memory so you can iterate over it repeatedly. We're probably all familiar with traditional high-performance compute (HPC) clusters, and they are designed for compute-intensive tasks.
But what we're dealing with here are data-intensive tasks. In a high-performance compute setting, all that information cannot be kept on one node; this is what I'm showing here with the little black outlines around the orange CPUs. When data has to go from one node to the next, that transfer has to be programmed explicitly to be handled properly, whereas in the data-intensive world it happens all the time, so this exchange is really cumbersome to implement. Luckily, Apache Spark has come along, which basically dissolves the boundary between the nodes, so each CPU in the cluster can be accessed in a standardized way.

Built on Spark, we developed VariantSpark, a machine learning method for genomic analysis. We've shown that on today's data it's 3.6 times faster than the technologies available from Google and other organizations around the world. The key, though, is that its runtime increases only linearly, which means that on tomorrow's datasets of more than a trillion data points it can finish in 15 hours, rather than the 100,000 years other technologies would take.

With this enormous compute demand, we go to the cloud, because commodity hardware in the cloud is virtually unlimited, so our Spark clusters can become really large. This is what I'm showing here with the architecture diagram of our AWS setup: the Elastic MapReduce (EMR) Spark cluster in the middle interacts with the actual data in an S3 bucket, it has security groups and VPC security around it, and we access it through a Jupyter notebook.

The key with this setup is that we can simply change the data fed into the framework, which means we can apply it to, for example, motor neuron disease, the disease you might be familiar with from Stephen Hawking, who suffered from it. The goal of that consortium was to identify the underlying molecular mechanisms, the disease genes that drive the disease, and here we looked at 15,000 cases and controls.

But as I said, the data is independent of the rest of the framework, so we can also apply it to heart disease. Here we're currently working with the largest genomic dataset of its kind available in the world, 50,000 cases and controls, that is, 50,000 people with and without the disease. Again the idea is: can we find the drivers causing a specific heart disease? The key finding was this: we all know that when you have certain protein markers in your blood, you have a certain elevated risk. But the analysis found that you can have a normal protein-marker level that would classify you as healthy, and yet, if you also carry a genomic marker that puts you at higher risk, even that normal level means you are at increased risk of suffering a heart attack. This is vital information for reclassifying who is actually at risk of developing heart disease.

Then we went one step further and applied it to COVID genomes. The virus that causes COVID-19 has been sampled and sequenced from around the world, and almost 5 million samples are now available. Here the question was: can we identify the variants that are causing more severe disease?
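Across all of these studies the core computation is the same: rank a huge number of genomic positions by how well they separate cases from controls. As a hedged illustration of that idea, and not VariantSpark's actual API, here is a minimal sketch using Spark's built-in random forest; the file path and column layout are assumptions:

```python
# Minimal sketch of a case/control variant-importance analysis on Spark.
# This uses Spark's generic RandomForestClassifier, not VariantSpark itself;
# "genotypes.parquet" and its column layout are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("variant-importance").getOrCreate()

# One row per individual: a label (1 = case, 0 = control) and one column
# per variant, coded as 0/1/2 alternate-allele counts.
df = spark.read.parquet("s3://my-bucket/genotypes.parquet")
variant_cols = [c for c in df.columns if c != "label"]

# Pack the genotype columns into the single feature vector Spark ML expects.
assembled = VectorAssembler(inputCols=variant_cols,
                            outputCol="features").transform(df)

# Fit a random forest and rank variants by how much they contribute to the
# case/control split; top-ranked variants are candidate disease drivers.
model = RandomForestClassifier(labelCol="label", featuresCol="features",
                               numTrees=500).fit(assembled)
ranked = sorted(zip(variant_cols, model.featureImportances.toArray()),
                key=lambda kv: -kv[1])
print(ranked[:10])
```

VariantSpark itself scales this same idea to millions of variant columns, which is exactly the wide, data-intensive layout that the node-local HPC setup struggles with.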
The reason that question matters is that we know the virus is mutating. When it first crossed from bats, probably through an intermediate host, to humans, it was in a new environment, and it is still adapting to that environment. When it spreads from individual to individual, it picks up mutations that help it adapt, and these then get spread on and built upon. That can produce more infectious strains, which spread more easily from human to human, or more virulent strains, which cause more severe disease outcomes. We want to identify all of these: the mutations that make the virus more infectious or more virulent.

So we went back to that 5-million-sample dataset and extracted the samples that had a clear annotation for severe or mild disease. To our shock, we found only 5,000 samples that were properly annotated; the rest of the 5 million did not have that annotation. I'll go into a little more detail on why that's the case. For our purposes, 5,000 was luckily good enough, and we could go ahead and identify which mutations are associated with more severe outcomes. We've probably all heard about the spike protein, the protein the vaccines were developed against, because it's the thing that brings the virus into the cell. But the viral genome has more elements, around reproduction, around adhering to certain membranes, and around its general function, and what we found is that elements outside the spike protein influence whether the virus is more pathogenic or more infectious. So our knowledge of this virus is actually still in its infancy, and there's much more we need to learn about it.

This brings me to the next story, around data interoperability. I already said we were shocked not to be able to use more of those 5 million data points. The reason is that when a sample is submitted to GISAID, the largest database of its kind in the world, there is one field called "patient status", where an annotation about the patient's outcome is meant to go. Back in May 2020, most of this was not even filled in, so it was simply not provided. We then partnered with GISAID to ask them to make that field mandatory, which helped a little, in that slightly more information became available around hospitalization, asymptomatic cases, and so on. But the majority was still "unknown", because the information was actually not available at submission time. We therefore partnered with them to help the submission system capture this data in a better way. By May 2021, a year later, there is a lot more information, but "unknown" is still the predominant value.

So we went in and designed something that integrates directly with the healthcare system through a standard called FHIR, the Fast Healthcare Interoperability Resources standard. The idea is that the information is coded: rather than typing free text such as "loss of sense of smell" or "anosmia", which belong to the same bucket of information, you code these terms and put an ontology on top, which makes them machine-readable. And if you do that, you might as well extract that information directly from the healthcare system.
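As a hedged sketch of what such a coded record could look like, here is a minimal FHIR Observation built in Python; the endpoint URL is hypothetical, and the SNOMED CT code shown is an illustrative choice for anosmia, not a value taken from the talk:

```python
# Minimal sketch: submit a coded "loss of sense of smell" finding as a FHIR
# Observation instead of free text. The server URL is hypothetical.
import json
import requests

observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [{
            "system": "http://snomed.info/sct",
            "code": "44169009",          # illustrative SNOMED CT code: anosmia
            "display": "Loss of sense of smell",
        }]
    },
    "subject": {"reference": "Patient/example"},
    "effectiveDateTime": "2021-05-01",
}

resp = requests.post(
    "https://fhir.example.org/Observation",   # hypothetical FHIR endpoint
    headers={"Content-Type": "application/fhir+json"},
    data=json.dumps(observation),
)
print(resp.status_code)
```

Because every system records the same code rather than its own free text, downstream tools can aggregate patient outcomes without manual curation.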
And that's basically what we've implemented and published. Hopefully, going forward, it will be used more, and it can produce the data we actually need in order to better understand how the mutations in the virus perform in the real world, and what the disease progression of those individuals is. Similarly, this information will hopefully help us see how the vaccines are performing, and whether there are any escape mutants, where the virus has accumulated a mutation that evades the immune system that was trained to recognize a specific version of the spike protein. If the virus creates another version of the spike protein, an escape mutant, we need to know about it early on so we can update the vaccine and create a new version of it, just as we have annual versions of the flu vaccine.

Ontoserver is something CSIRO has created that can help with exactly that, with capturing this information in a systematic way. It is used worldwide, it underpins Australia's and the UK's national health systems, and I think it will be absolutely central to the disruption of the digital health system in the future.

So with that, back to COVID, and how we hopefully will be able to track those mutations that cause more severe disease, or cause the vaccine to stop working, or any other changes in the virus as they start to emerge, so that we're alerted early on. Here we partnered with IGIB from CSIR, India's national research organization, to combine their dataset with Australia's dataset and create a sort of Asia-Pacific regional awareness capability: if a mutation is emerging in Australia or in India, we have a system that can alert us to it before it reaches pandemic levels again.

This capability is actually needed in the human health system as well, and here we developed the serverless beacon. Beacon is a protocol that is already well used in the healthcare system, for rare genetic diseases. A clinician will have a patient who presents with a set of different mutations. Remember, I was saying there are 3 billion letters in the genome that could be changed; on average we have 2 million differences between one person and the next, and around 250 of those actually destroy a certain function in our body, and that is in healthy individuals. So there are 250 broken proteins, if you want, in your body, yet you function perfectly well. But that also means that if someone tries to diagnose you, they have to sift through those 250 and rule them out in order to find the 251st, the one that might be causing the genetic disease. For that, the Beacon protocol is absolutely critical: the clinician can go to the cohorts around the world and ask, "have you seen this particular mutation?", because if you have, then chances are it is not the one driving this rare genetic disease, since it's seen too frequently around the world, and by definition a rare genetic disease cannot be common. So it is absolutely critical for rare genetic disease research to share genomic data from around the world, and the Beacon protocol allows exactly that. And the reason ours can do that is because it's serverless.
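Before getting to the serverless part, here is a hedged sketch of what that clinician-side question looks like on the wire; the endpoint URL is hypothetical, and the parameter names follow the public Beacon v1 convention as I understand it:

```python
# Minimal sketch: ask a (hypothetical) Beacon endpoint whether a specific
# single-letter variant has ever been observed in its cohort.
import requests

resp = requests.get(
    "https://beacon.example.org/query",    # hypothetical Beacon v1 endpoint
    params={
        "assemblyId": "GRCh38",            # reference genome build
        "referenceName": "13",             # chromosome
        "start": 32315473,                 # position of the variant
        "referenceBases": "G",
        "alternateBases": "A",
    },
)
answer = resp.json()
# A beacon deliberately answers only yes/no, so cohorts can participate
# without exposing individual-level genomes.
print("seen elsewhere:", answer.get("exists"))
```

If `exists` comes back true in cohorts around the world, that variant is probably too common to explain a rare disease and can be ruled out.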
So again, allow me to walk you through how I think about serverless. We're familiar with desktop computers, where you can install whatever you want, and it's relatively cheap to run because it's your own machine, its own self-contained entity. I would say this is akin to owning your own car: you can do whatever you want with it and it's relatively cost-effective, but you only have that one car, and when you need a bigger car, well, too bad. Similarly, you have to look after that car, bring it in for service, and so on.

If you don't want to do that, and you want to be more flexible, you might hire a chauffeur. That gives you flexibility with no overhead, because the chauffeur brings the car in for service and can come back with a different car, but it's not very cost-effective; in India it's a little different, but in the rest of the world chauffeurs are very expensive. This is akin to an auto-scaling group in the cloud, where you have all the flexibility of scaling up and down, and someone else looks after the system, but it's very costly, because that system doesn't just go away the moment you no longer need it.

And this is exactly where serverless comes in. With serverless you can scale up and down basically instantaneously, there's no cost involved in growing or shrinking the system, and yet you have all the flexibility with none of the overhead. The analogy would be a ride-sharing app, where you request the right size of car only at the time you need it, and it goes away, at no cost, when you don't need it anymore.

This capability is exactly what we needed in the human genomics space, where one ginormous query might come in from a clinician, but there can be long stretches where the system is not queried at all. Provisioning for that one ginormous query all the time is very costly, and we don't want that; with serverless we don't have that problem. Just to hammer that home: done the traditional way, it would have cost more than US$4,000 to maintain this system, and we brought that down to less than a cup of coffee per day, $15 per month, to serve that information to the rest of the world. That means a lot of organizations around the world can now afford to share their vital genomic data. Information about underrepresented minorities, every bit of information from all corners of the world, is absolutely vital for understanding the full population structure of humans, and with this cheap, affordable setup we can enable and basically democratize that access.

Going one step further, we want to offer this at population-scale quantities. Here we looked at the population of the US: 350 million individuals, each with 3 billion letters in their genome, which means we're dealing with a one-quintillion-data-point cohort, and we want to process it in real time. With serverless beacon, we can do that in a second.
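To make the ride-sharing analogy concrete, here is a hedged, minimal sketch of how a beacon-style lookup can run as an AWS Lambda function behind an API gateway; the index layout and names are illustrative assumptions, not the published serverless beacon implementation:

```python
# Minimal sketch of a serverless allele lookup: the function exists only for
# the instant a query is being answered, so an idle beacon costs almost
# nothing. The index below is hypothetical; a real deployment would read a
# pre-built index from S3 or a database rather than holding it in memory.
import json

# Hypothetical index mapping "chrom:pos:ref:alt" -> observed count.
VARIANT_INDEX = {"13:32315473:G:A": 12}

def lambda_handler(event, context):
    q = event.get("queryStringParameters") or {}
    key = "{}:{}:{}:{}".format(
        q.get("referenceName"), q.get("start"),
        q.get("referenceBases"), q.get("alternateBases"),
    )
    # Answer yes/no only, in the spirit of the Beacon protocol.
    return {
        "statusCode": 200,
        "body": json.dumps({"exists": VARIANT_INDEX.get(key, 0) > 0}),
    }
```

Because the platform bills only for the moments the handler actually runs, a beacon's idle-most-of-the-time usage pattern maps almost perfectly onto this pricing model.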
This brings me to the last story, around science accessibility. With COVID in general, and the cloud specifically, the talent and the solutions around the world have become more accessible. The cloud is basically a bridge between developers and the people who need a certain task done, and specifically, I would argue, it bridges the gap between research and industry.

We've been using exactly that for the past five years. In 2016 we developed our first serverless application, which demonstrated that serverless can handle something as complex as a research setup. In 2019 we brought a digital product to the AWS Marketplace, roughly the first health product on that marketplace. And in 2020 I became an AWS Data Hero, Australia's first, and the only Data Hero operating outside the IT space, in an academic research setting. So I see myself as a conduit, bringing research brilliance into the industry setup.

Coming back to industry: the health product we put on the AWS Marketplace is, on one hand, a straightforward commercial narrative, an academic product placed on the marketplace so that everyone in the world can run it on their own data. But there's also a second narrative, around reproducibility. It is exactly the same setup that people can spin up at the press of a button: through infrastructure as code, the entire architecture I showed you before is encoded in one single file. People press "go" on it, and it automatically creates that architecture in their own account, which means the architecture, the data handling, the workflow, everything is already there, and all they need to do is point it at their data. For India, I think this is really interesting, because not only is it reproducible, it also makes it easier to build larger systems on top, to stand on the shoulders of giants and string individual components together into something that is larger than the sum of its parts.

So, to summarize, there are three things to remember from my talk. The first is that disruption in the healthcare system really requires us to scale to massive workloads. It has been projected that by 2025, which is just around the corner, half the world's population will have been sequenced. That is a massive amount of data, larger than Twitter, YouTube, and astronomy, the traditional big-data disciplines, combined; for the healthcare system it's unprecedented, scary, exciting, and definitely something new, and having technology experts like yourself help out in that system is absolutely crucial. We've dipped our toes into this with VariantSpark, the genomic system that can handle a trillion data points to identify disease genes, and we've demonstrated it on, for example, the COVID dataset, looking at which mutations can cause more severe disease.

The second part is that the disruption will come from data being shared globally. Once you have this gold mine of data, you can't hoard it; you have to share it in order to really get the benefit from the other datasets around the world as well. Here we developed the serverless beacon approach, which allows data to be shared in a more democratized way, where even smaller organizations around the world can contribute the genomic data of their rare populations, so that clinicians around the world get a better understanding of how the human genome actually functions. And of course Ontoserver, together with the FHIR interoperability standard, will be a large part of that, making medical data more ontology-based and machine-readable so it can be shared directly from the healthcare system.
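As a hedged sketch of how a terminology service fits into that picture, here is a minimal lookup against a FHIR terminology endpoint using the standard `$lookup` operation; the base URL is a placeholder rather than a specific Ontoserver deployment, and the code reuses the illustrative anosmia concept from earlier:

```python
# Minimal sketch: resolve a SNOMED CT code to its human-readable display
# name via the standard FHIR terminology $lookup operation. The base URL
# is a placeholder for any FHIR R4 terminology server.
import requests

resp = requests.get(
    "https://terminology.example.org/fhir/CodeSystem/$lookup",
    params={"system": "http://snomed.info/sct", "code": "44169009"},
    headers={"Accept": "application/fhir+json"},
)
resp.raise_for_status()

# The response is a FHIR Parameters resource; pull out the display name.
params = {p["name"]: p.get("valueString")
          for p in resp.json()["parameter"] if "valueString" in p}
print(params.get("display"))
```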
But I think the biggest change in that system is that rather than bringing the data to the compute, it will be the other way around: the compute, the analytics, needs to go to the data. We therefore need to come up with systems that are flexible, agile, and federated, to cater for a world where the data is distributed all over the globe in its own little buckets but has to be brought together to make sense of it, with distributed machine learning around that as well. The digital marketplace is one aspect that can cater for this, because it allows you to create a system in your environment over here that can then be rolled out to the other systems around the world in a reproducible way, and in a way that allows people to build on top of it, so we can create systems that are larger than the sum of their parts.

With that, I want to say thank you for listening. If any of this sounds interesting, I encourage you to go to our webpage, bioinformatics.csiro.au, because there are many more case studies and many more examples of how we could work together, from a technology perspective and from a domain perspective, in order to take healthcare one step further. Thank you very much.

Wow, awesome, what a great way to start the conference. Thank you, Dr. Dennis; we expected nothing less. You very nicely summarized, in those last three points, the key of the experience you've had, and in the past you've obviously shared other case studies your team has been working on, so this builds on top of that. It's also very relevant as we're going through COVID. I was just mentioning in the chat that there are so many organizations I know of, technology organizations even, that are still afraid to move to the cloud, while you not only moved to AWS five years ago but are also pushing the envelope on serverless, with a lot of open-source contribution from your team as well. So it's pretty awesome, and I would again encourage everyone in the audience who's interested in this space to help out with this noble cause; this is AI-for-good kind of stuff, so bring it over, and it would be great to help advance medical science and humanity.

So I'll now quickly switch over to questions. I see we have five questions for you, Dennis, so I'll quickly go through them. We have sufficient time, so nothing to worry, folks; if you have more questions, please pour them in. We will take 10 to 12 minutes more, so we'll get through a good number of questions.
And of course, Dr. Dennis will be available after this in the hangout section for you to have a face-to-face interaction with her. So with that, let's jump to the very first question we have, which is: how is data security managed in serverless, particularly around data sharing and restrictions on storing sensitive data? I'm sure you get this question a lot.

Yeah, this is a fantastic question, for sure. One element of this is that the data at rest needs to be protected and encrypted, which leaves the data in transit, and in serverless the transit is quite frequent, because the data goes through all of those systems; how can you protect it on all of them? So there's definitely a catch in that in the past, with Lambda functions in AWS, and it's the same with the other cloud providers, you don't exclusively own a whole machine; you share it with other people, and the data that is copied in there, in the temporary storage and so on, is theoretically accessible, with quirks and things like that, by other users as well. Now AWS and the other cloud providers have recognized that problem, especially for sensitive data, so you can now specify that Lambda functions need to be exclusive on one machine, which means you reserve the whole machine and all of its memory, and you can purge the actual information afterwards. So I think we have come a step further, but you're absolutely right on data security: you don't have the full control, you don't have the same level of control over where the data is, especially when your systems are failing, when they're erroring out and things like that, that you would have with, say, a virtual machine. So I think this is still something that we as a serverless community need to be mindful of and keep working on, but I think we've come a long way already, in making things exclusive, looking at what's in the error files, and so on. I'm confident we'll get there; we're probably 90% of the way, and it's just the edge cases we need to work on now.

Perfect, all right, thanks, Ashut Narayan, for that great question. I'll move to the next one. We have a question from Rajit: what database are you using for such large datasets for ML workloads?

Yeah, excellent question as well. We started with DynamoDB, because once you go serverless you never go back, but we failed, because DynamoDB, and the same is true of the other setups, is designed for a certain amount of data input, output, and handling in general. So I have to say, most of what we're doing is still flat files, because in the human genomics space the indexing that has been developed is already so good that a database with a database schema doesn't really add any benefit. In saying that, we tried out Athena a while back, as sort of a conduit between flat files and a database, and that wasn't working for us either. So at this stage it's really not a database; it's an indexed flat file that we're working with.
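As a hedged aside on what "indexed flat file" means in practice in genomics, a minimal sketch using the common tabix-index convention via pysam might look like this; the file name is hypothetical, and the real pipeline may index differently:

```python
# Minimal sketch: region queries on a compressed, tabix-indexed flat file.
# "cohort.vcf.gz" (plus its .tbi index) is a hypothetical bgzip-compressed VCF.
import pysam

tbx = pysam.TabixFile("cohort.vcf.gz")

# The index jumps straight to the requested genomic window, so the query
# touches only a few kilobytes of a file that may be terabytes in size.
for row in tbx.fetch("13", 32315000, 32316000):
    print(row)
```

This access pattern is what makes a schema-less flat file competitive with a database for genomic range queries.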
Cool, so we can go delete all our databases. Cool, thanks, Rajit, for that question. Next we have two back-to-back questions from Pradeep. I'll start with his second question, which is: how come a doctor became so tech-savvy?

I think it's out of necessity. When you're facing the task of having to analyze that data, there's no other choice than to skill up; otherwise you're completely failing, and you're completely failing the patients who have donated their data for this greater good. So from our perspective there is this necessity, but in saying that, I was really lucky in partnering with experts who pointed me in the right direction and who helped us develop something robust, rather than something held together with sticky tape, which is probably what I would have built on my own.

Cool, great, awesome. Moving on to the next question from Pradeep: any reason you chose AWS over other providers?

The short answer is that we have also been working with Azure as well as Google, and to some extent Alibaba in the Chinese market, so we work with the three main cloud providers. The way I see them, it's horses for courses. When you look at Gartner's Magic Quadrant, AWS is the market leader in developing new technology quickly, so if you want something that is really, absolutely cutting-edge and want to use the latest technology, you probably can't get past AWS. When you work with the healthcare system, though, there is a different emphasis, on robustness, on having a throat to choke, if you want; there's a different quality control around that, and Azure is really good for that. It makes everything slow and tedious, but in terms of the robustness the healthcare system sometimes requires, they can definitely cater for that element, so if that's your priority, choose Azure. As for Google, they have a lot of engagement with systems here and there, and a lot of datasets available on their platform, so if being engaged with the rest of the data ecosystem is your priority area, Google might be your choice. So again, it comes back to horses for courses, and we're trying to be cloud-agnostic: we typically do our cutting-edge development on AWS, just because we want to see whether it's possible, and once it matures a little, it probably moves across to Azure, with Google sitting on the side as well. And in the genomics space specifically, Google is probably the one currently leading, because the Broad Institute has been supported by Google for so long. So yeah, multi-cloud is going to be the future.

Okay, cool. I believe the next two questions are pretty much answered by this; they are similar questions around platforms and such. I see an interesting, tricky question, I would say, from Hemant. Hemant is a little skeptical: he says he gets an impression of immense reliance on technology for data normalization, while these things may fail sooner or later. So I think he's asking what additional checks and mechanisms you have in place while storing data if these things fail, and how you're dealing with laws like GDPR and HIPAA compliance.

Yeah, that's a very good question. We are still working in the research space, which means we have permission from the patients to use the data, which rules out GDPR issues, because it's all anonymized.
In terms of HIPAA compliance, again, there is no additional information stored; it's just the information we need for our research project at this stage. In saying that, of course, moving forward, when we bring this into the healthcare system, all of this becomes relevant, and I would argue that GDPR and HIPAA do not go far enough, because there's a whole can of worms in the genomic space: yes, it's your genome, but it also gives information about your family, because they share parts of your genome. So whatever you decide to do with your genome, there is a risk that you're exposing people related to you as well. The Golden State Killer is probably the best example of that: they actually caught that person because one of his relatives had their genome processed by an ancestry service, and the police found him through that link. So I think in the future people will become more savvy about what kind of data is where and how it is shared. I therefore see a dynamic patient-consent model coming, where the patient gives consent for the data to be used under certain circumstances, and that consent can be updated at any point in time, which means your systems need to work with data coming in and going out, being in flux all the time. And I think there's going to be a really difficult, complex space around, say, self-sovereign identity and things like that, which we need to take into account as well. So yes, I definitely hear you that we're currently relying heavily on existing technology; new technologies, and new mindsets, need to be developed around the idea that the data is owned by the patient rather than by a consortium.

Perfect. And this is where we hope all of us can do our bit to contribute, because this is, I would say, bleeding-edge tech that all of us should be trying to help out with. So great question again, Hemant, and wonderfully handled, Dennis. I think we are pretty much out of time, so again, thanks, Dr. Dennis, and thanks, everyone, for joining in.