Hello everyone, and happy International Women's Day. I'm going to have a couple of slides with tiny, tiny code on them, so I invite you to come forward. There are still plenty of seats in the front, but if not, it will be a throwback to your childhood, where you sat on the floor and listened to stories of witches and dragons. Now, I can't offer witches and dragons, but I hope to show you some technological marvels that are, in their own way, magical.

With that, I want to tell you three stories. The first one is about CSIRO, Australia's government research agency, where I work. The second is about the technology we use to find disease genes, that is, to work out which genes cause certain diseases in humans; that will be the VariantSpark part. And the third story is about how we use serverless architecture to build sustainable, scalable research workflows, particularly in the genome engineering space. A little plug: the last one will be very short, because tomorrow I'll give a longer session on serverless architecture in general, which will feature GT-Scan, that technology, along with a couple of other cloud patterns I want to talk to you about.

So with that, let's jump into what and who CSIRO is. As I said, CSIRO is Australia's government research agency. It has about 5,000 staff, most of them PhDs, so research scientists, and we are in the top 1% of global research agencies. But what really makes us unique and special is that we're very passionate about translating research into products that people can use in their everyday lives. What are some of the well-known products? Well, we invented Wi-Fi: the fast modern Wi-Fi in your devices was actually invented by CSIRO. Other examples are the flu treatment and the polymer bank notes, plastic money that you can of course take surfing, which is why it's appropriate that it came out of Australia.

I am part of the e-Health Research Centre, the largest digital health centre in Australia and part of CSIRO, and the Wi-Fi-equivalent technology that we invented is Cardihab. Cardihab is a little mobile app used in cardiac rehabilitation. Once you have a stroke or another adverse event involving your heart, you usually go into rehab. You would think that having a heart attack or something like that would convince you to change your lifestyle; it turns out that it doesn't. This little app has increased the uptake of rehabilitation and the completion rate of rehabilitation by 70%, so I would say this little app has already saved lives.

With that, to my area, which is genomics, and which is all about variants. So what is the genome? The genome holds the information of basically every cell in your body: how to build your skeletal system, your muscular system, heart, lungs, brain, you name it. It's all encoded in a 3-billion-letter-long string. As such, it's not surprising that it regulates many things beyond the architecture of your body: how you look, which diseases you get, and which behaviors you exhibit. I usually do this little fun game where I ask people to look at their thumb: how far the last digit bends back is actually regulated by genetics.
Comparing thumbs with the person next to you, it's quite surprising how far back some people can bend that last digit. So where are the super-benders of the last digit? About 1 in 3 in this room should have a thumb that goes way back. Mine is pretty normal, but I can see some pretty impressive specimens down there. Similarly with coriander: in a typically Caucasian audience there is about 1 coriander hater in 6. But it turns out that in South Asia, which is right here, coriander haters are in the minority, about 1 in 12. So, are there any coriander haters in the room? One?

All of this is obviously a fun application, but there are other things encoded in the genome that are far more serious, for example genetic diseases. Cystic fibrosis comes down to a change at one position out of 3 billion letters: one letter is changed, and the result is a detrimental disease like cystic fibrosis. It's therefore quite important to study and understand how the architecture of the genome influences disease risk and the rest of our lives.

The question is: how can we tease the disease gene out of 3 billion letters? Conceptually it's a pretty straightforward application. You take the genomes, where every line here represents a person, and you identify how each person differs from the next. Between you and the person next to you there are about 2 million differences in the genome, and those differences are the little boxes shown here. Then you recruit your cases and your controls, the people that have the disease and the people that are healthy, and all you need to do is find the difference. Of course this is oversimplified and the real world is a bit more complicated, but for our purposes it's sufficient to understand how disease genes are found in the genome.

It sounds easy, but when you take into account the scale of the genome, and the scale of the cohorts that need to be included in this research, it becomes much harder. This slide shows that genomic information is growing at an unprecedented rate. By 2025, the intake of genomic data will outpace traditional big data areas like astronomy, Twitter and YouTube combined; it's estimated that by then 20 exabytes of new data will be generated per year in this space, which is staggering. And that scale is actually necessary in order to find the origin of complex diseases: here, the machine learning task of finding the disease gene needs to be done on 1.7 trillion data points. Take Project MinE, an international consortium we collaborate with that is looking for the genetic origin of the motor neuron disease ALS, which you might know from Stephen Hawking having had it, and from the ice bucket challenge. Finding the underlying genetic cause of that disease requires 22,000 individuals to be analyzed. Each of those 22,000 has on average 2 million differences, which equates to those 1.7 trillion data points, which is truly enormous.
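To make the find-the-difference idea concrete, here is a toy Python sketch of the case/control comparison described above. It is purely illustrative, on random data, and is not how VariantSpark works internally:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_controls, n_variants = 100, 100, 1_000   # real studies: thousands x millions
# Genotypes as 0/1/2 copies of the alternative allele at each position
genotypes = rng.integers(0, 3, size=(n_cases + n_controls, n_variants))
labels = np.array([1] * n_cases + [0] * n_controls)  # 1 = case, 0 = control

# Plant a signal: make variant 42 much more common in the cases
genotypes[labels == 1, 42] = 2

# "Find the difference": compare allele frequencies between the groups
case_freq = genotypes[labels == 1].mean(axis=0)
control_freq = genotypes[labels == 0].mean(axis=0)
print("most different variant:", np.abs(case_freq - control_freq).argmax())  # 42
```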
Luckily, the big data industry and other fields have come along and helped us. This is how I think about Hadoop and Spark, going all the way back to desktop computers, where you have a single CPU doing all the work. The next iteration, in my mind, was high-performance compute, which has many of these CPUs, but each CPU is basically distinct from the one next to it. That is built for compute-intensive tasks, for example meteorological predictions, that are truly independent of each other and purely need to crunch the numbers. Whereas in genomics, and specifically in machine learning, we have iterative tasks: every calculation depends on every other calculation, and the full dataset needs to be analyzed at the same time. This is a truly data-intensive task, and that is exactly what Hadoop and Spark are set up for. The reason is that each CPU is not disconnected from the next one; the boundaries between the CPUs effectively dissolve, which is what I'm trying to show here. We can therefore build parallelization approaches much more easily, and because of the standardization that Hadoop gives us, it's faster to implement the things that really crunch those large amounts of data.

So with that, we built VariantSpark, which is basically a machine learning, random forest, application for big data, and our contribution was how we parallelized it, really leveraging the strengths of Spark. With it you can do standard machine learning classification: given a new individual, predict whether that individual will develop the disease or stay healthy. But we also want to know the actual disease genes I mentioned before, identifying which gene causes the disease, which in machine learning terms is feature selection. How many of you are familiar with machine learning? Good. There is nothing really special about it, as long as you understand that we want to find the disease genes among the 3 billion letters in the genome, and we want to know which positions are predictive. That is the task VariantSpark is trying to solve.
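For intuition, here is a minimal scikit-learn sketch of that feature-selection idea: random forest importance scores picking causal features out of wide data. It is a toy single-machine stand-in, not VariantSpark itself, which implements this concept on Spark for millions of features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(500, 2_000)).astype(float)  # wide: 2,000 "variants"
y = (X[:, 7] + X[:, 99] > 2).astype(int)                 # two causal positions

forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
top = np.argsort(forest.feature_importances_)[::-1][:5]  # rank the features
print("top-ranked features:", top)                       # 7 and 99 should lead
```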
Even though this is genomics, we're here in front of a business audience, so I thought we'd do the exercise of thinking about which other disciplines have this wide kind of data. By wide I mean the following: typically when you talk about big data, you have many samples, and the data per sample is, say, the customer's age, location, sentiment, something like that, so a small number of features per sample. Here we're talking about the whole genome, which per sample is 80 million variants, a couple of orders of magnitude more than what traditional big data approaches are designed for. So rather than disease status, you might want to predict churn rate, or the occurrence of failure, or a security attack; and rather than genomic profiles, which are the mutations in the genome, you might look at time series, concatenated sensor data, or log files. With the datafication of basically everything, data will become larger and larger in that dimension as well as in this one. In your world the task is not disease gene prediction, but it might be finding general predictive markers: which time point is most predictive of a cyber attack, what log file information is predictive of the occurrence of failure, what customer behavior, concatenating all of it, is predictive of the churn rate. Keep that in mind as I talk through the rest of the approaches.

Rather than just claiming it's a wonderful technology that's theoretically better, we actually tested it. What I'm plotting here is accuracy, how well it predicts the outcome, versus speed, how quickly it can do that. As you can see, VariantSpark sits in the upper corner: high accuracy and high speed. Other technologies, for example Spark ML, which follows Google's PLANET implementation of random forests, a different parallelization strategy, are a little inferior. The reason is really that they did not have the need for wide data that we had. As more wide-data applications appear, more people will think about this, and the technology for wide machine learning will get better, but for the time being VariantSpark is the first one in the game.

We also worked really hard to make it scalable. By scalable I mean, first, the traditional way, many more samples, which is shown here by the different lines, from 1,000 to 5,000 to 10,000 samples; as you can see, it scales sublinearly, with the distances between the lines getting smaller. The other dimension is the wideness of the data, many more features, which is on the x-axis here, and as you can see that is linear too, which we are very proud of, I have to say. In terms of the actual money it costs to interrogate a data set like this, we plotted a couple of numbers. If you have a million (10^6) features, it costs you about 200 Australian dollars to analyze the data set; if you have 50 million features and 10,000 samples, it costs you $8,000. In DevOps and continuous deployment terms that is prohibitively expensive, but if you're thinking about a data science approach, where you want to generate a hypothesis and generate insights, this is totally feasible. Let me step back a little bit and talk about general data patterns.
Obviously, whenever you have a problem, you want to start with the business problem. To solve it in a data-driven way, you first curate the data, which helps you generate ideas and hypotheses that you can test; usually that involves cleaning the data and visualizing it to get a feel for it. Once you have that, you want to build a minimal viable product: you want to demonstrate to your bosses that whatever you build is actually predictive, so you start with a small test case. For that you need to scope the technology, so identify which cloud provider and which technology you want to use, and build the actual prototype. There is a lot of iteration involved in building the technology and the prototype: rinse and repeat after you've found that maybe the cloud provider you chose wasn't the right choice, or the technology wasn't right, but eventually you settle on a minimal viable product that you can show to the world. Once your company accepts that, the next step is of course to scale it up and put it into production. For that you really need to test it at scale, and you also need to provide an endpoint. It could be an API endpoint, but anything that can be called from the rest of the business without depending on the environment you set up.

With that, thinking about how to actually make all three of these steps work for you, you have a couple of choices. One is to set it up on premise, which probably is, or used to be, the easiest, because you can just use your high-performance computer, in my case, or the desktop computers and servers you have sitting around. That gets you from curating the data and doing the initial exploration, to building the minimal viable product, to, somewhere along the line, building the endpoint. Although if it's on premise, it's locked in there; if that's all your company needs, that's fine, but if it's a web application, that is a bit of a problem. So it doesn't take you all the way to production. The other option is Databricks. I'm not sure if you have heard of Databricks, but it's a US company, and what they offer is basically a managed, Jupyter-style notebook environment that connects to AWS and Azure resources so you can do data science on it. And that's pretty much the extent of it: it's wonderful for the initial data exploration, but it probably won't even get you to a minimal viable product that you can demonstrate a use case on, and it certainly is not a production system you can provide. This is where AWS SageMaker filled a need in the market, by providing exactly that. It also has the Jupyter-style notebook for the initial data exploration, you can use the whole suite of other services AWS offers to build your minimal viable product, and then it gives you a nice, easy way to package all of that up in Docker containers in order to provide the endpoint. We are working on that last step, and next year when I come back I can show you the outcome.
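As a minimal illustration of what "providing an endpoint" means here, independent of any cloud provider, the sketch below wraps a trained model behind a small HTTP API. The model file name is a placeholder, and SageMaker's value is packaging this same idea into managed Docker containers:

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # placeholder: any pre-trained classifier
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # The rest of the business only needs to know this contract,
    # not the environment the model was trained in.
    features = request.get_json()["features"]
    return jsonify({"prediction": int(model.predict([features])[0])})

if __name__ == "__main__":
    app.run(port=8080)
```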
For now, I want to show you the Databricks demonstration case. Sorry, Databricks earlier took my punchline away, so I wanted to show you something I'm actually allowed to share with you. Obviously, with genomic data and the privacy concerns around genomic data, I can't just take one of my research projects and show you the results. So we came up with a nice synthetic data set to solve this problem and show you how it actually works: our hipster index. I'm sure you're familiar with hipsters; in Australia they're very, very big, especially in Sydney. The stereotype is usually IT workers with the tech shirt, the beard, and always drinking coffee, and I'm sure you have those characters here as well.

To build the synthetic data and demonstrate feature selection on it, we did the following. With our hipster in mind: the hipster has a monobrow, has a beard, likes to wear tech shirts, and drinks a lot of coffee. We took actual locations in the genome that have been shown to be associated with those traits, and then put them together in a fashion that represents a complex disease, where all of these genes interact to produce the hipster phenotype. So this is our hipster formula; yes, very scientific. We then took a publicly available data set of 2,500 individuals, calculated with this score which of them is actually a hipster, and labeled them 0 or 1, hipster or normal.

The notebook basically walks you through how to run VariantSpark on a data set, a wide data set, that you might have. The first thing, obviously, is to load the data, and as you can see here, and I hope you can see it in the back, it's sitting in an S3 bucket. This is the AWS example; there is an Azure example as well, where it's sitting on Blob storage. Ultimately the idea is that you can swap data sets in and out and the rest of the workflow stays the same. And what is the rest of the workflow? We load the VariantSpark libraries, we load the annotation labels, which is our hipster index, the 2,500 individuals annotated 0 or 1 for whether they are hipsters or not, and then we actually run the analysis. It's really as simple as calling a one-liner to do all of that. I'm not going to go into the science of how we chose which gene it is; let's leave that for the research.
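To make that concrete, here is a minimal sketch of how a hipster-index-style label can be constructed. The trait weights below are illustrative placeholders, not the formula from the slide:

```python
import pandas as pd

# Genotype dosage (0/1/2) at four trait-associated positions, per person
traits = pd.DataFrame({
    "monobrow":   [2, 0, 1, 2],
    "beard":      [1, 0, 2, 2],
    "tech_shirt": [0, 1, 2, 1],
    "coffee":     [2, 2, 0, 1],
})

weights = {"monobrow": 1.5, "beard": 1.0, "tech_shirt": 1.0, "coffee": 1.0}
score = sum(traits[c] * w for c, w in weights.items())  # weighted trait score
labels = (score >= score.median()).astype(int)          # 1 = hipster, 0 = normal
print(labels.tolist())
```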
The other thing that I really wanted to show you is that data scientists typically don't come with a standard education, right? They come from all walks of life, with all sorts of computational backgrounds and preferred languages, and Databricks, and notebooks generally, cater for this nicely. The first example I have here is an SQL query: we simply select the importance, which is basically how relevant a particular location in the genome is to the disease status, so how predictive it is of being a hipster. We can select the first 25 of these important positions and plot them with a little one-liner like that. Similarly, we can use Python; the plot is a little more elaborate, but it's the same concept, plotting the importance score of all locations in the genome. Or, if R is your poison of choice, there is an R example for that; for plotting, R is quite nice with ggplot.

So this is showing you the actual result. Remember, we had four locations in the genome that we synthetically associated with being a hipster, and VariantSpark indeed identifies four locations in the genome. They nicely line up, otherwise I wouldn't have shown it to you, with the monobrow, the fabulous hair or beard, the tech shirt, and the coffee consumption. What I'm showing here is basically the whole genome: every dot represents a genomic location, every one of the 3 billion letters, and up here is the importance, the contribution of this particular location to the disease status, how strongly it is associated with it. As you can see, the four locations really stand out like a sore thumb. Now, what I want to stress here is that this is very gene-centric, or genetics-centric, but as I said, the same approach fits any other application that has a large number of features where you want to know which of those features is actually predictive of a certain outcome.
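As a hedged illustration of that whole-genome importance plot, here is a minimal matplotlib sketch on synthetic numbers, with four planted loci standing out the way the hipster loci do on the slide:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
importance = rng.exponential(0.001, size=10_000)   # background importance noise
for pos in (1_200, 3_400, 6_100, 8_800):           # four planted "hipster" loci
    importance[pos] = 0.05

plt.scatter(np.arange(importance.size), importance, s=4)
plt.xlabel("genomic position")
plt.ylabel("importance score")
plt.title("Importance per position (synthetic illustration)")
plt.show()
```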
Going back to the presentation: I really encourage you to try it out yourself. You can log in, or create a free account on Databricks, and this is not a plug for Databricks, copy this particular notebook, point it at the big data set that you have, follow the steps in there, and see what comes out of it. With that, I'm really trying to build an international community around wide machine learning, and if it can solve some genetic diseases along the way, all the better. A lot of people come up to me and say: well, this is really nice, I would like to contribute to that as well, I want to do some genetic exploration, or at least be part of the community that improves human health. Can I do that? I don't have a science background, I'm not working at a research institute, how can I help?

And my answer is: yes, oh my god, your background is absolutely fantastic for helping in that way. Let me give you a couple of examples where volunteers without a life science or health science background have helped us dramatically to move this research forward. For example, Lynn Langit, whom you might know as one of the largest contributors to Lynda.com, had the question: what kind of architecture is actually best for running these analyses? Is Databricks the way forward, or should we set it up on AWS directly, so basically closer to bare metal? What she did was compare a small cluster on Databricks, a small cluster on AWS, and a large cluster on both, and record the runtime across all four scenarios. What she found is that there is not much of a difference between AWS and Databricks: for the large cluster, it takes 3 hours on AWS and 2.7 hours on Databricks, which in research terms is nothing. This really cleared the path for us to set up SageMaker, which is such a nice, easy pathway from conception, through the initial analysis, to an endpoint. Similarly with DiUS, a consulting company in Melbourne that typically works on research as well as business applications. They took the Scala API that we implemented and ported it over to Python, which for them was an easy task, because it's what they do on a daily basis. For us, though, it was a game changer: there was no way we would have had the resources to recreate in Python the Scala application we already had, knowing that Python is probably the preferred language of the community. So again, this was wonderful help from volunteers.

With that, I want to change track and go to GT-Scan. As I said, this is a genome engineering application. So what is genome engineering? I'm sure you've heard of CRISPR being a revolutionary technology, a real game changer for making edits in the genome of living organisms, and the obvious application case here is to cure genetic diseases. How close we actually are to this golden age of medicine was demonstrated by a paper last year, where they edited out the genetic disease hypertrophic cardiomyopathy, a heart disease; in Australia, I think 1 in 500 Australians suffer from it. It's a really nasty disease: the wall of your heart grows thicker and thicker, and eventually your heart just stops and you die from it. So obviously this is a disease you want to be aware of, and ideally manage, and this might be one of the technologies that can help us do that. But the problem is that they demonstrated it works in seven out of ten embryos that they edited. Which is fantastic, I mean, seven out of ten diseases cured, but if you think of this being your unborn child, then a three-out-of-ten failure rate is way too high. This is the area we want to work on: we want to make it work the first time, every time, in order to really eradicate the genetic diseases that are amenable to these approaches. Coming from a computational background, our way to improve this performance and make it more efficient is to increase the speed, to make it faster to predict the outcome,
which means you can test more parameters in the same time frame. By parallelizing it, we brought the prediction down from a couple of minutes to seconds. And we increased the scale, because researchers might want to search the outcome for one gene, or they might want to search the rest of the genome as well, which is 100,000 genes. Scaling something that takes a couple of seconds up to 100,000 invocations has not been easy in the past, especially for web applications, but with Lambda coming along, and serverless architecture in general, this is a match made in heaven. So when we first heard about it, we really jumped at the opportunity, and we were the first serverless application complex enough to cover a full research pipeline. As such, we got a lot of attention in the international media, and it's now used in a couple of high-profile research institutions in Australia.

So what is GT-Scan? GT-Scan is basically a search engine for the genome: researchers type in the gene they want to edit, and GT-Scan returns a ranked list of which locations near that gene are the most optimal for editing it, focusing the resources, and, thinking about embryos, focusing the resources on only the sites that actually work. As you can see down here, every bar is a site that the genome editing machinery could attach to and edit. The green ones are the high-activity sites and the black ones are not, and just looking at it, I mean, they're right next to each other; how would a researcher know which one is best? This is where GT-Scan really comes in.

So again, a quick demo. The application case here is interoperable and reproducible research: rather than researchers manually typing in which gene they want to edit, I want this to be an API call that can be made from within a Jupyter notebook, and that is exactly what I'm showing you now. Again, this is genomics-centric, but as soon as you have an API gateway involved somewhere, this approach works for your research too, and in data science terms this is a fantastic pattern. So obviously you set up a couple of libraries, and then you set up the actual application. Here, we want to search the genome for two tissues: one cell line, neutrophils, and another one, heart tissue. Rather than calling GT-Scan twice manually, we do it automatically: we submit both searches to the API gateway and collect the IDs back; as you can see, we get two IDs. With those, we collect the actual results, the actual predictions, back from GT-Scan into this variable: these are the locations that we found, with the prediction score by which we can identify the best site. Scrolling a little further down is the typical visualization and data cleaning you would do as a data scientist, and ultimately we have recreated in the Jupyter notebook what we saw in the GT-Scan web application. Again, every triangle here is a candidate site, green ones are high-activity and black ones are low-activity sites, for the two cell lines that we want to interrogate.
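Here is a hedged sketch of that submit-and-collect pattern from a notebook. The endpoint URL, payload fields, and response shape below are illustrative assumptions, not the real GT-Scan API contract:

```python
import time
import requests

API = "https://example.execute-api.amazonaws.com/prod"   # hypothetical endpoint

def submit(gene, tissue):
    """Submit one search job to the API gateway; return its job ID."""
    r = requests.post(f"{API}/submit", json={"gene": gene, "tissue": tissue})
    r.raise_for_status()
    return r.json()["id"]

def results(job_id, poll_seconds=5):
    """Poll until the serverless pipeline has written the results."""
    while True:
        r = requests.get(f"{API}/results/{job_id}")
        if r.status_code == 200:
            return r.json()          # candidate sites with activity scores
        time.sleep(poll_seconds)

# One search per tissue; the gene name is just an example
ids = [submit("MYH7", tissue) for tissue in ("neutrophil", "heart")]
sites = [results(job_id) for job_id in ids]
```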
Now we want to know which site is actually different across the two tissues. Again, eyeballing it is hard, but it's in a notebook, so we can do it programmatically. And this is the actual result: out of all these sites, we find one that is high-activity in heart and low-activity in the other tissue. If you think of an application case where you only want to target one particular tissue but keep the rest of the human body intact, or unedited, this is how you find which site to edit with high precision for a certain tissue.

With that, I'm jumping to the generalization of all this. As I said, the cloud pattern that we follow as data scientists is: we start with a problem; we find the data set that enables us to build a hypothesis; we clean and visualize the data; we build a minimal viable product to interrogate that hypothesis; and if it holds, we scale it out to production. For the wide machine learning application, our particular business case was to find disease genes. The data is obviously genomic data, and we use notebooks, in this case Jupyter notebooks on Databricks, to visualize the data sets, in whichever languages are appropriate for interrogating the data. The minimal viable product is of course VariantSpark, an Apache Spark application that we ran on Databricks in the first place but that will ultimately go to AWS EMR. And the scaling up, opening it to the research community, particularly Project MinE, the motor neuron disease (ALS) consortium I talked about with its 22,000 individuals, is truly a scaling-up problem, ultimately making it available through SageMaker as an API endpoint.

GT-Scan, the serverless application: there we needed a web service that is compute-intensive but only used sporadically. The alternative would have been to set up a hugely expensive EC2 instance, a server that is always on. Great, but it always costs; you pay exactly for it being always on. So a serverless architecture, where you only pay for what you actually use, was the perfect match. The genomic data sits on S3 and in a NoSQL database, DynamoDB. The minimal viable product, of course, was GT-Scan itself, which uses Lambda functions and an API Gateway, and the API Gateway basically gives you interoperability from a notebook for free. The progression to production and the scaling up was, again, the research community, but here we needed to build in an auto-scaling approach; specifically, the DynamoDB database needed to scale from one user interrogating the data set to, dreaming big, 100,000 users. And again, SageMaker would be a perfect match here as well.
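For a flavor of the serverless pattern just described, here is a minimal sketch of an API Gateway-triggered Lambda handler. The table name and item fields are illustrative assumptions, not GT-Scan's actual code:

```python
import json
import uuid

import boto3

table = boto3.resource("dynamodb").Table("gtscan-jobs")   # hypothetical table

def handler(event, context):
    """Accept a search request, queue it in DynamoDB, return the job ID."""
    job = json.loads(event["body"])            # e.g. {"gene": ..., "tissue": ...}
    job_id = str(uuid.uuid4())
    table.put_item(Item={"id": job_id, "request": job, "status": "QUEUED"})
    # Downstream Lambdas pick the job up, score candidate sites, and write
    # results back; DynamoDB auto-scaling absorbs one user or 100,000.
    return {"statusCode": 202, "body": json.dumps({"id": job_id})}
```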
Good. With that, the three things I want you to remember from this talk. First, datafication will come for every single data set, including the ones you're working with, and make them much wider, with many more features; so this is, in my mind, a true paradigm shift towards machine learning applications that cater for wide data, not only deep data. Second, serverless architecture can cater for compute-intensive tasks, even though it was originally invented for little tasks like converting speech to text: by designing the architecture truly end to end, we can utilize serverless even for compute- or data-intensive tasks. And third, in my mind business and life science research are not that different, so building a community together, and building tools that can be used in business but also, incidentally, help resolve human health problems, is probably a really good way to go forward. I therefore encourage you to have a look at the two open-source technologies that we have, and if you have time and want to volunteer, please get in touch with me. With that, thank you to our collaborators and the rest of the team, and to you for listening.