Welcome back. Have you ever had a DNA test? I had one recently, and I was very surprised by the results. I'm part Scottish, it seems. Well, that's not so surprising, I guess. But actually, I'm part Basque, too, and that really was a surprise. DNA testing is just one example of the vast amounts of data that we have all become used to having at our fingertips. But although we are storing, retrieving, processing, and visualising large amounts of data, sharing that data has remained a little more limited, perhaps for bandwidth or security reasons. In this next talk, Frank Munz, staff developer advocate at Databricks, will be sharing data with us: his own data, his DNA test. He assures us that no harm will come to him. So no cloning of Frank, please. Frank, how are you? Nice to see you.

I'm very well, thank you so much. That was certainly the most professional introduction I've had all year. So welcome, everyone, to this session. I guess it's a good afternoon to most of you, if you are in Spain or somewhere in my time zone. It is, it's early afternoon, Frank. Cool. So this is my first time at Big Things, and I'm excited; I hope you're excited, too. Actually, this session was planned to be a little more hands-on. I thought I could be in Spain and we would do a mini hackathon together with my DNA; that was the real plan. And then, you know, other things happened. My intention was to show you that it's just so easy to create a client for Delta Sharing. Anyway, now I am doing the whole work and you can just sit back and listen.

So before we dig into my DNA, I want to give you a few more examples, setting the stage and explaining other situations where Delta Sharing, or sharing huge amounts of data, makes sense. Let's start with the thing that destroyed my idea of having this mini hackathon, where I would share my DNA and you would code the client for reading it and finding out about genetic traits. It was January 5th, 2020, when the first version of the genetic data of COVID was uploaded. It was named the Wuhan seafood market pneumonia virus. Now, the interesting thing is that it wasn't soft, wet tissue that was cut out of the lung of a sick patient. It was a digitally transmitted sequence of RNA that was sent to this, you know, tiny little village in Germany where they produced the Pfizer-BioNTech vaccine. So sharing data, in the end, is what triggered scientists all over the world to work on the vaccines that we have today. And that was not a singular event. Today, sharing genetic data is a very standard thing: huge numbers of samples are exchanged frequently, and they are used to construct those trees that show the mutations of the virus that is causing us so much trouble.

Another example: three years ago, I was speaking to an oncologist. Don't worry, I wasn't sick, nobody was sick in my family. I was invited for a lab tour, and the oncologist explained to me that he had accumulated three petabytes of genetic data. He is a specialist for leukemia, for children suffering from leukemia, and for every child he was treating he was also sampling the DNA, so he ended up with three petabytes of data. This data he obviously wanted to share with his students, he had about a dozen PhD students at any given point in time, and he also wanted to share it with partner institutes, you know, for scientific exploration. That's a sample, the raw data file that I generated.
So indeed, I got my DNA genotyped. It's a very easy procedure. It costs a hundred-something bucks, and what you do is spit into a test tube. I have a photo, but I didn't put it here because I thought it's maybe not appropriate. And if you think about it, the first sequencing of the human genome cost about three billion US dollars and took 13 years. I don't remember exactly, but I think I waited a couple of days and then I got the results back. And in the end it's this, you know, tab-separated file, and that genotyped person, that's me, that's what you see here. It's the same machine that you saw in the previous picture, the one that looks like a big washing machine. It costs one million US dollars, and this professor of oncology had 12 of them in a row.

Now, the question is: what do I, with my plan for a mini hackathon sharing my data, and this oncology professor from Munich, and the Wuhan scientists have in common? We're all trying to share data in an open way, in an efficient way, in a safe way. And if you look at the options that we have, it looks like this; I tried to create this table for all of you. The left column is probably the least interesting one: vendor-to-vendor sharing. This is where you need to pay for another license and where you need to create an instance of that vendor's system. That's not open, and that's why we're typically not interested in it. If you go one to the right and look at FTP, it's a very basic protocol, isn't it? It is certainly multi-cloud; it's such a low-level protocol that it would work from AWS to GCP to Azure. That's not the issue. But what you're doing is offloading files onto an FTP server, so that's not very exciting. It's not live data. What I want to show you, even though I'm not using it for the DNA example, could be live data that is transactionally safe. FTP is old and not very useful if you want it to be secure and cloud-scale as well. And if I give you the keyword cloud scale, you would probably say, well, how about uploading files to S3 and then sharing the S3 URLs that you get for an uploaded object? That's not a bad idea, but you're still at the level where you're dealing with blobs, with the key-value pairs that you store in S3. You have no abstraction of a table, no table that you could update.

If you project all these systems I talked about to the right, you get Delta Sharing, open source Delta Sharing, and this is what ticks all the boxes. It's open source, it's an open format, which means it's vendor independent and everyone can use it. It's multi-cloud. I was actually just having a discussion with one of the big cloud vendors, and they said, well, you're marketing this as multi-cloud, and I said, yeah, but you can't blame us for that. It's just so basic and yet very powerful, and this is why it's multi-cloud. It works with Pandas, it works with Apache Spark on the client side, and this is what I'm trying to show you. And you can host it yourself, or you could use a hosted cloud service that will make your life much easier. Now, in this presentation I want to focus 98% on the open source solution, and this is actually what you do: you go to delta.io, which gets you to the lakehouse.
Now, the lakehouse. First of all, there's a whole video in the video section of this conference where you get a complete walkthrough of the lakehouse, and I highly recommend you check it out, because I have limited time to talk about the lakehouse here. But to simplify things: the lakehouse is the most popular data management platform that you hear about these days. It was pioneered by Databricks, but now you hear everyone, from AWS to Oracle, talking about the lakehouse. It's open source; this is the site, delta.io, and you can look at the source code anytime. It gives you the best of both worlds: you have the SQL access, the schemas and the schema evolution from your data warehouse, and you have the cheap, highly available, highly scalable, highly durable cloud storage from the cloud providers, like S3 for example. And it really is the best of both worlds: we just won the world record on the TPC-DS benchmark, which shows that this architecture performs best. So the takeaway is that we are serving data from the lakehouse.

And it looks like this. Delta Sharing doesn't look very complicated, does it? In the end, as I told you, we have a lakehouse, we have data in a data lake. It's a lake-first approach, so your data stays in S3; you don't have to move it. We have the Delta Sharing server, which you can host yourself, and I'm going to show you the options in a second. You can either take a pre-built Delta Sharing server that you download, unpack and just run, and that's what I'm going to do in the demo, or you can run it as a Docker container by just pulling the image from Docker Hub, or you can use a cloud service, or you could talk to a backend that we host for you, a kind of reference implementation that is running at Databricks at the moment. So there are lots of options for the sharing server. For the data, it could be S3, it could be ADLS Gen2, and there are many other options; it's open source, so it can be extended easily. And for the front end, well, it could be one of the most popular open source big data platforms such as Pandas or Apache Spark, or, because it's a very simple REST-based API and protocol, it could be any commercial client. I think just yesterday Matei, our CTO, announced that Power BI now supports Delta Sharing as well.

And that's an example of the easiest possible client. What I was doing here, and I showed this at the ODSC conference, which is why I'm not showing it live again, to save a little bit of time, is how a client looks. This example already shows you multi-platform, and it also shows you multi-cloud. A lot of people talk about multi-cloud, pro and con; the fact is that we can easily do it, and this was a Colab client running on Google, talking to the backend on AWS. What you need to build a client is this profile file. The profile file in this example comes from GitHub, because the whole example is checked into GitHub, and it looks like this: it contains an endpoint, which is the URL of the server, and it can contain a bearer token, which is used for authentication. Now, using this profile file, we build a sharing client, and with that client we can list all the existing tables; this is what you see here. And then, once you know the table, you use the profile file plus the share, the schema, and the table name to retrieve the table. So what we do here is load_as_pandas, and it could also be load_as_spark, roughly along the lines of the sketch below.
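For reference, a minimal sketch of such a client could look like this, assuming the open source delta-sharing Python package; the endpoint, token, and share, schema, and table names are placeholders, not the ones from the demo:

```python
# pip install delta-sharing   (the open source Delta Sharing Python connector)
import json
import delta_sharing

# Profile file: the sharing server endpoint plus a bearer token for authentication.
# Endpoint, token, and names below are placeholders.
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://my-sharing-server.example.com/delta-sharing/",
    "bearerToken": "<token>",
}
with open("my.share", "w") as f:
    json.dump(profile, f)

# List everything this recipient is allowed to see.
client = delta_sharing.SharingClient("my.share")
print(client.list_all_tables())

# Load one table as a Pandas DataFrame: "<profile-file>#<share>.<schema>.<table>"
df = delta_sharing.load_as_pandas("my.share#myshare.genomics.genome_frank")
print(df.head())

# On a Spark cluster you would instead call delta_sharing.load_as_spark(...)
```

That really is the whole client: a profile file, a sharing client to discover tables, and one call to load the table as a DataFrame.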
So these are the two options you have: either work with Pandas or work with Spark. Then I have a Pandas DataFrame. Remember, one of the key takeaways for open source Delta Sharing is that we work with DataFrames, not with the file abstraction, where there could be hundreds of thousands of files behind one DataFrame. And then I do the Pandas magic: I say, you know, DataFrame dot plot, and plot the data.

All right, in reality this looks slightly more challenging, but it's not very complicated. Let me just talk you through it again. We have the Delta Sharing server, which I will start in a second. The interesting thing is that the Delta Sharing server is authenticated against S3; that means I have an IAM instance role assigned to my Delta Sharing server, and this is why it can access the data on S3. Now, the client talks to the sharing server, but the sharing server does not return the data; that's a very important point, the sharing server is not a bottleneck for the data. Instead, the sharing server returns a pre-signed URL that is typically short-lived, so it could be a pre-signed S3 URL, and the client then uses this URL to access the data on S3. Between the client and the sharing server we have this REST protocol, which is open and well documented in the API documentation on GitHub; I will show you that in a minute. And then, remember the profile file that I talked about and already showed you an example of: that's what I use on the client side, with the endpoint and the bearer token, to talk to the server. On the server side I have a config YAML file, and in this config file I define the mapping: how do I get to my data that is in the S3 bucket? And remember, I said it's a lake-first approach, so you don't have to move your data out of the lakehouse.

Right, and then we should look into my data; this is what I promised you. That's the raw file that we have. It actually is a TSV file, a tab-separated file. It contains the SNPs, which are locations in the genome that are known to vary between individuals and that carry some interesting genetic information; this is what I want to dig into a little more deeply. And you see here, for every SNP, for every location or rsID, there is a genotype, like AA. Now, this A and this A means two bases: adenine and adenine again, so it's double adenine, and one is coming from my mom and one is coming from my dad. Hi mom, hi dad.

I think, let's see where we are, and this should get us to the demo. Hang on just a second. Okay, so you should see my browser now. First of all, I told you about delta.io; this is how you get started with the lakehouse, build the lakehouse, all open source. In the end it's just managed Parquet files, so it's open format, open source. If you want to go to Delta Sharing, you then click through to GitHub; that gives you all the code, and it also gives you all the instructions that you need to get started. And I search for --config, because that reminds me how to start the server; this is the way to start it, and that's what I want to do in a second, I just wanted to show you how to get there. Then remember, I told you we have this lake-first approach, so I have my data in a data lake on AWS. That's the AWS console; click on S3 and see what happens. You see here are my buckets, and that's a bucket that I created, it's called delta FM, and there are several folders, and there is one for genome frank. And remember, I told you that the lakehouse in the end is just managed Parquet files.
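As an aside, this is roughly what the server-side config YAML mentioned above can look like, a minimal sketch assuming the format of the open source reference server; the share, schema, and table names and the bucket path here are made up:

```yaml
# delta-sharing-server.yaml (sketch): maps shares/schemas/tables to data locations
version: 1
shares:
- name: "myshare"
  schemas:
  - name: "genomics"
    tables:
    - name: "genome_frank"
      location: "s3a://my-bucket/genome_frank"   # hypothetical bucket and path
authorization:
  bearerToken: "<token>"    # the same token the client puts into its profile file
host: "localhost"
port: 8080
endpoint: "/delta-sharing"
```

You then point the pre-built server at that file, with something like bin/delta-sharing-server -- --config followed by the path to the YAML file.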
And this is where you see the raw data files, which are actually not too interesting, because I promised you that we will abstract away all this Parquet data behind a table, behind a DataFrame. Okay, so let's go there. Hang on just a second, I'll change to my terminal. All right, so that's the EC2 instance; if I run top here, you can see this is running on EC2. These are all the files that I have, and what I want to do is start the Delta Sharing server, oops, hang on, sorry, with the command that I was just showing you. I start it like this, and it's starting up. I have a second tab here, and in this second tab I want to create a tunnel: the server is running on EC2, and because I don't have my browser running on EC2, I want to tunnel to the server, and this is what I'm doing here with SSH. And then I have a third window. Remember, I wanted to show you everything in open source, so I start PySpark, which in the end just starts my Jupyter notebook server like this. It starts up, I copy the URL, quickly check that it looks good, and now I need to change back to the browser window, which is here. In this browser window I paste the URL, and here we go: that's Jupyter, open source, running and talking to the Delta Sharing server.

I navigate to my Jupyter notebook and open it. First step: I need to install the delta-sharing library. I hit Shift-Return; that looks good, it's installing the library. I double-check the share file that I already showed you; it points to localhost, which works because I'm using this tunnel, and it uses the Delta Sharing endpoint. I can list all the tables that I'm allowed to see; let's do that, it takes a second, and it gives me all the tables. Remember the buckets that I have: now they're represented as tables, I'm not seeing the individual files anymore, and this is exactly what I want. Then I can run some analysis and say, well, load the genome data, which is happening right now. So I use the profile, as I explained before, and the naming scheme of share, schema, table name. It's loaded already, that was a previous run. You see all the genomic data, it goes up to 638 thousand rows, and you see the genotype.

Then I was playing around a little bit. I did a histogram of, you know, the genome: remember we have these genotypes like GG and AG and AA, and the most frequent one is TT, you see it here. I did another histogram just for fun, because I could, and it shows you all the chromosomes, which, if you remember from biology, go up to 23, plus one more for the mitochondria, the power structures in your cells, which have their own separate chromosome. Now, the interesting thing is this one: a small function that lets me look up a certain rsID in the genome. If I use this function now and go for this rsID, it says GG. And you all know Wikipedia, which is for, you know, knowledge that you need at cocktail parties; well, there is also SNPedia, remember the SNPs. If I click through to SNPedia, it takes me to the SNP with this number, and for this SNP it says: if it's AA, it's brown eye color about 80% of the time, AG, brown eye color, and if it says GG, blue eyes 99% of the time. Let's go back here: it says GG. I'm not sure if you can tell from the webcam, you probably can't, but I have blue eyes, so GG is the genetically correct answer, and I'm very happy that this matches.
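For the curious, this is roughly the kind of notebook analysis being shown, a sketch that assumes column names like rsid, chromosome, and genotype from the usual raw-genotype TSV layout (they may differ from the actual demo), plus a commonly cited eye-colour SNP as an example rsID:

```python
import delta_sharing

# Load the shared genome table (placeholder share/schema/table names).
genome = delta_sharing.load_as_pandas("my.share#myshare.genomics.genome_frank")

# Histograms of genotypes (AA, AG, GG, TT, ...) and chromosomes.
# Plotting requires matplotlib; column names are assumptions.
genome["genotype"].value_counts().plot(kind="bar")
genome["chromosome"].value_counts().plot(kind="bar")

# Small helper to look up the genotype at one SNP (rsID).
def genotype_at(df, rsid):
    hit = df.loc[df["rsid"] == rsid, "genotype"]
    return hit.iloc[0] if not hit.empty else None

print(genotype_at(genome, "rs12913832"))  # an eye-colour-related SNP, for illustration
```

The point is that the shared table behaves like any other Pandas DataFrame, so everything after load_as_pandas is plain, local analysis.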
I used to have, well, that's a long story, I shouldn't tell it here. Anyway, I used to have a trainer, and he once told me, hey Frank, do you know about your coffee metabolism? I said no. He said, well, go and check your genome, that's the number you need to look up for your coffee metabolism. Mine is AA. Now, AA means nothing here; it does not mean that I have an increased coffee metabolism. There was a study done with, I think, about 8,000 people, which is what you see here, and they checked them for this SNP, and it was shown that people with the AT or TT genotype have an increased coffee metabolism. That means they digest coffee more quickly, so the effect of coffee goes away more quickly, and that's why those people typically drink more coffee. So my trainer was worried, but there's no genetic reason why I'm drinking as much coffee as I drink.

Then there is this one, which I think is more the Instagram thing; it kind of defines your dopamine levels. Mine is AG, which means I'm right in the middle, which is cool: I don't worry too much, but I also don't want to be on the GG side. For the GG expression it says higher pain threshold, that's not bad, and better stress resiliency, also okay, but then it goes on with "albeit a modest reduction in executive cognitive performance", and that's not what I want, so I'm super happy to be in the middle. The last one is the photic sneeze reflex: do you have to sneeze if you look into light? Mine is CT, which again is in the middle, so I don't really have a disposition for having to sneeze.

Now, all this, I think, is a lot of fun, and I hope you agree, but as I told you, there are many more serious applications, starting from sharing the RNA data of the coronavirus to sharing data that is used to create more specific treatments for children suffering from leukemia. And all this that I did here, I actually recommend you don't try to repeat at home. You might get results that tell you, you know, you're more likely to suffer from disease X, Y, Z by a factor of whatever, and maybe you're not prepared for that result. It's a fun experiment; I spent many years deciding whether to do it or not, and in the end I did it. I like this demo, it's my real data, I promise you, and I shared it with Delta Sharing. It is tunneled now, but obviously I could open that tunnel to the whole world, and that was my plan for Big Things Spain: I thought you would just go and create this client, which is extremely easy. If we go back to all this: you do pip install delta-sharing, then you need the profile file, which I would give you, then you create a sharing client, list the tables, and then, and that's the core thing, delta_sharing.load_as_pandas, and that's it. And then you could start and try, you know, go to SNPedia and check for other traits, find out more stuff and check my genome if you want. I've limited it a bit, I'm not sharing everything, but it would be good enough for a mini hackathon; maybe we can do that next year in Spain.

Anyway, let me try and change back to my slides. Right, so I've done the demo, and I think you understand how super easy it is to create such a client. I was using Pandas, I could have used Spark, and you could use the REST API, it's a REST protocol, to implement any kind of client that you're interested in, any commercial client or any open source client; we're super interested in people who want to contribute to the project. As I said, it's working for Power BI,
it's working for Tableau. There is one more thing I want to point out. You haven't seen all the setup work that I did, you know, to download the server and unpack the server; I showed you how to start the server, that's correct, but we haven't looked at the YAML file. We could do that, but it basically shows you the mapping from the share to the S3 bucket, like the sketch earlier. So you need to configure the YAML file, the configuration file, and in the end that's quite a bit of work. I'm not saying it's not possible, but that's the kind of thing you do when you work with open source software. What we have done is build this into the Databricks account. So if you start a Databricks notebook, all you need to do is start the notebook, and then you can create a share, you can alter the share and add a table to that share, and then you can create a recipient, and that's what I'm doing here. A recipient is, in the end, someone who will then get the grant privilege to access the shared data. And when I create the recipient here, I get a URL, and with this URL the recipient can download the share file that I showed you, which contains the endpoint and the credentials, and with this share file they're then able to connect to the shared data. So the whole notebook approach, and you're free to implement this with the open source software in a very similar way, abstracts away the downloading, configuration, operation, patching, and upgrading of the server. So for a company, it's a pleasant and easy way to share data.

And sharing data, I think, is one of the biggest topics these days. I showed you the scientific angle; a lot of people talk about data meshes, which is basically two things: sharing data, and having a governance model across all your data that limits what kind of data you want to share. As I told you, I was limiting the amount of data from my DNA that I wanted to share.

And with this I want to conclude. Delta Sharing is a platform-independent, open source way of sharing massive amounts of data, and I gave you some examples of massive amounts of data; remember the oncologist with three petabytes. The Databricks workspace abstracts away, you know, all the operational overhead of running your own server, and it's abstracted away as SQL. When I saw this SQL for the first time I was super fascinated; I thought, wow, I'm not an SQL person, but you know: create a share, add the table to the share, then create a recipient, and then give this recipient the grant to access the data. That's just as easy as it gets; I'll show a rough sketch of those statements in a moment. The clients, well, we talked a lot about the clients, because I was using this Jupyter client to dig into my DNA; it could be Pandas, Apache Spark, any open source client will do. By the way, again, the example that I showed you was not using anything from Databricks; you could have a notebook, a managed notebook.
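To make that concrete, here is a rough sketch of the few share-management statements described above, assuming Databricks' Delta Sharing SQL syntax; the share, table, and recipient names are made up for illustration:

```sql
-- Create a share and add a table to it (hypothetical names).
CREATE SHARE genomics_share;
ALTER SHARE genomics_share ADD TABLE genomics.genome_frank;

-- Create a recipient; this returns an activation link from which the recipient
-- downloads the profile (.share) file with the endpoint and bearer token.
CREATE RECIPIENT hackathon_team;

-- Grant the recipient read access to the share.
GRANT SELECT ON SHARE genomics_share TO RECIPIENT hackathon_team;
```

That is the handful of statements that replaces downloading, configuring, and operating your own sharing server.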
Our CTO was showing this when we released open source Delta Sharing; I think he used EMR, and it was the same three lines of client code to connect from Amazon EMR to such a share. I talked about the commercial clients, and the first example that I showed you, the multi-cloud example, was talking from Google Colab to the reference implementation that we have on AWS. Hopefully I was able to show you that we try to simplify things, where "things" is data and AI; this was the data part. Remember the notebook approach, where it takes you about three SQL commands to create such a Delta share. And with this, well, I told you: don't do this at home, it can backfire, as I explained. Be careful, talk to your doctor first, or just, yeah, rewatch this video, that's much easier. And with this I want to share some links: if you want to read more about Delta Sharing, I've published how-to tutorials on Medium, there is a lot on GitHub, and I'm on LinkedIn. I will upload the slides to Speaker Deck, so you can access them and have a look at them again. That's it from my side; I'm happy to answer any questions. Thank you so much.

Do you feel okay? That's the first question: no harm has come to you, right? No harm happened; I told you, it's read-only. Actually, do we have one more minute? Then I'll tell you one interesting fact. When I visited this oncologist, he told me that he is writing into DNA. Do you know what they do? They have a test tube to analyze the DNA of the cancer patients, and they throw in blood samples from, I think, 12 different patients. And I said, how do you know which DNA part belongs to which patient? And he told me: you know what we do, we add a unique identifier to the beginning of the DNA; they mimic the biological process, the way the DNA is encoded, to add this unique identifier. So they write to DNA. I'm not writing to any DNA; I'm just sharing a few snippets of my DNA with people who want to build a client.

Unfortunately, we can't confirm with the webcam whether your eyes really are blue; we'll just have to take your word for it on that one. I can see they're not brown, but I'm not 100% convinced, so we'll take your word for it. Okay, we'll try that again maybe next year. I was particularly interested in that coffee metabolism part, and I'm wondering, I think I've had about four coffees today, so I think I'm definitely different from you on that one. Look at that. All I can say is, well, you can take this little snippet from my notebook, it's on GitHub anyway, and then look up your own coffee metabolism and find out whether it's genetic or not. So I'm drinking all the coffee in the world, but it's not genetic; these are a few of the things we can't blame our parents for. As you said, this is a lot of fun, but there's a more serious side to it as well. I've actually looked at my own DNA data, and yeah, I got a few shocks about my propensity for a number of illnesses, but on the other hand, for some I was worried I might be at risk of, it seems not so much, so it cuts both ways. But you're right, let's be cautious when we look at this stuff. Just for your information, some comments from people who have been watching: there's one in particular here from Alex, who says that this has been a great demo, very interesting, so some very good feedback for you from Alex. So if anyone here wants to share their data with Frank, the best way to do it is via the website, and of course they can contact you directly; I'm sure a lot of people will be very interested in following up
either with you directly or by looking into those links that you've provided for us. So thanks very much, Frank, it's been a fantastic talk, thank you so much indeed, and I'm looking forward to hopefully seeing more from you, perhaps next year. Hopefully. Thank you for the invite, it was a pleasure to be here. Great, thank you so much, Frank.