A lot of empty chairs here. Oh, I guess it's time to start. All right, thanks everyone for coming. It's a pretty big room for the size of the audience, but that should be OK. My name is Michael A., and I'm here to talk about how we automated the process of creating a Spark cluster that is secure and HIPAA enabled. This work was done at IBM in collaboration with some of my colleagues, notably J-Rom, Shu, as well as Daniel Dean.

It's probably not much of a surprise to most people in this room that a lot of personal information is being collected, stored, and processed by various companies. Some of it is health information: our glucose level, our heart rate, results from medical labs, what diseases we have, what medications we take. Beyond health information there's also financial information, like how much money we have in the bank — or how little — as well as our investment portfolios, and so forth.

To make use of this information and generate value from it, companies typically deploy an analytics back end that processes data, either in real time or together with the data they've amassed over the years, to generate value and predictions from it. In a lot of these cases you have a back end that processes information that can be very private to us. The problem is that even though this data can generate a lot of value for a company building interesting applications, it is also a security concern. So governmental regulations have been created to protect this data. In the health industry these include HIPAA and its extension, the HITECH Act; in the financial industry there's PCI and various other kinds of regulations. What a lot of these regulations try to do is protect the privacy of the data being generated and govern how that data can be accessed by others. Many of them also put rules into effect to help companies and regulators audit how data is accessed, so that in case of a breach you can figure out what data was leaked, how the leak was caused, and who accessed the data last. All of that is really important for the auditing capability of the system.

What I want to talk about today is really a combination of two things: how a regulation like HIPAA affects how analytics back ends are designed and architected, and how to automate building them. As developers and administrators, when you create these back ends for analytics purposes, you have to take a lot of security and regulatory concerns into account in order to build out a platform that is compliant with these regulations. Building it out manually is actually very challenging, and there are essentially three reasons, which I've listed on the slide. First, the regulations are written by a bunch of lawyers.
And being able to translate those regulations into specific security mechanisms to implement is actually pretty challenging, because it's not clear how to map what these general regulations mean one-to-one onto particular mechanisms in your system. That's one aspect. The second aspect is that even once you have these mechanisms, each mechanism has a lot of options and knobs you can tune that affect how it behaves. There are things like different types of user accounts, how to maintain access control, and a lot of keys, certificates, and databases that have to be maintained in order to keep a fully secure environment. The third aspect is not really about the environment itself, but more about ops capability. A lot of the time, when you build out this system and deploy the secure platform for other people to use — maybe within your own company for testing — building it manually is not very effective: different groups who also want this platform to test their own solutions against the back end will force you to manually redeploy everything, which is time consuming and can lead to a lot of failures. So you want a mechanism that automates this process for you: every time you spin up a secure back end, you click a button and you get a new cluster with all of these compliance mechanisms in place.

So the approach we took was to leverage OpenStack Sahara to give us the capability of automating and configuring the back-end platform. The crux of the solution is to expose secure building blocks that users or administrators can choose and select in order to build out their secure platform. Given that ability, you can then automate the process of creating an analytics back end that is compliant with some regulation. What we did was prototype this mechanism in Sahara by creating a Spark cluster for our analytics purposes, running on top of YARN — you'll see later why we chose YARN, because of some of its security features. In this talk I'll focus just on the HIPAA aspect. I know there are a lot of regulations out there, and the idea is that if you have these secure building blocks, you can hopefully mix and match different security mechanisms to make your solution applicable to a particular regulation.

Okay, so this is the outline of the talk. First I'm going to give a brief overview of what HIPAA is, since that's what we're talking about, then briefly discuss the design and architecture of our secure Spark platform, and then show how the mechanisms we implemented map onto HIPAA. That gives you a bit of an idea of how certain security mechanisms interplay with a regulation like HIPAA. Then I want to talk about some of the Sahara features that we augmented and extended to help with this process. And I'll summarize and give you some lessons and future directions that, I think, the Sahara project can take on and that could be very useful for the rest of us.
So HIPAA in general, for those who don't know, is the Health Insurance Portability and Accountability Act, passed I believe in 1996, to — in very broad strokes — help the efficiency and effectiveness of the healthcare system in the US. It regulated some aspects of how health information is standardized, how health records are kept, and how that information is transmitted, making those things easier and more efficient in some way. It also extended to security protocols to ensure that the privacy, security, and integrity of this data is maintained.

What data is protected here? Essentially any data designated as protected health information, or PHI: what medications you take, what diseases you have, all your measurements, and so forth. And who does this apply to? It applies to any entity providing healthcare, as well as their business associates. This is important for IT providers that consume, store, and process this information — if you touch this information somehow, you are now under the supervision of this particular regulation. So companies that do analytics on this data have to be concerned about it.

In general, HIPAA has two rules: the privacy rule and the security rule. The privacy rule is fairly intuitive, in the sense that it protects the privacy of PHI. The security rule is the one that actually operationalizes the privacy rule. In general it says you need to provide confidentiality, integrity, and availability of the data. Beyond that, the regulation also specifies that you have to take enough precautions to anticipate any potential threat to the system and any potential leak of the data, and to ensure that your workforce and environment are compliant at all times. HITECH took that and extended it a bit, adding requirements to maintain audit logs as well as policies to evaluate the system as it continues to run, to make sure the appropriate level of security is maintained over time.

Focusing on the security rule — the one you can actually do something about — there are three aspects: administrative, physical, and technical safeguards. I'm not going to talk about the administrative or physical parts here; I'm just going to focus on the technical part, since that's something we can do in software. That's also why I titled this talk "HIPAA-enabled" rather than "HIPAA-compliant" analytics platform: to be compliant you have to take into account many other aspects, not just the technical ones — there are administrative aspects too. Within the technical safeguards there are four things: access control, audit controls, integrity controls, and transmission security. You'll see later on, when I describe the platform, that I'll go through the different mechanisms used at different levels and show how those map onto the different controls specified in the regulation.
Let me pivot a little bit here — I just very briefly went through what HIPAA is — and describe the analytics platform we developed, put it into a particular context so you understand why we built it the way we did, and then give you the mapping between that platform and HIPAA.

The goal of this platform is very general, in the sense that we want to be able to plug this Spark analytics platform, pretty much as a service, into any larger platform that processes protected health information. This is the analytics back end for that piece. There are a lot of other pieces — storage, a data lake, authentication mechanisms for entering the system — but at some point data will flow in and the analytics engine will have to pick it up and process this very protected data. So this is an independent piece we want to build out. It is secure in the sense that it may have more mechanisms than you strictly need to be HIPAA enabled, but it helps the developer: you can take this piece and plug it into another platform that requires this particular security, and it's already there. That's also why I think the ability to build small building blocks of security mechanisms that are pluggable into your system is helpful — if you have a different platform, with different assumptions about how security works, you can remove or add mechanisms as you go. That could be very useful.

Some of the assumptions I've listed here: the system has essentially two types of user. One is the administrator, who can go in and, using Sahara, stand up this secure analytics platform. Then there's the user of the platform, who just makes use of the cluster for analytics. He or she doesn't need to know what the mechanisms underneath are — he or she just submits a job and runs analytics on the data, and the whole process is secure, with different users isolated from each other. We assume that within one analytics cluster there are multiple users. There could be one hospital, but many doctors in that hospital can use this analytics cluster to process information for different patients, and each doctor can have different access controls — maybe Dr. A can only access one group of patients and Dr. B another group, but not the other way around, for instance. So there are isolation mechanisms in place for the users within the Spark cluster as well. We also assume there is essentially one YARN cluster running Spark per tenant: when a different hospital comes in, we spin up a different cluster, separate from the previous one. So it's not multi-tenant in that way — even though the definition of tenancy can be different for different people, that's how we framed it in our particular use case. Finally, for any data that's external to the cluster — say a data lake that stores information about patients — we assume there's a driver that can securely ingest that data.
So essentially there needs to be a driver, implemented in Spark or something Spark can leverage, to pull data in from the lake. From that point on, our system takes over to make sure the data, when analyzed, is protected and compliant with the regulation.

All right, here's the design of the Spark service we have. It's broken into three layers. The top layer is the Spark cluster; that's where all the Spark analytics services run. The different users' Spark jobs run in isolated containers. In YARN these are called secure containers, and what that really means is that the container executing the job runs under the user ID of the job submitter. Often when you build a Spark cluster there's essentially one user — say, spark — and every job submitted to the cluster runs as that spark user. That's fine in some sense, but not great when you want isolation between jobs: if you have multiple users, you need multiple user accounts so you can actually enforce access control across the different jobs. Secure containers are great for that — they give you the ability to start containers under the user ID of the job submitter.

Below the Spark cluster is the YARN resource manager. It has different components in there, which aren't too important here, but one important thing is Kerberos: the YARN components are tightly integrated with the Kerberos authentication mechanism. This lets you force authentication of the different pieces — the host itself, the VM that is launched and associated with the cluster, as well as the framework running on top of YARN, in this case Spark. We want to make sure some other user can't go into your cluster-creation mechanism, spin up a VM, claim to be running a YARN process, and attach to your YARN cluster without being authenticated — because then you'd have a rogue YARN node that can participate in and consume all the data flowing through your original secure cluster. So you need authentication there: you authenticate the machine, the VM, and the framework running on top of that VM, to ensure there's isolation and security across the entire stack.

All of that is provisioned by the Sahara service. We use Sahara to automate the process of spinning up this cluster with all the security mechanisms in place: all the nodes are set up with SSL encryption and the Kerberos authentication mechanisms, and we extended it so that different user accounts are created for a particular tenant. All of that has to be there for this to work.

Okay, so that's the layout of how our Spark cluster is designed. Now let me go back and map some of the security mechanisms I just described onto the HIPAA regulation. This table breaks things down by the four controls I mentioned earlier — access control, audit controls, integrity, and transmission security. It's probably not mind-blowing in any way, but you can see the mechanisms I've listed for each control.
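As a rough illustration of what turning on some of those mechanisms looks like at the configuration level, here is a minimal sketch of the kind of Hadoop/YARN properties a configuration step like ours would write out on each node. The keytab paths, realm, and the small rendering helper are illustrative assumptions, not the exact values from our deployment; the property names themselves are the standard Hadoop ones for Kerberos authentication and secure containers.

```python
# Illustrative sketch: properties a configuration step pushes into
# core-site.xml / yarn-site.xml on each node. Paths and realm are placeholders.
secure_yarn_conf = {
    # Turn on Kerberos authentication and authorization for the Hadoop stack.
    "hadoop.security.authentication": "kerberos",
    "hadoop.security.authorization": "true",

    # "Secure containers": run each container under the UID of the job
    # submitter via the LinuxContainerExecutor instead of the default executor.
    "yarn.nodemanager.container-executor.class":
        "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor",
    "yarn.nodemanager.linux-container-executor.group": "hadoop",

    # Kerberos identities for the YARN daemons themselves.
    "yarn.resourcemanager.principal": "yarn/_HOST@EXAMPLE.REALM",
    "yarn.resourcemanager.keytab": "/etc/security/keytabs/yarn.keytab",
    "yarn.nodemanager.principal": "yarn/_HOST@EXAMPLE.REALM",
    "yarn.nodemanager.keytab": "/etc/security/keytabs/yarn.keytab",
}

def to_hadoop_xml(props):
    """Render a property dict in the <configuration> XML format Hadoop expects."""
    body = "".join(
        f"  <property><name>{k}</name><value>{v}</value></property>\n"
        for k, v in props.items())
    return f"<configuration>\n{body}</configuration>\n"
```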
In terms of access control, for instance, there's Kerberos for authentication of the different components and secure containers for isolation of different users. There are HDFS file permissions, which you need when you write and store data for a particular user, and HDFS encryption zones to encrypt the data at rest. So encryption happens both at the disk level, when data is stored, and during transmission. You need transmission encrypted as well, because Spark does shuffling — it sends data across different nodes for processing — so that path also needs to be encrypted, and any data that gets generated or cached along the way also needs to be written to directories with encryption on them, to make sure no data is ever exposed. I've sort of skipped over what happens to data while it's in memory, where it's unencrypted — there's actually an issue there, and it depends on how paranoid you are. You can add mechanisms that encrypt data in memory as well; there's usually a performance penalty, but you can do it if you want to. That goes back to my point that there are all these different mechanisms, and a regulation like HIPAA never specifies exactly how to implement them. It just tells you that you need to keep things secure and maintain isolation, but it doesn't tell you how, and you can be as paranoid as you want. So having a mechanism like the one we built on Sahara, which lets you select different levels of security, can be very useful: it helps you deploy your system in a way that lets you test the functionality and compare the performance you get against the security level you get.

I won't go through the rest of the table because some of it is fairly intuitive, but one thing I do want to point out is audit controls. We already have the ability to log what jobs are submitted and what files are accessed, but one gap is the framework accessing the data — in this case Spark. When Spark ingests data to process it, Spark itself right now has no mechanism to log what data it ingested, what RDDs it formed, what pieces of data it touched, or when a piece of data was written out to disk for caching and so forth. None of that is logged. So what we did at IBM Research was extend Spark with a logging mechanism that records which pieces of data were read from HDFS and sends that to the auditor, so it can be part of the overall audit log — to make sure that every access to every piece of data is logged, essentially.

Okay, so let me now touch on the Sahara pieces we implemented to make this possible. There are essentially four pieces of work here. One is automating the security enablement. Another is an extension of Sahara that enables adding different users to the cluster — that's not quite there yet, and it's debatable whether it belongs in the Sahara API at all, but we extended the API to give us this ability. The third and fourth are the ability to submit different types of jobs to one particular YARN cluster.
I know this exists in the Cloudera version of the plugin in Sahara, but the vanilla version of Sahara running plain Hadoop or YARN doesn't currently make it easy to submit both a Spark job and a Hadoop job — you can do one or the other.

So, just in case people aren't familiar with how Sahara works or how you spin up a cluster with it: Sahara works in a VM image-based way. Essentially, you take a VM and create a disk image filled with the binaries you want to run in your cluster; an administrator does that out of band. You upload it to something like Glance. Once you have that, you use the Sahara interface to define what your cluster looks like: I want a cluster of, let's say, five nodes, each node with a certain VM flavor and size, and these are the processes I want to run in those VMs. Once you have that template, you go to Sahara to deploy and provision it. The provisioning engine of Sahara essentially just leverages Heat: it generates a Heat template and uses Heat to create and spin up the VMs for you. Heat talks to the different OpenStack components — Nova, Neutron, Glance, and others — to create those VMs. Once that is done, there's a last step where Sahara comes into play again: it SSHes into each of the VMs and configures them in the way appropriate for your particular deployment. If you want to run Hadoop, it goes in, makes some configuration changes, and then launches the Hadoop daemons to finish the process. In some sense it's not very clear how much you want to bake into the images and how much you want to leave to configuration at runtime — you could also SSH in, pull the binaries, and then configure. There's no single clean way to set it up, so you have to use your judgment about how much to pre-create and how much to do at runtime.
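To make that flow a bit more concrete, here is a rough sketch of the same steps driven through python-saharaclient. The plugin name, version, IDs, and argument names are illustrative assumptions — the exact signatures depend on the client and Sahara release you have installed — so treat this as an outline of the calls rather than the code we actually ran.

```python
from keystoneauth1 import loading, session
from saharaclient import client as sahara_client

# Authenticate against Keystone and build a Sahara client.
loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(
    auth_url="http://keystone:5000/v3", username="admin", password="secret",
    project_name="analytics", user_domain_name="Default",
    project_domain_name="Default")
sahara = sahara_client.Client("1.1", session=session.Session(auth=auth))

# 1. A node group template: what one kind of node looks like.
worker = sahara.node_group_templates.create(
    name="secure-spark-worker", plugin_name="vanilla", hadoop_version="2.7.1",
    flavor_id="<flavor-id>", node_processes=["nodemanager", "datanode"])

# 2. A cluster template: how many nodes of each group the cluster has.
template = sahara.cluster_templates.create(
    name="secure-spark", plugin_name="vanilla", hadoop_version="2.7.1",
    node_groups=[{"name": "worker",
                  "node_group_template_id": worker.id,
                  "count": 5}])

# 3. Launch: Sahara generates a Heat template, Heat creates the VMs, and
#    Sahara then SSHes in to configure and start the daemons.
cluster = sahara.clusters.create(
    name="hospital-a", plugin_name="vanilla", hadoop_version="2.7.1",
    cluster_template_id=template.id, default_image_id="<glance-image-id>",
    user_keypair_id="admin-key")
```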
So this is the proof of concept we built. I want to show you some screenshots of what our system looks like and the changes we made to the GUI part of Sahara to instantiate the different configuration mechanisms I mentioned before. This is the familiar Horizon interface; we stripped out everything except the Sahara part. You have your typical clusters, cluster templates, node groups, job submission, and so forth — those haven't really changed. What we added was a couple of things. One is that when you create your node group template, there are a few more services here that pertain to security. There's Kerberos, which you can instantiate, and there's the KMS, the key management service in Hadoop, which is useful for HDFS encryption. We also added something not related to security but very useful, I think, when you want to run this kind of analytics platform: IPython, for interactive queries with Spark, and the Spark Job Server, an external project that lets you submit jobs to Spark using REST APIs rather than going through the Sahara interface.

I'll explain later why we used the Spark Job Server — the Sahara job API is missing one crucial thing we needed that isn't supported, and I think that's something we can ask whether the community is interested in adding. The other aspect is more on the security side: when you create a cluster, you can select different security modes. Right now we essentially have only two. You can enable HDFS encryption, or you can enable the secure mode of YARN, which turns on all the authentication mechanisms — between the different components of YARN, and between YARN and the framework, in this case Spark — and sets up encryption between those components as well, making sure the shuffle is encrypted. All of that gets turned on with just that one click: once you select that option when you deploy your cluster, the full mechanism of data-at-rest encryption as well as encryption during transmission is enabled. There's also a Kerberos server location field here; I'll come back a bit later to why there's a field for that, because there are different ways you can decide where the Kerberos server should run. You can place it within the cluster you create, so you have a Kerberos server per Spark analytics cluster, or you can have one that is shared by everybody — there are pros and cons to each scenario.

The last thing we added was extra credentials, because we extended Sahara to instantiate a cluster not only on top of an OpenStack cloud but also on another cloud, in this case SoftLayer, and for that we obviously need credentials for the SoftLayer account. That's what this field is. Since we used Sahara to instantiate on SoftLayer, there are some OpenStack components we still make use of and some we no longer need — this is specific to our particular proof of concept, not necessarily general. We make use of Sahara, Horizon, and Keystone, but we don't use Nova or Glance anymore, because the images are uploaded to SoftLayer's image repository, and we use SoftLayer's API to instantiate the VMs on the SoftLayer account. So from the Sahara portal we define what a cluster looks like, and when we click launch it goes to the SoftLayer account and creates the VMs there. Now you have Sahara instantiating and configuring a cluster of VMs on top of SoftLayer.

This is the flow I showed earlier of how Sahara works, and the only thing we changed is the piece highlighted in the slightly orange color. Instead of the Heat template talking to the different OpenStack components, we now have a new Heat resource plugin that instantiates VMs on SoftLayer by going through the SoftLayer Python bindings. That's the only piece we changed to let us create and launch VMs on SoftLayer instead of OpenStack. The process of setting Sahara up to do what I've just described pretty much follows the normal route of booting anything in Sahara, except for steps four, five, and six. This is what we had to change in Sahara's internal code to enable the configuration and deployment of the security functions in YARN and Spark.
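To give a flavor of what those configuration changes amount to, here is a hedged sketch of the kinds of properties and commands behind the two security toggles — wire encryption for Hadoop/HDFS/Spark traffic, and KMS-backed HDFS encryption zones for data at rest. The key name and path are illustrative, and the exact Spark property names vary by version.

```python
import subprocess

# Encryption-in-transit knobs of the kind the "secure YARN mode" toggle turns on.
# Exact property names differ across Hadoop/Spark versions; treat these as examples.
wire_encryption_conf = {
    "hadoop.rpc.protection": "privacy",                  # encrypt Hadoop RPC traffic
    "dfs.encrypt.data.transfer": "true",                 # encrypt HDFS block transfers
    "spark.authenticate": "true",                        # authenticate Spark components
    "spark.authenticate.enableSaslEncryption": "true",   # encrypt shuffle/RPC traffic
}

def create_encryption_zone(key_name, hdfs_path):
    """Data-at-rest sketch: create a key in the Hadoop KMS, then mark an HDFS
    directory as an encryption zone so everything written there is encrypted."""
    subprocess.run(["hadoop", "key", "create", key_name], check=True)
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_path], check=True)
    subprocess.run(["hdfs", "crypto", "-createZone",
                    "-keyName", key_name, "-path", hdfs_path], check=True)

# e.g. create_encryption_zone("tenant-a-key", "/user/alice")  # illustrative names
```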
So when the cluster comes up, we SSH into these machines and set the various configuration properties needed to enable these features — that's pretty straightforward. We also have to maintain the Kerberos keys and SSL certificates, and creating user accounts and setting up the HDFS directories for each user is automated as part of this whole process as well.

In terms of the image we prepared, what we ended up doing is baking particular binaries into the image: Hadoop, Spark, the Kerberos packages, as well as the Spark Job Server and Jupyter — this IPython thing. Those are the binaries we have in our image. Given that image, we can spin up the cluster and then SSH into the machines using Sahara to configure all the different options we want. Right now this image creation is very manual — I pretty much had to do it by hand. There isn't a disk image builder extension that does this for you; the existing image builder is mainly useful for the OpenStack creation process. To automate this for SoftLayer, you'd need to add an extension that automatically pushes the image up to the SoftLayer account you want to use, because we set up our cloud on SoftLayer rather than OpenStack.

One aspect I want to touch on is that a lot of these security mechanisms hinge on how well you can authenticate your users, and Kerberos is the central key to that. This slide shows roughly how Hadoop security and Kerberos work, but I just want to point out one particular thing, which is how you manage authentication with Kerberos when you're spinning up clusters. Normally, as a human user, you can type in your credentials to log in to the Kerberos system; with processes and services, what's used instead is a keytab. That's essentially an encrypted file associated with a principal's password. The file is stored somewhere the process has access to, and it's essentially the key you use to say, "Hey Kerberos, I am who I say I am," and then Kerberos, through some exchanges, gives you the credentials to access the service you want. So it's a very essential piece of the authentication mechanism, and the keytab is just a file. What we do right now is install this file, per user, in the local file system of all the nodes in the cluster. That's probably only a temporary solution, and it's not great: if you're able to break the account of that particular user, or the accounts of other users, you can get access to their keys. Something that could be useful later on is a key management system, where you keep the keytabs in a store that's separate from the analytics cluster itself. I just want to point that out as an important piece of security that's kind of left out of this design.
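For what that looks like in practice, here's a tiny sketch of how a process authenticates with a keytab instead of a typed password; the principal name and path are illustrative.

```python
import subprocess

def kinit_from_keytab(principal, keytab_path):
    """Obtain a Kerberos ticket non-interactively, the way a service or a
    per-user process on a cluster node does, instead of typing a password."""
    subprocess.run(["kinit", "-kt", keytab_path, principal], check=True)

# e.g., for an illustrative user whose keytab was installed on the node:
# kinit_from_keytab("alice@EXAMPLE.REALM", "/etc/security/keytabs/alice.keytab")
```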
Also, with respect to Kerberos itself, there are different ways to run the Kerberos server, as I mentioned earlier. You can have Sahara controlling several analytics clusters — Spark cluster one, two, up to N — and the question is what you do with the Kerberos server. You have different configurations: you can insert it into each cluster you create, so the Kerberos server is local to that cluster, or you can have a separate Kerberos server, as shown in the figure there, that the different clusters all access. There are pros and cons to each. If you share it, that's more efficient, but then you can run into collisions in the namespace and so forth — there are ways to overcome that, but it requires a more complicated Kerberos setup. If you have a Kerberos server per cluster, you avoid that issue, but you have the overhead of maintaining multiple Kerberos servers. What helps you is Sahara: you don't have to deploy this manually. Once you've set it up and configured it once, you can click a button and deploy multiple identical instances, with whatever number and size of VMs you want.

This slide is a little small and a little crowded, but it goes through some of the steps we took to automate the setup of the secure Spark service. During cluster creation there are multiple things to go through. There are per-node steps: setting up Kerberos, setting up the keytabs per VM host, generating the keys for the different YARN services and pushing those out to the different nodes. You have to set up the configuration for secure containers, which requires editing a lot of Hadoop configuration — all of that is done for you automatically. Then, per user, you also have to configure the principal for that user — each user needs a different principal, which you have to store somewhere — as well as set up the SSL certificates and all that, per user and per VM, and push that out. With the Sahara mechanism you program this once, and from then on it's essentially push-button.

You also have to worry about scaling: Sahara gives you the ability to scale your cluster in and out, and you want to make sure that when you scale out — when another node joins your system and it's a legitimate node — you add the keytabs and certificates for that node and update all your existing nodes so they know this new node is legitimate and can successfully join the cluster you created. You want the thing you push out to be a legitimate node rather than some random node coming up in the cloud and joining your cluster. That's based on an assumption we made that nodes can join your cluster over an insecure network; if you have a setup with some kind of VLAN, a separate network for your system, then maybe that mechanism isn't as useful and you could skip it entirely. It depends on how your network stack is configured. So that's enabling security.
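To make those per-node steps a bit more concrete, here is a hedged sketch of the kind of principal and keytab provisioning that has to happen for every node, including nodes added later on scale-out. The realm, hostnames, and paths are illustrative, and our actual implementation runs inside Sahara's provisioning code rather than as a standalone script.

```python
import subprocess

REALM = "EXAMPLE.REALM"                              # illustrative realm
NODES = ["node1.example.com", "node2.example.com"]   # illustrative cluster nodes

def kadmin(query):
    # Run a kadmin.local query; assumes this runs on the KDC host.
    subprocess.run(["kadmin.local", "-q", query], check=True)

def provision_node(host):
    """Per-node steps performed on cluster creation and again on scale-out:
    create service principals for the host, export a keytab, and push it out."""
    for service in ("yarn", "hdfs"):
        kadmin(f"addprinc -randkey {service}/{host}@{REALM}")
    keytab = f"/tmp/{host}.keytab"
    kadmin(f"ktadd -k {keytab} yarn/{host}@{REALM} hdfs/{host}@{REALM}")
    subprocess.run(["scp", keytab, f"root@{host}:/etc/security/keytabs/"],
                   check=True)

for node in NODES:
    provision_node(node)
```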
There's also the aspect of adding multiple users to a Spark cluster — or any analytics cluster — in your environment. Right now Sahara doesn't give you the ability to stand up a cluster with multiple users. It stands up a cluster with a single user, and every job you submit to the cluster essentially runs as — I think it's the hadoop user by default. That's not very useful when you want to maintain isolation between the different users of your cluster. So we extended the API to add a user through Sahara. This way the administrator can go in — as part of an onboarding process that I'm sure is a lot more complicated than this — and at some point call this API to add a user to the cluster. Adding the user means creating a UNIX account, setting up HDFS permissions and directories, and so forth, so that the new user is isolated from the other users in the system. Concretely, we extended the REST API — the v1.0 API file — with a new call for adding a user. When you invoke this API, all it really does is three things: go to every node and create a new UNIX account, create an HDFS home directory for the user, and distribute the Kerberos keytab for this user to all the nodes it needs in order to run its Spark jobs. One of the challenges we found while doing this was that the assumption baked into Sahara is that when a cluster is created there is only a single user, and that's hadoop. So we had to go through and modify Sahara's code so that the single user isn't hard-coded — we wanted it parameterized, so it's easier later on to extend to different users in the system.
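Here's a rough sketch of what those three steps amount to. The function and the plain ssh/scp calls are illustrative assumptions — the real extension goes through Sahara's own remote-execution utilities — but the underlying commands are the same idea.

```python
import subprocess

def remote(host, command):
    # Run a command on a cluster node over SSH. Sahara has its own remote
    # execution utilities; plain ssh is used here only for illustration.
    subprocess.run(["ssh", f"root@{host}", command], check=True)

def add_user(username, nodes, keytab_path):
    """Sketch of what the hypothetical add-user REST call does under the hood."""
    for host in nodes:
        # 1. Create a UNIX account on every node, so secure containers
        #    can run that user's jobs under the right UID.
        remote(host, f"useradd -m {username}")
        # 3. Distribute the user's Kerberos keytab to the nodes that need it.
        subprocess.run(
            ["scp", keytab_path,
             f"root@{host}:/etc/security/keytabs/{username}.keytab"],
            check=True)
    # 2. Create and lock down an HDFS home directory for the new user.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", f"/user/{username}"], check=True)
    subprocess.run(["hdfs", "dfs", "-chown", f"{username}:hadoop",
                    f"/user/{username}"], check=True)
    subprocess.run(["hdfs", "dfs", "-chmod", "700", f"/user/{username}"], check=True)
```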
The third thing we did was allow both Spark jobs and Hadoop-based jobs to be submitted to the same cluster. There are already ways to submit a Spark job and ways to submit a Hadoop job, but combining the two in the vanilla YARN version isn't there. We did something very simple — we didn't have to extend any API. We allow Spark jobs and Hadoop jobs to be executed in the same cluster by taking whatever was submitted and, based on the job type and the cluster type, constructing a spark-submit string with the appropriate parameters for submitting to YARN, and then submitting that string to the Spark master to run. It's a simple matter, it doesn't take much work, and I think the implementation is pretty straightforward.

The last thing we did was, as I mentioned, run our cluster on SoftLayer. This part was actually done not within this project but some years ago, by different members of the group, who added a Heat plugin to allow Heat to work with SoftLayer. Essentially they added another Heat resource that is a SoftLayer VM, and given that resource and the Python bindings to SoftLayer, you can create a cluster on top of SoftLayer. This is just a screenshot of what had to be extended in Sahara to make use of that feature. As I mentioned before, when Sahara provisions a cluster it generates a Heat template, so the only thing you have to change to make Sahara work with another cloud is the output of that Heat template, so that it uses the resources the template is tied to — in this case a SoftLayer VM. Once you do that, Sahara essentially works with another cloud.

Okay, so let me quickly summarize. I talked about the Sahara requirements and the usefulness of being able to use something like Sahara to automate the configuration and deployment of your system, to make sure the deployment is compatible — or compliant, whichever word you want to use — with the particular regulation you have, and I talked about some ways to extend Sahara to make that happen. A couple of things we learned from doing this: first, it's never clear exactly which security mechanisms you need to employ to make some aspect of your system compliant with a regulation — the regulation doesn't list a particular mechanism you have to use. So having the ability to let your administrators selectively enable different security components is very useful: they can try out different mechanisms and test the performance versus the security they get. Componentized security is very useful, and being able to automate it using something like Sahara is also very useful for the community, I think. Another interesting thing we found is that even with the Sahara GUI, where you can point and click to define your cluster, it's actually very intimidating for a lot of first-time users. So it's very useful to have very simple templates — maybe they don't cover every option, but users can just pick one of those templates and use it directly. Even with the ability to select different options to define a template, we found it's already intimidating for a lot of users.

Some of the improvements we think could be made: first, identify more of the security components that you can isolate and implement in Sahara as essentially a unit of security, and let people enable them through Sahara — that could be very useful. We added an API to add users, but we never delete or modify users through Sahara, so none of those mechanisms exist yet. And one thing I mentioned about job submission through Sahara: once you submit a job, there's no way to retrieve any data or results using the Sahara API. When you submit a job, it's assumed the output goes to HDFS or somewhere, and that's it — the user has to go some other route to get the data. So it would be very useful to have a Sahara API to retrieve the results when the job finishes. All right, I'm going to skip the last point there.

So thank you for listening, and I'll take any questions you might have. So it was perfectly clear and everybody agreed with what I said — that's good. Okay, I'll be around if you want to talk and chat; I'll be happy to take your questions. Thank you.