Hello, welcome everyone to our talk. We are going to talk about something that is really interesting, I think, for anyone putting a product into production. In our experience, when we put something into production using big data technologies, the trickiest part was doing it in a secure environment. Why? Because none of these big data technologies were designed with security in mind. They support it, but it was not a primary concern, and putting all of them together securely is hard. So this is the agenda we will follow for this presentation. We begin with a brief introduction of ourselves, our company and, more importantly, our product. Then Marcos will explain the particular use case that we had to put into production. After that I will talk about the first problem we had to solve: the fusion of Spark, Kerberos and Mesos. Then we will explain how we manage the sharing of all the secrets of the platform, a strategy we call dynamic authentication. Later, Marcos will explain another kind of authentication, mutual TLS, and a very tricky data store that uses TLS: Postgres. I will finish by explaining the last layer of security, which is network isolation. And if all goes well, we want to show you a live demo — cross your fingers, please. If we have time, we will answer your questions. So let me begin with the introductions. My name is Jorge Lopez-Maya. I have been in the computing and software development world for about eight or nine years, and more focused on the big data world for the last six. As you can see from my skills, my primary skill is Spark; I have been involved in Spark development for the last five years, and since I joined Stratio three years ago I have been working with Spark over Mesos. Now I'll let Marcos introduce himself. Hey guys, thank you for coming. My name is Marcos, as you can see.
I'm younger than Jorge, so, well, that's pretty obvious. I've been working in this world for five years, and my main skills are tooling — I like Docker, I like Vagrant, everything that lets me hack on something — and testing frameworks, which is the last point. And, well, since last year I've been working a lot with DCOS, Scala and Spark. I'm a Spark committer, so, well, I like to work with these guys, and I'm more on the QA side, testing performance and so on. So, after Marcos's introduction, I will introduce our company. We work for Stratio, a startup based in Spain, in Madrid. The purpose of Stratio is to accompany companies on their journey through complete digital transformation. And how do we accomplish all of this? Using our amazing product, which is right here — as you can see, or if you can't see it, I will point at it with a laser. It is based mainly on DCOS, the Datacenter Operating System, powered by Mesos, which all of you will know — and if not, you can ask the Mesosphere people who are outside. As you can see from all the Spark logos, Spark is our main distributed processing engine. And every application, or whatever we launch inside the platform, runs in Docker containers. Last but not least, we have a very complex security layer called Stratio GoSec, which is the cornerstone of security in our platform. So, Marcos will explain the use case. Okay, so let's talk about our use case. Well, to start with, let's say what DCOS is. DCOS is an open source distributed operating system. It's based on the Mesos kernel, and it's really, really useful — the things it gives you make it powerful as hell: a network layer, which is really, really good, service discovery, and resource management. You can deploy containers inside it, which is also a good thing nowadays. This is the main DCOS architecture. I won't explain it; it's in the DCOS documentation.
And well, let's talk about our preconditions. We have an important customer that needs everything really, really secured, and the things they ask for are mainly these. First: okay, we need good user profiling, in order to keep data scientists from accessing things they don't need to see, whoever wants to see them — let's make sure of that. Second: the access to these resources must be done using secure connections from end to end. I mean that if you start the interaction with the cluster with, I don't know, an HTTP request — whatever you imagine — from that point to the JVM, or to Python, or to whatever, it must be secured. Third: not all the services are secured using the same protocol. This is one of the most important things we had to work with, because you have Kerberos, you have TLS — there are a lot of ways to secure a service. Fourth: clients don't want to manage their secrets. That's another important point. They want it simple, they want it now, and they want to be isolated from the management of the secrets. And, as Jorge said, Spark is at the core of our processing system, so we needed to integrate Spark with all those ideas above. Now Jorge will tell you how we did it with Kerberos and Mesos. So, as we said before, Spark is our main distributed processing engine. For those of you who don't know it, I will introduce you to Spark and the love it has for Mesos. Spark was actually born as a use-case validation for Mesos. And this is the typical architecture for any Spark application: a master-slave architecture where the master of the Spark application is the driver program, which runs executors on worker nodes. The executors are the ones that really do the work — they execute the tasks inside their processes. All of this is managed by the cluster manager, which in our use case is Apache Mesos.
And for those of you who don't know, Spark can be deployed in both client and cluster mode. From experience, in enterprise environments only cluster mode is recommended. Why? Because in client mode the driver program runs on your own local machine, and if you put something in production you know that this is impossible — you cannot run something in production that runs on your laptop and sends work to a cluster environment. In order to introduce the functionality of Spark running in cluster mode, let me show you an animation of the whole process using Mesos as the resource manager. First of all, we need another service besides the Mesos master and its agents, called the Spark dispatcher. It runs inside the cluster on one of the Mesos agents; in our use case it runs as a Marathon application. The client sends a web request to the Spark dispatcher, which asks the Mesos master to execute one application. And which application? The one that runs the driver itself. For those who don't know, when a driver registers with Mesos, it registers as a Mesos framework. So the driver talks with the Mesos master, starts a Mesos scheduler, and launches all the executors it needs for the lifetime of the Spark application. The tasks are sent from the driver to the executors, which do their magic and read from HDFS or whatever the use case needs. So, okay, this is good. But what happens when we put Spark and HDFS inside our platform? The first thing we have to get right is the security part. For those who don't know, all the projects inside the Hadoop ecosystem, or dependent on it, like YARN or MapReduce or whatever, use Kerberos as the authentication protocol. Kerberos is a very complex protocol, and it's not the topic of this talk. If you want to read about it, I recommend it — it's very tricky and very interesting, but very complex.
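To make the submission flow concrete, here is a minimal sketch of the request a client sends to the Spark dispatcher. The endpoint path and field names follow Spark's cluster-mode REST submission API; the host, jar, class and property values are made-up examples, not our real configuration.

```python
import json

# Hypothetical dispatcher address inside the cluster (Marathon-managed).
DISPATCHER_URL = "http://spark-dispatcher.marathon.mesos:7077"

def build_submission(app_resource, main_class, app_args, spark_properties):
    """Build the JSON body for a cluster-mode submission request."""
    return {
        "action": "CreateSubmissionRequest",
        "appResource": app_resource,      # jar or .py reachable from the cluster
        "mainClass": main_class,
        "appArgs": app_args,
        "clientSparkVersion": "2.2.0",
        "sparkProperties": spark_properties,
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    }

payload = build_submission(
    "hdfs:///apps/my-app.jar",
    "com.example.ReadFromHdfs",
    ["/user/jorge/data.txt"],
    {"spark.app.name": "kerberized-read", "spark.master": "mesos://leader.mesos:5050"},
)
# The client would POST this JSON to f"{DISPATCHER_URL}/v1/submissions/create";
# the dispatcher then asks the Mesos master to launch the driver.
print(json.dumps(payload)[:40])
```

The dispatcher answers with a submission id, which is what the demo later refers to as "driver number 44", "45" and so on.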
So, what does Kerberos do? Kerberos handles user and system identities — the system part is quite important too — and it is built on three main concepts. The principal is the user or service identity, which is nothing more than a string. The keytab is a file containing the keys that are used to authenticate against Kerberos. And, important when you run anything in Hadoop, there are delegation tokens: Kerberos predates the big data world, so it was not designed for big data use cases and their problems. If we used Kerberos directly from every worker, we could take the Kerberos infrastructure down — or at least that is what the Hortonworks people told me at a previous conference. So Hadoop added something called delegation tokens. This is important to keep in mind for the different data stores, because it means this mechanism is only used when Spark talks to HDFS — not for Cassandra or Elasticsearch or anything else. So we had to do something more; I will explain that later, but it's important. So, what did the Stratio Spark team do? We integrated the Kerberos functionality that is not present in the Apache Spark distribution into the Mesos part, so we can run any Spark application on Mesos against a Kerberized HDFS. It is integrated in several Spark versions, and we added a cool new functionality: impersonation in real time. I explained it in a Spark talk earlier this year, but let me show it with an animation. This is a common use case for Spark: we have a long-lived Spark context that is never closed, and we want to read as several different users. The first one is user one, who wants to read a text file from HDFS. First of all, the driver starts its interaction with Kerberos, obtains its tokens, okay? And it impersonates the user.
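As a quick illustration of the three concepts above, here is a toy parser for the Kerberos principal format `primary[/instance]@REALM` (the function and example principals are ours, for illustration only), plus the usual non-interactive login with a keytab shown as a comment:

```python
# Toy parser for the three-part Kerberos principal string described above.
def parse_principal(principal):
    name, _, realm = principal.partition("@")
    primary, _, instance = name.partition("/")
    return {"primary": primary, "instance": instance or None, "realm": realm}

# A service principal carries an instance (the host); a user principal does not.
svc = parse_principal("hdfs/namenode.example.com@EXAMPLE.COM")
usr = parse_principal("jorge@EXAMPLE.COM")

# Authenticating non-interactively with a keytab is typically done with:
#   kinit -kt /etc/security/keytabs/hdfs.keytab hdfs/namenode.example.com@EXAMPLE.COM
print(svc["realm"], usr["primary"])
```

The keytab plays the role of a stored password for the principal, which is exactly why it becomes a secret that has to be managed, as we discuss later.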
As you can see, these lines are in black, so that is one identity, and the ones that read from HDFS are in blue — it is user one who reads this file. The driver sends the task to the executors, which obtain their delegation tokens, play with Kerberos, and read and write information from HDFS using that user's own identity. What happens if the next Spark task must run as user two? With any other distribution you would have to shut down your context and launch another one, but with our solution, by adding one simple variable to the code, we can use another user. That's the quick explanation of it. So, okay — as I said before, this is the end of the animation. As I said before, we have several technologies that work with different security concepts. With Kerberos and HDFS it is the keytab file that determines the user, okay? So, since we manage a lot of technologies, we have to manage a lot of kinds of secrets: if we use HDFS we need keytabs; if we use another store like Elasticsearch or whatever, we use TLS and have to store certificates. So what did we realize? That we needed a vault — some storage to keep all of our secrets. Our Stratio security team integrated a KMS, a key management system, into the Stratio platform in order to store and manage all of these secrets, and access to this KMS is done using tokens. So we have a question: okay, we have secret storage, but how do our processes access it in a secure way? Because, as you can see, all of our Spark applications, or whatever else, run in a distributed way, so we don't control where they land. To be more specific: who keeps the key that opens the secret vault? Or, more graphically, how can we hide the master key from the bullies that want to grab it?
To show you this problem, I have another animation to explain what we have done. As I said before, the dispatcher runs inside the cluster as a Marathon application on one of the Mesos agents. A submission request comes to our Stratio Spark dispatcher with its application token. This token is stored in a metadata volume in order to fetch the metadata of the application, and the Stratio Spark dispatcher launches the Spark driver and sends it the application token. As you can see, this is not secure at all, okay? There are three problems we have to face. First, the application token ends up written in the executor logs: because we are using Docker, the only way we had to pass any argument to the application was via environment variables, and they show up in the logs. Second, the sensitive information travels over a non-secure transport layer, because we don't have TLS activated inside the cluster. And third, the application token is also written in the application metadata. What is the result of this? The bullies are still laughing at us. So we had to do something about all of this from a security point of view. What happens now? Our security team integrated a cool new feature: when an application is launched through Marathon, it receives a secret ID and a role ID that allow that application to log in to our secret vault securely, okay? So when the Stratio Spark dispatcher starts, it goes to the vault and obtains its own token. Why? Because when a Spark application submission comes to the Stratio Spark dispatcher, it no longer comes with a token like in the previous example; it comes with a role, which is not sensitive information. With this role, the vault gives us back a secret ID and role ID.
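The role-id / secret-id exchange described above closely resembles HashiCorp Vault's AppRole login. As a hedged sketch, this is roughly the call the dispatcher would make; the vault address and the id values are placeholders, not our real deployment:

```python
import json

# Placeholder address for the secret vault service inside the cluster.
VAULT_ADDR = "https://vault.service:8200"

def approle_login_request(role_id, secret_id):
    """Build the login call that trades a role-id/secret-id pair for a token."""
    return (
        f"{VAULT_ADDR}/v1/auth/approle/login",
        json.dumps({"role_id": role_id, "secret_id": secret_id}),
    )

url, body = approle_login_request("spark-dispatcher-role", "one-time-secret-id")
# A successful response carries the short-lived client token under
# auth.client_token; only that token, never the keytab or certificate
# itself, is used for subsequent reads from the vault.
print(url)
```

The key point is that the role id alone is not sensitive: without the matching secret id, which is delivered out of band by Marathon, it cannot be exchanged for a token.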
The secret ID and role ID are what we send to the Spark driver, and the Spark driver also logs in to the vault and comes back with the application token. But, as I said before, the Spark driver also launches the Spark executors. So when the Spark driver knows how many executors are going to be launched, it generates, with a mechanism quite similar to the secret ID and role ID, one-time tokens. As the name says, they can only be used once, and the driver sends one to each executor, okay? As you can see in this picture, with this one-time token the executor retrieves the application token. So what is the result of all of this? We have a scheme where no sensitive information is shared — only one-time tokens, whose misuse can be detected if someone intercepts them — and no sensitive information is written in any metadata. So what does this really mean? That our secrets are safe, and if some bully wants to steal a secret, we can put him in jail. This is our own feature, I think. And now Marcos will explain the next part. Well, let's talk about mutual TLS. As we said, it is another kind of security solution. When I tried to find a good picture for this talk, I found this one — I just searched for it, and I don't know why this guy is everywhere. So, let's start by defining mutual TLS, okay? Mutual authentication means two parties authenticating each other. It's like: okay, I'm your friend, I know you, you know me. And most big data technologies allow the implementation of this protocol. Okay? Let me explain the protocol briefly. It's based on two main items — the keystore and the truststore — and, of course, the parties. We have two parties here, Superman and Batman. So let's check whether Superman and Batman are friends or not, without watching the film.
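The one-time-token handoff between driver and executors can be modeled with a few lines of Python. This is a toy simulation of the property that matters (a token dies on first use, so a replayed copy is detected), not our real implementation; the class and names are invented for illustration:

```python
import secrets

class OneTimeTokenStore:
    """Toy model of single-use tokens guarding the application token."""

    def __init__(self, application_token):
        self._app_token = application_token
        self._pending = set()

    def issue(self):
        # Driver side: mint one unguessable token per executor.
        token = secrets.token_hex(16)
        self._pending.add(token)
        return token

    def redeem(self, token):
        # Executor side: trade the token for the application token.
        if token not in self._pending:
            raise PermissionError("token already used or unknown: possible intruder")
        self._pending.remove(token)  # single use: gone after redemption
        return self._app_token

store = OneTimeTokenStore("app-token-123")
t = store.issue()                      # driver creates one token per executor
assert store.redeem(t) == "app-token-123"
try:
    store.redeem(t)                    # a replay by a "bully" fails loudly
    replay_detected = False
except PermissionError:
    replay_detected = True
```

Because redemption consumes the token, an attacker who sniffs it either arrives too late (the executor already used it) or, by using it first, breaks the executor's startup — which is itself the alarm.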
Superman has his keystore here, which is his ID, and his truststore, which is the folder where he stores his friends' IDs. So the communication — this is the handshake, okay? — goes something like this. I have a keystore; I'm Superman; I send you my ID. You, as Batman, answer me: do you have my ID in your friends folder? Yes, I have it. Here is my keystore — do you have it in your truststore? Yes? So the communication can begin. Okay? And if you look up TLS on Wikipedia, you will find this sentence, which makes real sense: as it requires provisioning of the certificates to the clients and involves a less user-friendly experience, it's rarely used in end-user applications. That means the request sent from the user to the dispatcher would be hard to build if we required users to handle some kind of secret. But even though that makes sense, at Stratio we think differently. We said: okay, by default most users are not used to playing with big data internals, so let's make it as easy as we can and hack something around this mutual TLS thing. We need to handle it so that it is really, really easy, and that's a huge challenge. Okay? As you can see, this is an example of a driver, okay, using our distribution. That means that the only thing you need to do to run a driver against a TLS data store is set these variables in your Spark configuration, and this will populate some values and variables to make the magic happen. How? Well, Jorge has already explained dynamic authentication and so on, so I won't repeat it. Let's run a Spark application. It goes to the Stratio Spark dispatcher, which runs a Spark driver; the driver goes to the vault and gets a raw token. The format may be base64 or whatever, I don't know — we have different methods to parse it, depending on the secret. And with this raw secret, we have some classes that generate the proper secrets. The secrets are generated inside the JVM.
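The Superman/Batman handshake maps directly onto TLS configuration. As a minimal sketch using Python's standard `ssl` module (the file paths are placeholders for the keystore and truststore material, and the function is only defined, not called here):

```python
import ssl

def make_mtls_server_context(cert_file, key_file, ca_file):
    """Configure a server context that demands the client's 'ID' too."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # our keystore / ID
    ctx.load_verify_locations(cafile=ca_file)  # truststore: the friends folder
    ctx.verify_mode = ssl.CERT_REQUIRED        # without this it is one-way TLS
    return ctx

# Plain TLS already verifies the server: a default client context
# requires and checks the server certificate out of the box.
default_client = ssl.create_default_context()
print(default_client.verify_mode == ssl.CERT_REQUIRED)
```

The single line `verify_mode = ssl.CERT_REQUIRED` on the server side is what turns ordinary TLS into *mutual* TLS — both parties must present an ID found in the other's truststore before the channel opens.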
So they are not exposed to the Docker container, not exposed in the logs, not exposed anywhere — we control everything here. And the secrets are exposed as paths, so the user can reach the proper secrets: he only refers to a path in order to use them in the TLS protocol. Then the driver gets the token and sends a one-time token to each executor, and the executor does the same trick, because you need to identify yourself both as the driver and as the executor in order to access Elasticsearch, Postgres, or whatever TLS data store you're using. So you can talk with Kafka from both the driver and the executors — that's pretty good. The most important thing for us is that nothing is written in the Docker logs or anywhere else; everything stays inside the JVM. And then Postgres appeared. It's a special case, the special guy of the class, because Postgres uses a JDBC connection string, and the JDBC driver does not use PEM files; it uses PKCS8 files, which are a different encoding of the key, and we had not implemented that at the time. As long as we had no method to convert from PEM to PKCS8, we decided to iterate. Our first approach was not good at all — I warn you — but it was our first approach. Spark has a script, spark-env.sh, that lets you do things before your process starts. That's not good, because it writes a lot of logs and it's quite bad from a security point of view. There we used OpenSSL, the system tool, to convert the certificate from PEM to PKCS8, and it was really, really bad. Nowadays we have implemented a new method to convert from PEM to PKCS8 inside the JVM, and we provide this secret the same way as the secrets you have already seen — it's just another property you can use, in another format. And as a future improvement, we want to provide an SSL socket factory; we are working on it. It's not strictly necessary, it's just another way to interact with Postgres.
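For reference, the conversion our first approach performed is the standard OpenSSL one: the Postgres JDBC driver wants the private key as a DER-encoded, unencrypted PKCS8 file. Here is a sketch that only builds the command line (the file paths are examples; running it would require `subprocess`):

```python
def pem_to_pkcs8_cmd(pem_key, out_file):
    """Build the OpenSSL command converting a PEM key to DER-encoded PKCS8."""
    return [
        "openssl", "pkcs8", "-topk8",
        "-inform", "PEM", "-outform", "DER",
        "-in", pem_key, "-out", out_file,
        "-nocrypt",  # no passphrase: the key material is protected by the vault instead
    ]

cmd = pem_to_pkcs8_cmd("/tmp/user-key.pem", "/tmp/user-key.pk8")
# subprocess.run(cmd, check=True) would perform the actual conversion,
# which is exactly the kind of shell-out we later replaced with an
# in-JVM conversion to avoid touching disk and logs.
print(cmd[0], cmd[1])
```

Doing this in a pre-launch script leaves the key sitting on disk outside the JVM, which is why we moved the conversion inside the process.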
But we think it is quite interesting. So, now, network isolation. Well, before I start with network isolation, I want to add something to what Marcos said — it's cool, it's not bad, sorry. But he forgot one thing. In the code we showed, we were talking about Elasticsearch, but in the Postgres slides we saw the code with Kafka. Why? Because the code is exactly the same; the properties are exactly the same. So any new technology that comes with TLS using PEM-format files is already covered. What's the catch? The main trick with Postgres is that this is not possible there, and we had to do all the things that Marcos explained. Sorry for the interruption. Okay, network isolation. What happened when we accomplished all of this? We thought: okay, our cluster is now secure. We believed the whole process was totally secure, because everything we had done was secure. The communication is secure, we don't expose any token, and the secrets we generate, as Marcos said, live only inside the JVM. So nothing is shared — all cool. Then we came across a particular use case called the data scientist. Data scientists must have access from their own computers into the cluster, so we opened a way to open shells inside the cluster, and we got this particular situation. Let's say we are this guy, some regular guy who wants to launch a Spark application that reads from a Kerberized HDFS. He runs his application, which communicates with the Spark dispatcher; the dispatcher runs a Spark driver and Spark executors, which interact with each other and with HDFS. But what happens when some bad guy, this guy right here, launches a process through the backdoor that has been opened? With this hostile process, since he is inside the cluster, he is now also able to communicate with HDFS, and he is also able to intercept all the communications, okay?
With some effort, anyway. So, what did we do about this? The Stratio security team came up with an integration between SDN and Mesos, so we now have a software-defined network. And what does this approach give us? We can create several isolated virtual networks, profiled by kind of user or by project. For example, we can create a virtual network for the user Marcos that only has access to HDFS, and another one for me, user Jorge, that has access to HDFS and Elasticsearch, or whatever. Why? Because my user is better than his — because I'm better than him. I'm joking, of course. So, okay, we can do this profiling by user, by project, or whatever. And one very important thing is that this software-defined network doesn't add any limitation to the Mesos architecture. Why do I point this out? Because our first approach used static resources: one project got only a fixed amount of the cluster's resources. That was not good; it was something we had to eliminate, because we want Mesos elasticity. So we added this new functionality. Okay, happy face: we can assign just the resources needed by each user, and the Mesos architecture still gives us elasticity. Okay, the same process: the same guy launches the same application that reads as before. We drew two different boxes here. Why? Because, as I said before, the Spark driver is a Mesos framework, and when a Mesos framework starts, the Mesos master starts communicating with it on a random port that we cannot secure — securize, sorry. So the driver's network has access to that particular component of the architecture, but the executors don't talk with the Mesos master at all. Okay. So the Spark driver has access to the data store, the executors and the Mesos master, while the Spark executors only have access to the data store and the Spark driver. So the communication still works fine.
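The per-user virtual-network profiling described above boils down to a reachability whitelist. Here is a toy model of that idea — the network names and service list are illustrative, not our real Calico policy syntax:

```python
# Each virtual network may reach only a whitelist of services; anything
# not listed (including an intruder's unknown network) reaches nothing.
NETWORK_POLICY = {
    "net-marcos": {"hdfs"},
    "net-jorge": {"hdfs", "elasticsearch"},
}

def can_access(network, service):
    """Return True if the given virtual network may reach the service."""
    return service in NETWORK_POLICY.get(network, set())

assert can_access("net-jorge", "elasticsearch")
assert not can_access("net-marcos", "elasticsearch")
assert not can_access("net-intruder", "hdfs")  # default-deny for unknown networks
```

In the real cluster these whitelists are expressed as Calico ACL policies attached to each virtual network, and, crucially, they say nothing about CPU or memory, so Mesos keeps scheduling elastically.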
No, it is the same — we don't have any problem. So, what happens if someone launches a hacking process? With this new network isolation, the hacker cannot access HDFS, because access is profiled so that only the proper user or virtual network can reach it. And, of course, he has no access to the communication between executors and data stores, between the driver and the executors, or between the driver and the data stores. So, with all of this in mind, we think our cluster is quite secure. We are so sure of it that we are going to do a live demo right now with our product — please pray for us that it all works well. Marcos will be the one doing the demo; if you hold it with both hands, you can hold the mic. So, first of all, Marcos is going to show you the cluster that we are going to use for the demo. If I can find it. If he can find it, okay. This is our cluster. As you can see, this is DCOS. And, well, what we have here is a cluster with 200 shares of RAM. We have the Spark dispatcher already deployed, which has its own network, which is somewhere around here. So, here you can see that this network is Spark, which, of course, has access to the Mesos master. Now we're going to run a job, and he is going to show you the proper web request that we have to launch in order to access HDFS. Yes — something important: we also have authentication, because we go through the admin router, so we have to log in first, using OAuth. And maybe the cookie inside this script has been revoked, so maybe this fails — but don't worry, maybe it doesn't, because the cookie has not expired. It's a Schrödinger cookie. So, here we have the job that must be able to read from HDFS. First of all, let's take a look at it. Here you can see what we have talked about; we have some properties to find, which I'm trying to find. This is our version of Spark, and here you can find some security options.
You can see the Spark secret Vault role, which is the one able to interact with Vault. Here you can see the Mesos principal and the Mesos secret — the principal is the identity that is allowed to run frameworks and applications on the Mesos master. We have the role, and we have the networks. This network, which is maybe not quite readable, has no access to PSQL but has access to HDFS. So, let's run this script; hopefully the cookie is still valid. Just to point out, as you can see, this script is for a Python application. So driver number 44 has been launched — it's over here — and, as you can see, it has been able to read from HDFS. Here's the record of a guy from our company. Right now, we are going to try to do the same with a virtual network that doesn't allow access to HDFS. As you can see, we have here a different network with different Calico policies that don't allow this driver to access HDFS, so it must fail. From another point of view, you can see at the bottom of the request that we only share the Vault path, never the secret itself, okay? Yeah — the only thing the user needs to know is where the secrets are stored; an administrator or someone like that can tell him, okay, I left the secret at this path, and he can reference it. So, the driver has been submitted — it's number 45 — and when it decides that the timeout has come, it will fail. When will that happen? No one knows. Well, we hope soon. Let's go with Postgres. I will skip some parts of the request and show you only the important things for TLS. So, here you can see that this is a Scala application; here we reference the jar and so on. And here you see the network that allows access to PSQL — it has a policy that only denies access to HDFS. We have the same secrets but different paths, okay? We have the paths for the TLS interaction with the data store. It's always the same.
It's always the key path, the certificate path and the CA path, and the same for the truststore. So, if we run this — okay, the driver is number 46. Let's look at the earlier one first. Okay, as you can see, the driver launched without permission to read from HDFS has failed, and the one we have just run, number 46, with access to Postgres, has ended successfully. So now let's see the driver that has no permissions for HDFS — for Postgres, sorry. Yeah, sorry, I don't know what I'm saying. So, this one has no access to PSQL, and it will take a little while, but it will fail, okay? First, I want to show you something here, if I can find it — if I look for it, sorry. Well, not this one; we are not interested in the jars. Okay, sorry, let's start. So, if I can show you this, it will be interesting. Here's the network parameter, which is Spark-no-PSQL, a Calico network that we have defined with a policy that denies access to PSQL. And this service, which is still running because it has a long timeout, will fail — okay, we are pretty sure of that. And just to point out to all of you: in this log we can also show something — all of these properties, part of the Spark configuration properties. This is not insecure: Spark doesn't print any property whose name contains something like "password", so please take advantage of that, fellow developers. But we want to show you that, even though we didn't put any of these properties inside our web request, we have all this information, and all of it happens in a way that is transparent to the developer. I think that's very cool. And something we didn't mention: we benchmarked all of this, and it doesn't add any noticeable delay. This is the guy who ran the benchmark. Of course it has some impact, but it's something you can accept, because it's not significant — it's lower than we expected. So, well, while this is finishing, we would like to take your questions.
If something is not quite clear and you want to ask something, now is the time — and then you will have an awesome dessert. Questions? Yeah, they told me that I need to give you a microphone. "Wow, great presentation. Can you talk a little bit more about how you do the SDN?" You mean about the SDN? Yeah. For the SDN we are now using Calico, because Calico gives us a lot of power when we are talking about security policies; it has a huge, massive way to define ACLs, and that's the reason we chose Calico. SDN is important to us because we have tons of processes inside our cluster — business intelligence tools, processing tools, data stores — so we need to isolate those things without a huge impact on our core architecture. I mean, I don't want to say, okay, agent 1 is for business intelligence and agents 2 and 3 are for processing. We want to get past that, and we use SDNs as a cool way to isolate containers. Does that answer your question? Thank you. I will read you the teaser right now. So, your question? Thank you for coming. Yeah, thank you for coming. This is the mandatory slide for the company: if you want to be part of the Stratio team, please send us an email with your CV and we will try to help you. So thank you for coming, and enjoy the conference.