Hello, everybody. I'm Zen, and I'm a software engineer, so today I'm going to tell you a bit about mesos2iam. I'm working for Schibsted Media Group. A bit about myself: I was a Java developer before, and since last year I've been doing Go development. So let's go.

A bit about the company. Maybe nobody here has heard about Schibsted, but it has almost 32 products and a presence in 22 countries. I'll talk about the marketplaces area, which is the most famous part. For people in France it's known as Leboncoin; for people in Austria it's willhaben; and for people in Spain it's InfoJobs and Fotocasa. So it's all over Europe, and it's doing pretty well. As a developer, I'd say it's a pretty good company to work for.

A bit about the team. We are six people, the CRE team, the Common Runtime Environment team, and we're all based in Barcelona. Two of the guys decided not to make it onto my slides today; I'm doing something wrong, for sure. We also have a contributor to mesos2iam, a guy from Valencia. Apart from development, he knows how to make a nice paella, so you can argue with him about development, but about paella he's always right; you're going to lose that conversation.

As the Common Runtime Environment team we actually run three clusters: Mesos, Hadoop and Kubernetes. The idea is that the whole company should have only one Mesos cluster, one Hadoop cluster and one Kubernetes cluster, and we are onboarding teams onto them. Not everybody is onboarded yet, so somebody from Schibsted might still say, "I don't know what this is." We've come a long way with Mesos,
but with Kubernetes we're still pretty much just starting.

As for the workload, on a daily basis we run about 6,000 jobs on Mesos, more than 15,000 jobs on Hadoop, and on average about 2,000 pods on Kubernetes. These numbers alone don't say much, so I'll just put them up and focus on the Mesos part today. On Mesos we allocate almost a gigabyte per task, a task takes about fifteen minutes or more, and on average it uses about 1.4 CPUs. That says a bit more about the cluster than just "6,000 empty tasks running."

As a team, one thing we've achieved that's pretty cool is autoscaling: our Mesos cluster is continuously scaling up and down, and we keep it tight to our actual usage. The cool thing is that we have zero failed tasks while scaling down. How we did that: basically, we implemented the maintenance primitives in all the Mesos frameworks we were using. So we're the annoying guys who go on GitHub and ask the framework maintainers, "hey, can you support this?" So far we've done it ourselves for Marathon and for Chronos, because they're open source, and we're already using it in production, and also for Eremetic. For the people who don't know it, Eremetic is a pretty small framework written in Go; it's really good for running one-off tasks, and it's working quite well for us.

I'm going very fast over the scaling and the zero failed tasks, so if you're interested, you can check out the project we've built in our team: it helps put the nodes into maintenance and then takes care of killing them as the cluster scales down.
So let's go to Mesos and IAM. I'm going to skip the Mesos part, because I presume everybody here knows what Mesos is. But if somebody still asked me what Mesos is, I'd say it's an open source cluster manager that handles workloads pretty well in a distributed environment, most importantly through dynamic resource sharing and isolation.

So, on to IAM. IAM is an Amazon service. In the other presentation earlier I heard that 78% of Mesos clusters run on Amazon, so I think pretty much everybody should know about it, but before this presentation I had already prepared to explain it a bit. IAM stands for Identity and Access Management. For me, it takes care of who can do what, and how: it's basically the authentication and authorization part of Amazon. The most important part is that its usage is quite extensive. You can define IAM policies based on actions, like what kind of actions I can do: a get, a put or a delete. Moreover, you can define them at the resource level, for example on S3 data or on EC2 (EC2 is more of a service-based role), and you can tag your resources and scope roles to them. And what is most important for this presentation are the temporary security credentials. These are short-term credentials that IAM provides, so you don't have to distribute long-lived credentials and save them in a secure vault: they are temporary, for the current user and the request you have made.

I'll walk through how it works, in case somebody doesn't know it. Let's imagine we have a bucket in S3 where we save photos, super secret photos, and we don't want everybody or every user to get access to them.
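To make the action- and resource-level scoping concrete, here is a minimal sketch of what the IAM policy for the photos scenario could look like. The bucket name `super-secret-photos` is a placeholder I made up, not a name from the talk:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::super-secret-photos/*"
    }
  ]
}
```

The `Action` list is the action-level part (get, put, delete), and the `Resource` ARN is the resource-level part: the role holding this policy can touch objects in this one bucket and nothing else.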
We define the service that has to access these pictures, and we define an IAM role that is allowed to create, delete, or whatever, on these pictures. Later, for sure, you could assume this role from your local machine with your own credentials, but we don't even want that: we want only one of our production servers to be able to access the pictures. So we define an instance profile, and we allow our pictures role to be assumed by that instance profile. What that means in the end is that everybody on that instance can assume the pictures role and actually access the data, but anybody outside that instance cannot. This is done through the assume-role feature; Amazon has the STS API for it, where STS stands for Security Token Service. It's a pretty cool feature, and for sure we are using it.

So, our use case. We have a Mesos cluster with our frameworks running on it: Marathon, Spark, and I'm not going to go through all of them; there are a lot of applications, Luigi and Airflow too. Then we have an internal API that takes care of our security, so nobody can just create a job there: we have cluster admins who take care of creating jobs. These jobs are hard-associated with roles, so if I create a job to access my pictures, somebody will give me access and make sure that I only have access to those pictures and cannot take any other role.

The problem is that this is a cluster that can access a lot of data. Like they were saying before, data is oil, and we have more than one petabyte of oil, and users run their jobs against it. And then this use case came up: a couple of data engineers came to us, and they were doing their jobs on Jupyter notebooks. A Jupyter notebook is like a scratch book, if somebody doesn't know it.
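The "who may assume the role" side lives in the role's trust policy. A sketch of what the pictures role's trust policy could look like, with a made-up account ID and role name for the instance profile's role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/production-instance-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Only the principal named here, the role attached to the production server's instance profile, can call `sts:AssumeRole` on the pictures role; a developer's laptop credentials are refused.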
You can do your stuff in it and launch Spark jobs, you can do your stuff in Python, you can load the conda packages you want. For example, I do my stuff in Jupyter in Scala. The amazing part is that it's all in the browser; it's a notebook, and you can share it.

So the data engineers came to us and wanted to set up Jupyter on our Mesos. Actually, at this year's MesosCon Asia there was a talk explaining how to set up JupyterHub on Mesos; it's a nice talk, I would recommend it. What we did was modify the JupyterHub Marathon spawner a bit (it's on GitHub) to adapt it to our authentication. In the company we have a kind of portal that developers and users can access, and once you have access to it, the identity team makes sure you are identified and gives you an OAuth token and everything like that. So we had to modify JupyterHub for that. We also wrote a small reverse proxy in Go to take care of the setup part, because JupyterHub has a kind of admin console you can access, and we wanted to restrict that a bit, so we just put a little gateway there.

In the end, our happy data scientists come to JupyterHub, which has a pretty cool dashboard, and say: hey, we want to deploy our notebook (we call it a workspace). They come to the shared JupyterHub, ask to deploy, and everybody gets their own notebook. Each one is isolated, but everything is deployed in the same cluster. Once they have deployed their notebook, they can launch their jobs to Spark, and they can even deploy to Chronos if they want.
They can access a lot of the oil, and that's pretty amazing. But here we ran into a situation. On Mesos, the roles we were using were handled by that internal API, and nobody was able to assume another role, because it was pretty much impossible: the role was fixed at job creation, and job creation was done by the cluster admins. But here, one notebook could assume role one and was supposed to see only the resources that were part of role one, and likewise for the other notebook, the other user. A team should not access another team's data; it should only access the data its role was assigned to. The problem was that anybody could assume any role: once inside the notebook, you can actually start a terminal, assume another role, and just see all the data you should not see. So the isolation was pretty much gone in this case. We couldn't have that; users could access other people's resources, and that was a pretty serious issue for us. We were not able to deploy Jupyter in our cluster, because it would have opened up the whole cluster.

The problem, for sure, was the privileged instance profile: everything running on our Mesos cluster shares the same instance profile, so users just need to go to the Amazon console, or any other way, take the IAM role that another team or other users have, and assume it; it's just that simple. And that was a no-go for us, at least.

So we've talked about one part, the cluster; the other part is what we were using to assume the roles, and that is IAM. So let's dig into how it actually works.
For EC2, getting the temporary credentials for an IAM role goes through one of two options. Years ago there was only one option, the second one, but now they have also introduced the first one. One option is the instance metadata: you just query it and you get the credentials for the roles you want, and you don't have to worry about it. The other option that has been added is another endpoint, for the container credentials: if you query that endpoint, you get your credentials, and it's now actually the default way to get credentials in the SDKs; since around version 1.11.0 it's the default option.

But for sure there's some small print. Maybe you can't read it, so I'll read it for you: there is an environment variable, AWS_CONTAINER_CREDENTIALS_RELATIVE_URI. If this variable is not set in your container, or wherever you are running, the SDK always falls back to option two, the old option; if you want to force option one, you have to set it. And this is actually how IAM roles for ECS tasks work: the ECS agent sets that environment variable and populates it with the credential provider path and the task UID. So the SDK ends up making the call to an IP address, 169.254.170.2, then the credential provider path, also passing the task UID. That's pretty cool, because you can assign roles at the task level and make sure that one task cannot see the roles of another task or get its credentials.

We wanted that too, but we were pretty far from migrating to ECS tasks, and we wanted it in our own cluster. So this is where we started with mesos2iam. In the end, for a task, or anything inside the cluster, it's pretty much transparent: you make the call,
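The SDK decision described above can be sketched as a small Go function; the two endpoint constants are the real AWS link-local addresses, while the function itself is just my illustration of the fallback logic, not SDK code:

```go
package main

import (
	"fmt"
	"os"
)

// containerCredsHost is the fixed link-local address the SDKs use for
// the container credentials endpoint (option one, ECS-style).
const containerCredsHost = "http://169.254.170.2"

// instanceMetadataURL is the classic EC2 instance metadata endpoint
// (option two, the old default).
const instanceMetadataURL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

// credentialsEndpoint mimics the SDK's choice: when the relative URI
// from AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is present, credentials
// are fetched from 169.254.170.2 plus that URI; otherwise the SDK
// falls back to the instance metadata endpoint.
func credentialsEndpoint(relativeURI string) string {
	if relativeURI == "" {
		return instanceMetadataURL
	}
	return containerCredsHost + relativeURI
}

func main() {
	// With the variable unset this prints the metadata URL; with it set
	// (as the ECS agent or mesos2iam does), the container endpoint.
	fmt.Println(credentialsEndpoint(os.Getenv("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")))
}
```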
and what you get back in the end are the standard credentials: temporary credentials with an access key, a secret, a token and an expiration. So this was a pretty nice way to go, and here we decided to make a combination of Mesos and IAM.

So what is mesos2iam? It's actually just a daemon. We made it open source so other people can also use it, and it runs inside the Mesos agents. In short, it gives us back control of the IAM policies at the task level. Remember the last query I showed: I want to take control of it and make sure that no task can see another task's roles. The pseudo-code of mesos2iam boils down to this: manage the iptables rules; retrieve a task ID (I say "task ID" here, but it's custom, so it can be the app ID or whatever level you define); fetch the current credentials for that task; and return them to the container. As for managing the iptables, if you look at the code it's basically just two rules, a PREROUTING one and then a FORWARD one. That's it, but I'll explain it better with diagrams, because we can understand it better that way.

So, for example, I have a Mesos agent on a slave that has almost no IAM privileges; it's naked, it can barely do anything, it cannot access any data, it cannot access any service. And on top of it we have our task running, along with the mesos2iam agent. When a task wants to retrieve its credentials from the URL I was mentioning before, the request is just forwarded to mesos2iam. Then mesos2iam goes back to the container and fetches the environment variable, which is custom; let's say a container ID here, so it gets the container ID. And then, and this is the important part, we have another host that has a privileged instance profile installed,
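The fetch-and-return step of that pseudo-code can be sketched as a Go HTTP handler. Everything here is my illustration, not the actual mesos2iam code: the path layout, the struct fields (which follow the credentials JSON shape the AWS SDKs expect), and the injected lookup function standing in for the call to the privileged credentials host:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// Credentials follows the JSON document shape the AWS SDKs expect back
// from the container credentials endpoint.
type Credentials struct {
	AccessKeyID     string `json:"AccessKeyId"`
	SecretAccessKey string `json:"SecretAccessKey"`
	Token           string `json:"Token"`
	Expiration      string `json:"Expiration"`
}

// taskIDFromPath pulls the task (or container/app) ID out of the
// request path, e.g. "/v1/credentials/abc-123" -> "abc-123".
func taskIDFromPath(path string) string {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	return parts[len(parts)-1]
}

// credentialsHandler answers the intercepted SDK request: resolve the
// task's credentials via the injected lookup (the real daemon asks the
// privileged credentials host, which calls AssumeRole) and write them
// as JSON, or 404 when the task is unknown.
func credentialsHandler(lookup func(taskID string) (Credentials, bool)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		creds, ok := lookup(taskIDFromPath(r.URL.Path))
		if !ok {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(creds)
	}
}

func main() {
	// In the real daemon, the iptables PREROUTING/FORWARD rules redirect
	// the task's request for 169.254.170.2 to a listener like this one.
	fmt.Println(taskIDFromPath("/v1/credentials/abc-123"))
}
```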
which has more privileges than the Mesos agent. For this part we have also open-sourced a naive API, because we don't want to be opinionated here: depending on the company, this part is done differently. It's a naive API where you define, in a simple file, which task ID corresponds to which role. The important thing is that we have allowed the roles that are going to be used in the cluster to be assumed by this privileged instance profile. So with the container ID, or task ID, or app ID, we go to the credentials host and it retrieves the credentials: it calls AssumeRole, which is the actual temporary-credentials operation. AssumeRole returns the credentials to mesos2iam, and mesos2iam returns them to the task. So the task never knew what was happening; it got its credentials, and the whole thing was pretty much transparent to it.

I have prepared a demo here; it's a video, for sure, because I don't want to risk a live demo. Here I'm going to run a simple container that has nothing special; it just has the Amazon client installed (it's a Mesosphere Docker image). The important thing is that I'm setting the container ID to some random UUID, and I have also set AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, because without it the SDK would go to the normal endpoint, and since the machine's instance profile is naked, it wouldn't see anything. So I start it, and since the AWS client is already installed in this Docker image, which is pretty handy, I try to access a bucket. But I don't get anything, because Amazon is telling me this is not an ECS task: it cannot find any task UID, nothing like that.
So I go and start mesos2iam here, together with the credentials API, which is also open source; here it's the naive version. And when I come back to the Docker container running inside Mesos, now I can access the data. Let me explain what happened: now I can access my super secret stuff. If you look at the screen, there's a container ID. Why am I using a container ID and not an app ID or a task ID? Because I started mesos2iam with a prefix, which here is "container ID"; you can use the app ID, the task ID or whatever as the prefix, and it goes and retrieves the same thing from Docker. I also put in a pretty random UUID, but it was not that random, because then we have the credentials host, and we can see it, for example, here: what I started the credentials API with was just the roles, the UUID and the role it was supposed to assume. So on one side I have my credentials API, which takes care of which UUID is going to be assigned to which role. In our case, to be sincere, we are not using this naive API: we use a random generator, we save the mapping for now in DynamoDB, and we are planning to move it to Strongbox.

So now I have started mesos2iam with the verbose flag, and it shows everything: that it's in host mode, for example, the IP address, and also the container ID it retrieved from the container. Host mode is actually not necessary; later I'll do it with bridge mode, this was just to show it. So, going on through the video: when I did the demo I started in host mode, but we can remove the host-mode flag, and by default it starts in bridge mode.
So here, for example, I just try to get the bucket, the one I only have privileges for, and if we look at the mesos2iam logs: yes, it's in bridge mode, same container ID, and that's it.

So these are the repos where you can contribute. We welcome contributions, because there is still a lot to do: we are totally lacking support for virtual networks like Calico, so there is still work to do. Moreover, right now the container ID, task ID, app ID or whatever we define is quite visible inside the Docker container, so we are working on using secrets, like in the talk today about how to use secrets in Mesos, to find a way to hide it. And that's it. If you guys have any questions, I'd be happy to answer.

[Audience] Hello. Hey, thanks for the talk. I was wondering if you could explain a bit more about how you maintain this map between, in the example, a container ID, and the credentials, because that seems key to how this is secure.

[Zen] Okay, yeah. Well, in this case we did it for JupyterHub, so I passed over that part really fast, because I was pretty sure nobody was interested in the internal workings. I mentioned that we have a portal where the identity team makes sure every user is identified and gets credentials, which have nothing to do with the Amazon credentials, for sure. From that, we inject the usernames: every user has their email injected into the Marathon spawner of JupyterHub. When we get the user, we go to an authenticated API; as of now we generate the mapping on the fly. We associate the user, since we know this user has to be mapped to this role, we create an entry in our credentials store, and then we create a random UUID and pass it to the task when we are starting it.
And it's deleted when the task finishes, or after some time if the task runs long.

[Audience] Yeah, thanks.

[Zen] Okay. Any other questions? No? Thank you.

[Moderator] Okay. Thank you.