So, the mic is on. Hello everyone, my name is Marco, and I'm here with Daniel Gonzalez Nothnagel. The question we want to present today is: how can you deploy a big data or IoT system onto an OpenStack cloud? If you take a high-level view, you would consider an IoT application a state-of-the-art or bleeding-edge architecture, following the newest architecture rules, cloud-native approaches, and all that. But if you take a much closer look, you will see that hosting the big data system itself has a lot of issues. Within this talk we will focus on the storage part, and we will look at how we can deploy a big data system on top of OpenStack.

Why is SAP concerned at all? We are using OpenStack as our infrastructure as a service; this is our main target for our data centers, so we are trying to onboard all our workloads onto the OpenStack cloud. One of our customers, or stakeholders, is SAP HCP, the HANA Cloud Platform. HCP is something you can consider a platform as a service: it consists of a lot of microservices and also a variety of databases and big data systems, from which a business analytics suite is built for our customers. In particular, you can build IoT and big data systems on top of it.

For this talk we will focus on pure open source software, just so that we can have a really close and deep look at the technologies and how we can use them within the cloud. So we will use OpenStack, that's obvious. As a workload we will use Hadoop, to have a concrete example of how to bring this to OpenStack. For storage we will consider a variety of options, but the one we examine most closely is Ceph.

At the beginning we asked ourselves: is big data a cloud-native application at all? Does it follow the principles of a cloud-native application? The first question is, what is a cloud-native application? We found a good document from the Open Data Center Alliance, which is basically a consensus catalog of things an application must guarantee so that it can be considered cloud native. We picked three aspects from it to assess the solutions we are presenting and to check whether they follow the cloud-native approach or not.

The first is scalability, which is quite obvious: if you have high load, you want to scale up, you want to bring new nodes into the cluster, and the load should be balanced automatically. The second is failure tolerance: if one of the nodes fails, there should be an automatic takeover or rebalancing of the load. And the third, which is really important for us as an infrastructure-as-a-service group, is infrastructure independence. It means we don't want to buy dedicated hardware for a single workload; we don't want to introduce graphics cards into our compute nodes, for instance, just because one application needs them. These three criteria are what we will use to examine the deployments, checking whether they are fulfilled or not, and at the end we draw some conclusions.

So, what is a big data system?
I think it makes sense, if you want to design the storage, to have at least an idea of how the data flow behaves. Basically, for big data and IoT systems you have two phases, or two different ways the data is accessed. The first is writing data: you have a network of sensors, a network of data entities, that want to write their data constantly to the big data store. From a data-profile point of view these are write operations, and it is a constant flow of data; there won't be big peaks. That's one of the two profiles. On the other side, you want to do analytics on top of that: you want to see what the data you stored looks like. Here the data profile is a bit different. It is on demand, because somebody pushes a button and wants to see results. It's not a constant flow; somebody wants an answer, maybe a graph, in real time, and it is mainly read operations. From an architecture view, these analytics applications can be stateless, and they can be microservices. But I think the main issue in hosting a big data system is the data storage: where do you put the data storage in the cloud? Just to note, within the analytics phase you also have to write back some results, but that is not the major part of the work done in this phase; there are just some write-backs.

We also wanted to have a closer look at HDFS, although I think the concept is already quite well known. Basically, you have clients: a client can be your MapReduce job, something on top of HDFS, or a sensor that is writing data. The client always asks the NameNode where to retrieve or where to put the data. For a read, the client gets a reference to a DataNode and accesses that DataNode directly to read the data. For a write, there is something special: if you write something to a DataNode, it is automatically replicated in the background. You can change the setting, but usually the data is replicated across your DataNodes. And this is worth keeping in mind: your big data system already does replication, and your storage most probably also does replication. You can already see that this could cause some problems.

If you just take a high-level look at this HDFS, big data, Hadoop setup, you would say all three of our criteria are met. It is scalable: you can add new DataNodes and it will rebalance automatically. It is failure tolerant: if one DataNode fails, the others take over, no problem. And in theory it is also infrastructure independent, because it doesn't matter whether you put it on a Dell or HP server; you can choose your hardware. So basically you could say the presentation ends here because all the criteria are fulfilled. But if you have a closer look at the problem, you will see that it's not that easy.
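To make the HDFS behavior tangible, here is a minimal sketch of writing data and inspecting the replication, assuming a configured Hadoop client; the paths and file names are invented:

```bash
# Write sensor data into HDFS; the client asks the NameNode for target
# DataNodes, and the blocks are then replicated in the background.
hdfs dfs -mkdir -p /data/sensors
hdfs dfs -put readings.csv /data/sensors/

# Change the replication factor of the file (the default is usually 3)
# and wait until the replication has completed.
hdfs dfs -setrep -w 2 /data/sensors/readings.csv

# Show block locations and replication state.
hdfs fsck /data/sensors/readings.csv -files -blocks -locations
```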
So basically, the question now is how we can move this big data system into the cloud, and Daniel will give us more details on that and an overview of the deployments we can do.

Thanks, Marco. When you want to move such a big data workload, a Hadoop cluster, into the cloud, there are several possibilities for how to deploy it. The first possibility we want to look at is a bare metal deployment using Ironic. Such a solution most closely resembles a classic Hadoop cluster, where you just deploy your stuff on bare metal servers. One advantage of this is that you have direct access to the hardware, to the disks, and so on, so you basically get the same performance as with a classic cluster.

But there are also some things you have to keep in mind. For example, if you want to access the cluster from clients living in VMs, say to submit your MapReduce jobs from VMs outside of the bare metal cluster, those VMs usually live in a Neutron network, so they don't have direct access over layer 2. You basically need some kind of routing in between, and this may become a bottleneck. Also, and this is probably the bigger issue: if you want to give each of your tenants their own Hadoop cluster, you are basically back in a pre-cloud world where every tenant has their own hardware, and if they want to scale, they have to buy new hardware. So deploying an HDFS or Hadoop cluster on bare metal in the cloud usually only makes sense in a multi-tenant way, where all your tenants share the same HDFS. But if you do that, all your tenants share the same physical resources, and that might be a problem for your security, because everybody can access, basically, the data of the other tenants.

If you look at this solution against our three criteria for cloud-native applications, we see that it does not really scale: in order to add new DataNodes to the cluster, we have to add new hardware, and when we are in the cloud we don't want to add new hardware just to scale; we want to boot up a few VMs. So we cannot really consider this a scalable solution. We are still fault tolerant, because the HDFS block replication works here: if a node dies, the other replicas take over, so there is no problem. But of course, since we are using bare metal nodes, we are not really infrastructure independent, because we have a direct requirement on our hardware.
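For completeness, booting such a bare metal DataNode through Nova with an Ironic-backed flavor could look roughly like this; the flavor, image, and network names are placeholders and assume the operator has registered Ironic nodes behind the flavor:

```bash
# Boot a physical machine through Nova; the "baremetal" flavor is assumed
# to map to an Ironic node class set up by the operator.
nova boot --flavor baremetal \
  --image hadoop-datanode-image \
  --nic net-id=<provisioning-net-uuid> \
  datanode-bm-01
```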
For the next type of deployment we want to tackle the problem of scalability, and obviously, to get scalable here, it makes sense to move the DataNodes into VMs. For the storage, the first solution that comes to mind is just using the ephemeral storage that is provided with each Nova instance. But using ephemeral storage as a backend for HDFS has a problem: usually with ephemeral storage you just have some pool of capacity on the host, and the Nova scheduler just cuts out some of that capacity for your node. For example, if you use the standard file system backend, every VM just gets a file in the file system of the host and uses that as its storage. With the LVM driver you have a bit more flexibility, since you can have more disks under the hood, but you still have the problem that all your DataNodes are accessing the same physical resources on the host. If you have many of these DataNodes, this may impact your performance, because all DataNodes compete for the same physical resources.

Because of this, it may also make sense to schedule your Hadoop nodes independently from your normal workloads, so that the normal workloads are not affected by the I/O of these big data nodes. It makes sense to create a dedicated availability zone just for your Hadoop nodes, so they are separated from everything else.

Another problem: usually you rely on HDFS replication to ensure that you lose no data when a node goes down. But now, since we may be scheduling multiple DataNodes on the same physical hardware, it can happen that a physical host dies and takes down a whole replica set with it, and that would mean we lose data. Luckily, the Nova scheduler provides a feature called anti-affinity: you can create a server group in Nova, configure it to use the anti-affinity policy, and Nova will ensure that the VMs of that group are spread over your compute nodes. As long as it has enough compute nodes, it will schedule the VMs on distinct hosts, so you minimize the risk of losing data when a compute node dies. And just to show you that this is really easy to achieve, we printed the Nova boot command here: you just pass the group hint to the Nova scheduler and tell it the availability zone, and you get a DataNode VM booted with anti-affinity activated, in its own availability zone.
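Reconstructed from the slide, those commands could look roughly like this; the flavor, image, and zone names are placeholders:

```bash
# Create a server group with the anti-affinity policy.
nova server-group-create hadoop-data anti-affinity

# Boot a DataNode VM into that group and into the dedicated availability
# zone; Nova places members of the group on distinct compute nodes.
nova boot --flavor m1.large \
  --image hadoop-datanode-image \
  --hint group=<server-group-uuid> \
  --availability-zone hadoop-az \
  datanode-01
```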
Looking at this more closely: we have gained scalability now, since to add more DataNodes we can just boot up new VMs. We are still failure tolerant, but we have to keep in mind to use the anti-affinity feature. But since we are still bundling our storage with the compute nodes, we can still not say that we are infrastructure independent: we have clear requirements on our infrastructure, because all of our compute nodes basically have to provide the storage needed to run our DataNodes.

Before we tackle the problem of infrastructure independence, we want to show you another solution, one that tackles the performance problem of everything using the same physical resources on the host. Instead of using the ephemeral storage of the VMs, you can use Cinder. Cinder has a so-called block device driver, which allows you to use raw block devices as volumes. This allows us to pass local disks directly into the DataNodes. It has one limitation, of course: we have to ensure that Cinder schedules its volumes on the same host the DataNode runs on, because since we are using raw block devices, we cannot just pass them over the network. So we have to make sure that this happens. Luckily, the Cinder scheduler provides a filter for exactly this, which lets you give it a Nova instance ID, and Cinder will then ensure that it creates the volume on the same host. So that's a problem that has already been solved.

Keep in mind what we have gained here: better scheduling. We can now ensure that the VMs won't compete for the same physical storage. But we still have the problem with the replication, so we still have to use anti-affinity. And again, we printed the commands here: how to create the VM beforehand, how to create those volumes with Cinder, and how to pass the local-to-instance hint to the Cinder scheduler to ensure that the Cinder volume is created on the same physical host.
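A rough reconstruction of that setup, with invented backend and device names; it assumes the InstanceLocalityFilter is enabled in the Cinder scheduler and that a volume type is mapped to the block device backend:

```bash
# cinder.conf backend section for the block device driver (sketch):
#
#   [DEFAULT]
#   enabled_backends = blockdevice
#
#   [blockdevice]
#   volume_driver = cinder.volume.drivers.block_device.BlockDeviceDriver
#   available_devices = /dev/sdb,/dev/sdc
#   volume_backend_name = blockdevice

# Boot the DataNode VM first (with anti-affinity, as before).
nova boot --flavor m1.large --image hadoop-datanode-image \
  --hint group=<server-group-uuid> \
  --availability-zone hadoop-az \
  datanode-01

# Create a 100 GB volume on the same host as the instance; the hint is
# consumed by Cinder's InstanceLocalityFilter.
cinder create --hint local_to_instance=<instance-uuid> \
  --volume-type blockdevice 100

# Attach the raw local disk to the DataNode.
nova volume-attach datanode-01 <volume-uuid>
```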
The three cloud-native criteria here basically look the same as with ephemeral storage, but we have to keep in mind that we now have a better scheduler for our storage, so that our VMs won't compete for the same disk. We still have to bundle all of our storage with the compute nodes, though, so we are still not infrastructure independent with this solution. So how do we gain that now?

Basically, as just said, the solution is to decouple storage and compute resources. By doing this, we make the DataNodes completely independent from the storage backend. There is just one thing you have to keep in mind when using a proper storage cluster underneath: most solutions provide some kind of replication, as Marco already said, so you have to take care to configure replication in HDFS and in the storage backend accordingly. If you have replication in HDFS and also in the storage backend, you will replicate your replicas, and this will severely interfere with your performance when writing data, because you have to wait until HDFS has replicated all its data and then wait again until the storage cluster has replicated. This is not optimal, but if you keep it in mind, we have a solution here that, in theory, is scalable, because we have VMs that we can just bring up and down as we need; that is failure tolerant, because of the replication either in HDFS or in the storage backend; and that is now infrastructure independent, since we have decoupled the storage from the compute resources. Of course, this is just a theoretical model so far. To look into a concrete example of such a storage solution, I will now hand back to Marco, and he will talk about Ceph a little bit.

Okay. So basically it's the very same story: it looks really nice on the surface, and when we take a more detailed look, we will face issues. As Daniel already mentioned, one issue is the replication. One easy solution would be to just switch the replication level in Ceph to one; then the data is replicated three times in the DataNodes and just one time in Ceph. But this causes the issue that you have a much bigger failure domain, because the Ceph side is completely unreplicated: if one disk fails, it could mean data loss. So this is something you have to consider: what is the right combination of replication levels in both systems.

In general, if you use Ceph, it's just a Cinder driver, Ceph RBD, and then operation is the very same; it's just creating new Cinder volumes. One other piece of functionality I want to highlight here is quality of service. In Cinder there is the possibility to create quality-of-service rules, or rate limits; this basically caps the maximum data rate. There are two ways these rules can be applied. One is the front-end limit, enforced by libvirt; within libvirt I think it's the iotune parameter. Here the virtual machine itself is limited so it cannot write too much data. There is also the possibility in Cinder to have back-end quality-of-service rules, but this is not supported by Ceph, only by some other Cinder drivers.

One other issue is placement. We could also switch the replication down, but if you create a volume in Cinder, and this creates a volume in Ceph, you don't have control over where this volume is created. So it can happen that all the DataNodes are accessing the very same OSD, which might be the same hard disk. If you have a spinning hard disk and you have to wait for the head to jump around, it will really slow down your whole system; basically this is a bottleneck situation, and it needs to be avoided.

One solution for that is grouping the OSDs in Ceph. Maybe I should give some details on what an OSD is. Basically, in Ceph you have two services: the Ceph monitor, which watches over the overall cluster, and the object storage daemons. Each storage daemon needs connectivity to a device; it can be a bunch of disks or just one disk. To make it easier, just consider one OSD to be one disk. On top of that, Ceph has the possibility to create so-called CRUSH maps, to logically group the storage and to control, to a degree, where the storage is located. So what is possible with Ceph here is to create, say, three pools, and you add only parts of the OSDs in your system to each pool.
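A minimal sketch of such a grouping; the OSD IDs, names, and pool sizes are invented, and a production map would need proper host buckets per failure domain:

```bash
# Create a dedicated CRUSH root and host bucket for the first pool.
ceph osd crush add-bucket hdfs1-root root
ceph osd crush add-bucket hdfs1-host host
ceph osd crush move hdfs1-host root=hdfs1-root

# Place two OSDs (think: two disks) under that subtree.
ceph osd crush set osd.0 1.0 root=hdfs1-root host=hdfs1-host
ceph osd crush set osd.1 1.0 root=hdfs1-root host=hdfs1-host

# A placement rule that only selects OSDs below hdfs1-root.
ceph osd crush rule create-simple hdfs1-rule hdfs1-root osd

# Create the pool with that rule and set its replication level.
ceph osd pool create hdfs1 128 128 replicated hdfs1-rule
ceph osd pool set hdfs1 size 2
```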
With that, you can really avoid the kind of bottleneck situation where all the DataNodes might access the very same disk, because you have grouped them apart. In this example the idea is that you have three pools and three DataNodes. But it can happen that you scale your DataNodes up, and then you would need more pools, and this is not fully automated in OpenStack. So this is a problem that is not solved, or at least not easily. What you can do, if you know roughly how many DataNodes you will have, is build at least that many pools up front. But in general there is no automatic way to provision these pools. The idea here is also to have a default pool, spanning all the OSDs, where the default Cinder workload goes; so you have concurrency there, but not concurrency on your deployed HDFS.

One other thing I would like to highlight is performance, which is of course really important. If you have a Ceph cluster that consists of real spinning hard drives, it may be a good idea to have a closer look at cache tiering. If you add a bunch of SSDs, you can use them as a so-called cache tier: the data is cached before being written, and then it flows down to the slower hard drives. The setup is complex, and we are about to test it, but we don't have a working end-to-end solution here yet. It's just something to keep in mind if you are facing performance issues.

For Cinder, it's also worth mentioning that if you create more than one pool, each pool is, from Cinder's point of view, just a backend. This means you need to use volume types and the Cinder backend filtering, and with that you can control, just as on the Nova side, that one HDFS node does not use the very same pool as the others; it's the same idea as the anti-affinity in Nova.
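A sketch of how the pools, volume types, and the rate limit mentioned earlier could be wired together in Cinder; all names and numbers are invented:

```bash
# cinder.conf (sketch): every Ceph pool becomes its own backend.
#
#   [DEFAULT]
#   enabled_backends = ceph-default,ceph-hdfs1
#
#   [ceph-default]
#   volume_driver = cinder.volume.drivers.rbd.RBDDriver
#   rbd_pool = volumes
#   volume_backend_name = ceph-default
#
#   [ceph-hdfs1]
#   volume_driver = cinder.volume.drivers.rbd.RBDDriver
#   rbd_pool = hdfs1
#   volume_backend_name = ceph-hdfs1

# Map a volume type to the dedicated pool.
cinder type-create hdfs1
cinder type-key hdfs1 set volume_backend_name=ceph-hdfs1

# Optional: a front-end QoS rule (100 MB/s write), enforced by libvirt.
cinder qos-create hdfs-write-limit consumer=front-end write_bytes_sec=104857600
cinder qos-associate <qos-spec-id> <hdfs1-volume-type-id>

# Create a DataNode volume in the dedicated pool.
cinder create --volume-type hdfs1 100
```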
So basically, with this, we can say we have now achieved our goal, right? But I think it's obvious that the whole system is also much more complex than the others; local disk is much easier and does not have that many layers. We can say scalability is available, but you need to create these pools up front, so it's not really 100%. The system is as failure tolerant as the ones before, and you achieve infrastructure independence by using the default storage you have in the cloud. Daniel will now show you the conclusion, just a summary, and then we will also have a closer look at Sahara, because Sahara is the tool of choice if you want to deploy a Hadoop cluster on OpenStack, and we will see that all the things we presented here are also possible to deploy with Sahara.

Okay. In summary, these are the different deployments we have just shown you. As you see, with bare metal we have the problem of scalability, as we mentioned, and you are also not really infrastructure independent. By moving our workloads into VMs, we achieve scalability; this is what you can do with Nova ephemeral disks or with Cinder and the block device driver. What you have to keep in mind between those two solutions is that the block device driver provides a better scheduling mechanism, so that we can use different physical drives for our DataNodes and they won't contend for the same physical resources. And the last solution, the Ceph RBD driver for Cinder, is in our eyes the best one and fulfills our three criteria, with the small caveat on scalability that we need those dedicated pools to run the solution.

As Marco said, let's have a short look at Sahara. Sahara uses the concept of node group templates: you define in a template how the nodes in your Hadoop cluster should be configured. We have two examples here. The one on the left is configured to use ephemeral storage: just by configuring it to use zero volumes per node, you tell Sahara to deploy with ephemeral storage. The example on the right is a little more complicated. We tell Sahara to deploy each node with one volume per node, which should have a size of 10 gigabytes, and through the volume type we can tell Sahara which kind of backend we want to use. Here, for example, we have the block01 volume type, which maps to the block device driver in the backend. You also need to tell Sahara that it should schedule the volume locally to the instance; that's what we talked about before: when you use the block device driver, you must ensure that the volume is scheduled on the same host as your DataNode. Just by configuring another volume type, one which uses the Ceph backend, you could also use our third solution with Ceph, but then of course you have to leave out the volume-local-to-instance option, because that would not really work with Ceph.

You don't have to do all of this through JSON templates; it is also possible to create those templates in Horizon, which is basically the GUI for this. There you can configure everything as well: the availability zone for Nova and for Cinder, the volume type, and so on. So it's basically pretty easy to set this all up.
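As a rough sketch, the template on the right could look something like this; the plugin, version, flavor, and names are invented, and the exact field names should be checked against your Sahara release:

```bash
# Node group template for DataNodes backed by the Cinder block device driver.
cat > datanode-blockdevice.json <<'EOF'
{
    "name": "hdfs-datanode-blockdevice",
    "plugin_name": "vanilla",
    "hadoop_version": "2.7.1",
    "flavor_id": "4",
    "node_processes": ["datanode"],
    "availability_zone": "hadoop-az",
    "volumes_per_node": 1,
    "volumes_size": 10,
    "volume_type": "block01",
    "volume_local_to_instance": true
}
EOF

# Register the template with Sahara.
sahara node-group-template-create --json datanode-blockdevice.json
```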
Okay, and then just one last slide about what we plan to do next. We want to have a closer look at those different storage deployments, especially the Cinder block device driver and the Ceph RBD driver: measure their performance and compare them more closely. Later on we want to identify remaining problems with the block device driver and enhance it, for example by adding better scheduling mechanisms. Another thing we want to do with regard to the RBD driver is publish a CRUSH map example, and we also want to publish a white paper on this topic.

Okay, so we're done with the presentation. We're open for questions, if there are any. I think there is a mic.

Hi, thank you for the talk, it's very interesting. I'm Terry, I'm from Orange Cloudwatt, and we're working on big data, so we are solving the same type of problems. In a big data cluster, the write bandwidth to disk can be very high for each individual node; somebody may even want to write one or two gigabytes per second to disk, per host. When you go with ephemeral storage or local disks via the block device driver, everybody stays on their own node. But when you do it through Ceph, all of this goes to some central place. So what did you figure out about the network and the bandwidth towards the Ceph storage? Because, say, we're a public cloud, so everybody can run their own cluster, right? Imagine you have 100 or 200 nodes, and each of them tries to write one gigabyte per second to the central storage...

How can we answer this question? Basically, yes, you are limited by the network, that's for sure. With a local disk you have the performance of the local disk, and that's it. With the network, what you can do is set up a dedicated storage network with dedicated physical links, but in the end you will be limited by the network. What I would say makes sense is to cap this maximum rate with a limit, so that you don't completely DDoS your Ceph cluster. But what you say is totally right; that's something you have to consider. A local disk has a certain I/O rate, and that's a given; over the network you also need to take a more detailed look at the networking itself. With this talk we really focused on the storage, but yes, I fully agree with your point.

So what kind of bandwidth have you provisioned towards the Ceph storage? We are currently building up a benchmarking facility where we try to find exactly this out, so we're not yet at a stage where we can say what the right limit is. That's one of the things we want to publish next, with more detail on our assumptions.

Okay, so thank you. Have a nice evening.