Hi, welcome to my session. Today I'm actually delivering this session on behalf of OVT, who cannot make the trip for some reason. Today we are going to talk about efficient big data in OpenStack. Before we start, let's take a look at the agenda. First, I will give a brief introduction on who we are, what we are doing, what our motivation is, and what the current big-data-in-cloud landscape looks like. After that, we will introduce our customer as a case study, which covers their scale, their environment, their private cloud and public cloud deployments, and why they want to run big data in cloud. At last, we will show the problems and issues we encountered and how we fixed them, as a kind of BKM (best known methods) for people who want to run big data in their own cloud environment.

So let's go to the first part. We are from the Intel Software and Services Group. We focus on big data technology, and we do research work on big data and cloud solutions. We have been contributing code to OpenStack Sahara since Kilo, and we also contribute to other projects like Cloudera and related projects. Our goal, from Intel's point of view, is to help our customers adopt big data cloud solutions and make better use of the Intel architecture.

So the first question is: why do people want to run big data in cloud? We all know big data has become a buzzword; it is becoming more and more important. It brings the possibility to collect data, analyze it, visualize it, and transform it into real business value and real revenue for the company. The second thing is that big data is really complex. It's a vast, complex ecosystem with many different requirements and usage scenarios. To provide all those different services, a company typically has to build many different types of big data services. For example, they may use Hadoop, they may use Spark for in-memory data analytics, and they may use Hive or HBase.

That brings up the last, ultimate question: how do we reduce the cost? The cost is pretty high if you maintain two clusters, one big data cluster and one cloud cluster, which is pretty typical. The big data cluster acts as the analytics cluster for the company, while the cloud cluster is more like infrastructure. Typically, for the two clusters, you have to maintain different types of hardware, and the cost is pretty big. On top of that, it's quite possible that one of the clusters gets overloaded at some point, so you have to purchase new hardware and add it to that cluster. If we can run big data in cloud, that reduces the cost dramatically. But that is the hardware point of view; let's also look at the software. We all know the cloud provides the flexibility to resize and scale a cluster. If we can get that on the management side as well, that is another huge cost reduction. To summarize, when we run big data in cloud, we can satisfy new on-demand requirements, and it should give us an easy-to-use solution at a reduced cost.

So let's see the current offerings for big data on cloud. We have Amazon Elastic MapReduce, which provides a managed Apache Hadoop framework on EC2 instances.
You can also run Spark there to interact with data stored in the other Amazon data services, like S3 and DynamoDB. We also have HDInsight on Azure, which consolidates Hadoop, Spark, HBase, and other services. I personally take these two as the public cloud big data solutions. On the OpenStack side, we have Sahara, which can create Hadoop and Spark clusters automatically on OpenStack. We also have Cloudera Director and Hortonworks, which provide both public cloud and private cloud solutions. Not to mention some other players like BlueData. So we already have plenty of options. Today, let's focus on how we supported one of our customers, helping them run big data in cloud with Sahara.

The first and foremost thing when trying to deploy big data in cloud is figuring out what to take into consideration. We found that the most important decision is which approach you want to go with: bare metal, virtual machine, or container. From the cluster deployment point of view, I mean the fourth layer, we have to figure out which type of big data service we want to deploy, for example whether you want Hadoop or something else. The nice thing about moving that part into cloud is that when you need a new big data service, it's very easy to create a new cluster for it, compared with the traditional way. The second, and most important, consideration is the computing engine: which property do you care about most? We all know we started from virtual machines, which provide the best flexibility, but performance is the issue. Then there is bare metal, and we have Ironic to support bare metal, but we also see a hot trend on the container side. The storage part is another question: basically, whether you want to put your big data storage into object storage, block storage, or something else. The message I want to highlight here is that the most important thing is to think about which approach you want to run: bare metal, virtual machine, or container. Each has its pros and cons, but for containers today, I think they are still not mature yet.
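To make the "easy to create a new cluster" point concrete, here is a minimal sketch of what provisioning a cluster through python-saharaclient can look like. It assumes a Keystone session and a pre-built cluster template; the endpoint, credentials, and UUIDs are placeholders, not values from any real deployment.

```python
# Minimal sketch: provisioning a Hadoop cluster via python-saharaclient.
# All endpoints, credentials, and UUIDs below are placeholders.
from keystoneauth1 import identity, session
from saharaclient import client as sahara_client

auth = identity.Password(
    auth_url="http://controller:5000/v3",   # hypothetical Keystone endpoint
    username="demo",
    password="secret",
    project_name="demo",
    user_domain_id="default",
    project_domain_id="default",
)
sahara = sahara_client.Client("1.1", session=session.Session(auth=auth))

# A cluster template already encodes the node groups, flavors, and Hadoop
# processes; Sahara boots the instances and configures the services from it.
cluster = sahara.clusters.create(
    name="analytics-cluster",
    plugin_name="vanilla",                        # the plain Apache Hadoop plugin
    hadoop_version="2.7.1",
    cluster_template_id="CLUSTER-TEMPLATE-UUID",  # placeholder
    default_image_id="SAHARA-IMAGE-UUID",         # placeholder
    net_id="PRIVATE-NETWORK-UUID",                # placeholder
)
print("cluster id:", cluster.id)
```

Since the template carries the node groups, flavors, and process layout, standing up one more type of big data service is mostly a matter of maintaining one more template.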
So let's go to the customer use case part. Due to the time schedule, we did not pass legal review, so we removed the customer name here. The customer was established in 1990. It is a large commercial enterprise in China, ranked among the top 500 enterprises in China. Its business covers retail, logistics, supply chain, real estate, and investment. They operate in about 500 cities in China and have 180k employees. They have four R&D centers, in Beijing, Shanghai, Nanjing, and even Silicon Valley, and their brand value is over $13 billion.

This page, which actually comes from the customer, explains why they want to move to cloud, their motivation for moving to OpenStack. E-commerce is becoming more and more popular in China; we have Alibaba, we have Jingdong, we have Taobao, and several other online commerce companies. So several traditional companies, what we previously called retail companies, also want to move that way. That's why they want to bring cloud into their company. What they think is that bringing cloud into the company can help them increase efficiency, improve collaboration, and enable new business models. One thing they mentioned is that once they have a cloud layer running, with big data running on top of it, they can do things like analyzing customer behavior: how long you stayed on a single item and whether you finally purchased it or not. They hope that through this kind of cloudification they can achieve an evolution, or a revolution, that shifts them from a traditional retailer to e-commerce. The second thing is that the physical cost, I mean the real estate cost, in the big cities is really huge. That's probably one more reason companies like Alibaba are so hot in China: when they move more business from the physical retail stores to the online store, they save a lot of cost there.

This is the overview of their current cloud clusters. They have two different types of cloud solution, a private cloud and a public cloud. Their private cloud is a multi-data-center deployment running about 1,000 hosts and 10,000 virtual machines, which provides rich customized middleware together with automated deployment and operations. The public cloud solution is basically about cloud servers, VPC, shared object storage, and cloud database, and also includes things like faster deployment, monitoring, and billing.

Let's take a look at their OpenStack journey. They started quite early, around 2014. They began with a single deployment and then moved to a multi-region deployment across different data centers. It's a pretty typical OpenStack adoption process, as we have observed in China: most companies start with a single deployment for testing, then move that cluster into production, and then start to contribute and use more and more components. Currently they run KVM and Docker mixed, with multi-layer applications including clustered applications, standalone applications, and intelligent scheduling, provisioning around 100 virtual machines per week.

They have touched many domains in OpenStack. With Heat, they first deployed their web cluster, then their application cluster and database cluster, and currently they use it for auto scaling. They also use Sahara: they use it to launch Hadoop and Spark clusters and to offload the data to object storage (I will mention some details later), and they use Sahara to schedule jobs, chain jobs, and do the monitoring and administration work. The other thing I want to highlight is Trove: they actually provide database as a service to their customers, using Trove for MySQL and Redis services. The Docker part is the only item that is still on their roadmap; they are not actually using Docker at this moment, but they plan to use it for CI and CD for their traditional assets.

The previous part was really about their environment, their needs, and what they have been using in OpenStack. This part is about their big data workload and environment. Based on the latest data, they run about 30K jobs per day, they are using Hadoop 2.4.1, and they are using the Swift interface with GlusterFS as the storage backend.
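Since their jobs reach object storage through the Swift interface, the data locations end up registered in Sahara's EDP as swift:// URLs, which the Hadoop Swift filesystem resolves at run time. A rough sketch of what that registration can look like, with hypothetical container names and credentials, assuming the `sahara` client from the earlier snippet:

```python
# Sketch: registering Swift-backed input/output locations for Sahara EDP.
# Container names and credentials are hypothetical.
input_ds = sahara.data_sources.create(
    name="clickstream-input",
    description="Raw click logs in object storage",
    data_source_type="swift",
    url="swift://logs.sahara/2016/04/",   # swift://<container>.sahara/<path>
    credential_user="demo",
    credential_pass="secret",
)
output_ds = sahara.data_sources.create(
    name="clickstream-output",
    description="Aggregated results",
    data_source_type="swift",
    url="swift://results.sahara/daily/",
    credential_user="demo",
    credential_pass="secret",
)
```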
On the workload pattern, they have mixed CPU-intensive and I/O-intensive workloads. The types of jobs differ a lot: they have long-running jobs but also second-level jobs, as we will see in the details later. The CPU-intensive ones include long compile-and-build jobs for their mobile application, search and compilation work, big data analytics, and some generation jobs. They also have two groups, and each group has about 20 departments, each focusing on different kinds of work; maybe one department needs one big data service while another needs a different one, so the requirements they bring are pretty different. That's the background of why they want to run big data in cloud. We helped them deploy Sahara to run big data in cloud, and we encountered many problems and issues. In this last part, I will share seven problems and solutions from this customer case study.

The first thing we heard from the customer is a complaint that it sometimes takes a long time to provision a cluster. The reality is that they have 30K jobs per day, covering many different types of workloads, and some jobs finish in seconds; it is not efficient to spend minutes provisioning a new cluster for a second-level job. For example, if we divide the entire job life cycle, we have the instance boot time, the cluster configuration time, and the time for the big data workload itself to run. Consider a job that runs for only three to five seconds but typically needs three to five minutes of cluster creation: that is almost unacceptable for them. So what we figured out is that when you use Sahara to create clusters, you have to think about how to satisfy the workload's needs. For the small jobs, the better way looks like a long-running cluster; for the large jobs, we probably have to use a dedicated, custom-sized, and optimized cluster to reduce the time to run the big data workload. For example, we can create a Spark cluster for some in-memory analytical workload, and we can also try to schedule that cluster onto more powerful hardware. So this is the first problem.

The second thing is that they have really complex big data processing: a job usually runs multiple sub-jobs, and the customer uses multiple big data services to run a job, for example Hadoop for the batch parts and Spark for the real-time and streaming parts. It also turned out that different departments have different requirements. What can you do when they ask for different types of big data services? I think we only have two ways: either we run one single cluster that provides multiple big data services, or we run multiple clusters, each for a specific service or purpose. But if we combine this problem with the first one, for example the small-job case where we have to provide a long-running cluster, it's possible we cannot take the second approach, because running multiple long-running clusters would be kind of a waste of your resources. So it's a trade-off: based on the first solution, I mean whether you provide a long-running cluster or a customized, optimized cluster, you then have to think about whether to run the same cluster with multiple big data services or to create multiple clusters, each running a different service.
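One mechanism Sahara already exposes for this trade-off is the transient cluster: a cluster created for a particular job and eligible for teardown once the job completes, while second-level jobs go through EDP to an existing long-running cluster. A hedged sketch of both paths, with placeholder UUIDs, reusing the `sahara` client and the data sources from the earlier snippets:

```python
# Sketch: short job on a long-running cluster vs. a transient cluster.
# All UUIDs are placeholders.

# Path 1: submit a small job to an already-running cluster via EDP.
run = sahara.job_executions.create(
    job_id="PIG-JOB-UUID",                   # a job already registered in EDP
    cluster_id="LONG-RUNNING-CLUSTER-UUID",
    input_id=input_ds.id,
    output_id=output_ds.id,
    configs={"configs": {}, "args": [], "params": {}},
)

# Path 2: for a big batch job, spin up a dedicated, right-sized cluster
# that Sahara may tear down once its job completes (is_transient=True).
batch_cluster = sahara.clusters.create(
    name="nightly-batch",
    plugin_name="vanilla",
    hadoop_version="2.7.1",
    cluster_template_id="BIG-BATCH-TEMPLATE-UUID",  # placeholder
    default_image_id="SAHARA-IMAGE-UUID",           # placeholder
    net_id="PRIVATE-NETWORK-UUID",                  # placeholder
    is_transient=True,
)
```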
The third thing is the storage choice. Currently, most solutions on the storage side run HDFS inside the virtual machines, but the problem is that when the instances go down, the data is lost, so you still have to persist your data to external storage. The second approach is to leverage an external HDFS, but then we lose the benefit of data locality and that kind of thing. Not to mention that with external HDFS solutions, people may have security concerns, because in this case both the hosts and the instances need to be trusted by the HDFS storage. The last option we want to highlight is Swift. There is a Hadoop Swift filesystem project, which can provide the counterpart of Amazon's EMR-with-S3 style of solution; in that case, you do not use HDFS at all. This is exactly what our customer is using: the Swift filesystem with GlusterFS as the storage backend, which is basically what Amazon is doing. So for the storage consideration, the solution, I think, depends. If it's performance critical, people will probably go for the internal HDFS, but eventually they still have to persist the data to external storage. And if you want to provide services like what Amazon is doing, they will probably take the Swift approach.

The fourth issue: when you run big data in cloud, the common pattern on the external storage side is that you typically end up storing HDFS data over cloud storage. What the customer asks most is: hey, in my MySQL cluster I already have two replicas, so why do I have to keep three replicas in the cloud storage? For example, if we run HDFS over, let's say, Ceph or GlusterFS, and we configure the replication factor in HDFS as 3, why should I use three replicas in Ceph as well? Our recommendation to them is not to use HDFS over cloud storage; go with the Swift approach, for example. But if you must use it, then you come to another trade-off. Take Ceph, for example. If you use only one replica in HDFS, you probably will not have data locality, and you let Ceph do the replication and consistency work. But if you instead use only one replica in Ceph, you lose all the other things the cloud storage system provides, like consistency, rebalancing, checksums, et cetera. So this is another thing to consider when deploying big data in cloud: figuring out which storage approach to adopt.

The fifth issue is cluster scaling. It's pretty typical in a cloud environment that you change your cluster size, for example adding more nodes or more servers to the cluster. But that kind of operation is almost unacceptable when you are running big data in cloud. Why? Because when we have a provisioned cluster running big data services and we add more nodes, the system will do the data rebalancing work, and that can take even longer than provisioning a new cluster. So we have to be cautious about the cluster size; I mean, we have to figure out the right size upfront, probably leaving some headroom or spare capacity. The other thing is that, potentially, in the future we can use containers to reduce the overhead, because containers reduce the instance boot time, so maybe provisioning a cluster becomes much faster and rebalancing the data becomes much faster.
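For reference, both of these knobs are plain API calls: the HDFS replication factor can be set through a template's cluster_configs, and scaling is a single (but, because of the rebalancing, potentially slow) scale request. A sketch with placeholder names and UUIDs, again assuming the `sahara` client from the earlier snippets:

```python
# Sketch: lowering dfs.replication when the cloud storage layer already
# replicates, and scaling an existing cluster. Placeholders throughout.

# Set the HDFS replication factor in a cluster template, e.g. when Ceph
# or GlusterFS underneath already keeps redundant copies.
# (Node groups omitted for brevity in this sketch.)
template = sahara.cluster_templates.create(
    name="low-replica-template",
    plugin_name="vanilla",
    hadoop_version="2.7.1",
    cluster_configs={"HDFS": {"dfs.replication": 2}},  # instead of the default 3
)

# Scale a running cluster: resize one node group and add another.
# Expect data rebalancing afterwards, which is the slow part.
sahara.clusters.scale(
    "RUNNING-CLUSTER-UUID",                            # placeholder
    {
        "resize_node_groups": [
            {"name": "worker", "count": 8},
        ],
        "add_node_groups": [
            {
                "name": "extra-workers",
                "count": 2,
                "node_group_template_id": "WORKER-NG-TEMPLATE-UUID",
            },
        ],
    },
)
```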
Problem six is resource configuration and monitoring. OpenStack cannot monitor the resource usage of Hadoop or Spark. We know Sahara has some blueprints planned to enhance this experience and add monitoring features, but they are not there so far. So currently, basically speaking, there is no way to monitor and configure those resources from the cloud side when you run big data in cloud. The approach people use today is still the traditional big data tooling, like the YARN and Spark web UIs, to monitor the resources for the big data workloads.

The last one is about OpenStack version control. It's a common scenario that people use different versions of the big data applications; for example, they may need different Spark clusters with different versions of the software stack, but in OpenStack the new versions are only supported in the latest releases, so it's hard for big data users to leverage the new features. We talked about two possible solutions. One is to backport the new features, which I think nobody will do soon, as it is time-consuming and very difficult. The second is to leverage OpenStack Kolla and run the OpenStack services using container technology. That's exactly what we are doing now, and we will see that it can help on the OpenStack version control part.

So these are the seven problems, and the possible solutions, that we encountered with this specific customer. And here comes the summary. We shared the customer's pain points from the Sahara deployment in their cloud environment and highlighted the possible solutions. We definitely need better big data support in the cloud, so the call to action is to address and upstream these solutions. If you have any questions or suggestions, you can send an email to the person listed here. This is the legal disclaimer: something in the slides may change in the future, and I'm not the one to be blamed, you know.

Okay, so here comes the question part. Any questions? If I cannot answer a question, I will go back to my colleague and make sure he can provide the answer.

Thank you for the presentation. I have a question about the storage performance. You mentioned that the internal HDFS configuration provides higher performance compared to the Swift configuration. Do you have any specific performance numbers comparing them?

Yes, we do have the numbers, but they are in another slide deck that I will forward to you later. We have done some tests with these three different configurations, plus the earlier virtual machine versus bare metal performance. We have another report to be shared.

Okay. In a virtualized environment, what are your suggestions for the resource provisioning scheme? More specifically, how many VM instances do you run on a single host?

Good question, but I think it depends on your hardware. If you are using a typical two-socket processor, for example, which may have 40 cores or even more, what I observe is that people lately do a one-to-one mapping and do not do any overcommitment. Oh, sorry, you have to go over there.
Oh, I forgot that. Can you hear me? Hi.

For the storage layer, you said internal HDFS versus Swift. What were you using for your client, and where did you see the performance differences between using the local storage versus moving to object storage? Because that will have its own performance characteristics.

On the local storage, do you mean the external HDFS solution? Internal HDFS. So external HDFS has better performance than the Swift approach. That is basically because when a big data workload iterates, rename is a typical operation, but in Swift, in object storage, a rename means you have to get the object, rewrite it, and delete the previous object. It's a different type of operation. In HDFS, what do you need to do? Basically just update the metadata. That is why Swift performance is much lower than the external HDFS approach.

I agree with that. What I'm saying is, you suggested your client use Swift, right? That's what I heard from you, that you suggested your client use Swift rather than HDFS. Is your client using Swift? In your use case, what was the recommendation, Swift or HDFS?

Oh, sorry, I've got it. It basically depends on whether you want to provide object storage or whether you already have object storage. If you already have an object storage cluster running and you have to load that data into your big data analytics cluster, then you will probably go with the Swift approach.

And what about the replication factor? I mean, on the next one, issue number four, you were talking about replication.

Yeah, there is no single answer. We can only provide several recommendations, but to be frank, we don't have best practices so far. It looks like most customers still want to stick to their own replication at the HDFS layer. So it's kind of a trade-off, and we have to see how we can handle that part.

Okay, thank you.

Okay, if there are no more questions, thank you for joining.