Good morning. How are you? My name is Daniel Gilfix. This is Karan Singh. We're very pleased to be here today. I don't understand much of that myself, but I had a go at it. Now, out of courtesy to those who, like ourselves, can't speak German, we will switch to English. My name is Daniel Gilfix, and I'm joined today by Karan Singh. We're both from Red Hat Storage. I work in the business unit focused on storage and hyperconverged infrastructure, and Karan is a solutions architect on our solutions engineering team. We're here today to talk about managing data analytics in a hybrid cloud. Here is the agenda we'd like to propose for this short discussion. We'd like to review some of the challenges we've seen working hands-on with customers who are trying to deal with issues related to data lakes and managing data, along with some of the common approaches those customers, and perhaps you as well, have taken in dealing with untimely access to data. Then we'll talk about the concept of a shared data lake approach with a common data set, go into more detail on how it works and in which use cases, and close with some comments on how you might get started. To begin, the challenge can be summed up this way: an increasing number of data analytics teams need greater agility to spin up their own clusters on a common private cloud infrastructure, without the cost of duplicating data sets in non-shared HDFS silos. Why is this happening? Because more and more customers have growing amounts of big data, whether structured or unstructured. This data keeps accumulating, as everybody knows, but the challenge is to reap timely insights from it with the existing data infrastructure. Traditionally this has been done in monolithic Hadoop clusters.
Many people have been challenged by this effort, given the amount of data and the number of people trying to access it, and that has led to a number of alternative approaches for solving their SLA problems. In general, there have been three main approaches to the problem of too much data and too many people trying to access it. First, given that Hadoop was built to scale, the most intuitive thing customers do is simply increase the size of the cluster. That's option number one, and it involves growing compute and storage in tandem, just making the cluster bigger and bigger. Unfortunately, this has frustrated folks because it's difficult to maintain a single environment, it's difficult to manage conflicting teams all trying to access a single pool of data, and at the same time it's impossible to grow and scale compute and storage independently. Realizing these limitations, many customers have thought: well, if we can't scale up, let's scale out. This involves duplicating the data sets. While that addresses some of the isolation challenges, since you can give certain people specific data and tools, it's also mighty expensive: expensive to acquire and expensive to maintain. And within each silo, you're still growing storage and compute in tandem. Some customers have simply decided this is all too cumbersome and uploaded everything to the public cloud, where storage and analytics can be managed externally: they store the data in Amazon S3 and provision access to the tools through EC2. This happens to be very popular in environments where the data originates in the public cloud, whether in government, the public sector, healthcare, or elsewhere.
But we've found that if the data originates on premises, it's much more efficient to manage the data and data access on premises. That leads customers to a third way, shown here, which is making storage a common infrastructure service instead of something managed by the data platform teams. It means teams themselves have the power to spin up clusters as they need them, without duplicating the data, because they're using a shared, common data set underneath. Ultimately, the goal in all of this is to make sure the right people have timely access to the right data, and to do so without spending a fortune. What we're talking about here is the disaggregation of storage and compute for analytic workloads. With a common data set on a common object store that can scale massively, you're not purchasing and maintaining duplicate data sets for every single group. In fact, you could be buying tens of petabytes of storage instead of hundreds of petabytes. And by giving folks the ability to spin clusters up and down as needed using OpenStack, you increase the customer's agility. Ultimately, you maintain a steadier state of resources: you don't have excess, you don't have too few, and you can right-size both the amount of data and the amount of compute. By empowering the service teams, you are in fact giving access to the right amount of data, with the right tools, at the right time. Red Hat has been working hands-on with customers on this for upwards of 18 months, many of whom have challenged us, many of whom have given us ideas, and we've taken those ideas and incorporated fixes into the products underlying this whole infrastructure solution. We call it the Red Hat Data Analytics Infrastructure Solution.
It is a combination of Red Hat Ceph Storage, which is the underlying core, along with the Red Hat OpenStack Platform for provisioning and the Red Hat OpenShift platform for those who are containerizing, together with a collection of services. It is offering our customers tremendous improvements in TCO and analytic performance, and they're having success. Many of these customers have contacted us, unsolicited, just to say how pleased they are with this kind of performance. Among the benefits they're realizing from a solution that combines the power of an object store with the power of OpenStack is reliable performance across the board: the right data accessible at the right time for whichever data scientist needs it. The open provisioning of analytics compute via OpenStack is timely and beneficial to everybody. Moreover, by virtue of the S3 API, you can have the same experience running in public and private cloud environments, so it truly makes for a hybrid cloud experience on premises. Ultimately, customers are seeing reductions in both CapEx and OpEx, principally by not having to duplicate data sets, by no longer having storage and compute chained together, and by being able to flexibly grow and scale out compute and storage as needed. So now I'd like to invite my colleague, Karan, who will go into more detail on how some of this works. Thanks, Daniel, for setting the stage. If we think about it from an architectural point of view, what do the different generations of data analytics look like? We categorize them into two generations. Generation one comprises a monolithic Hadoop stack, where the analytics vendor provides two things.
They provide the analytics software itself, and they provide single-purpose infrastructure that the analytics software runs on. One of the problems with this kind of approach is that there is no efficient way to share a data set across multiple Hadoop clusters. If you have to do a lot of computing, a lot of analytics jobs, you will probably end up with multiple siloed data clusters, and you need to think about an efficient way to avoid copying the data multiple times across the different stacks. Another key point about this approach is that if you need to add more storage capacity to an existing cluster, you bear the cost of also purchasing compute that may sit idle, since you'd only be using the storage part of it. The other way around, too: if you need to add more computing power, you have to purchase storage that won't be used. Basically, you're paying for resources that aren't in use. So this is a rigid combination of analytics software and infrastructure, tied together in a very tightly coupled way, and we need to find a different solution. Now, what we categorize as the second generation of analytics architecture is an emerging trend right now. We have been talking to 20 or 30 of our customers, big companies, asking about their pain points in the data analytics world, and somewhere around 75% of these companies agreed: yes, we are getting locked in by this monolithic design, and there has to be a more efficient way of doing analytics, whether now or five years from now. And there are a lot of new tools. If you go back eight or ten years, there were only a handful of tools for this work: Hadoop MapReduce, YARN, and Hive were the tools.
Now, with the explosion of data analytics tools like Spark, Presto, Kafka, and many others I haven't listed here, we need a different way to design the architecture. Some of the customers we've been working with, and as Daniel mentioned, we learn from our customers, were early adopters of this architectural approach, and the way they improved their large-scale data analytics platforms and infrastructure was to disaggregate compute from storage. With this approach, the analytics vendors still focus on their own part, which is the analytics software, and what we are proposing is that they use on-demand provisioning, either from OpenStack or, in the future, from Kubernetes or OpenShift, to provision the compute resources for their analytics environments. All of this happens on top of shared object storage underneath, based on Ceph, and that gives you the flexibility to launch multiple different analytics clusters without having to copy the data set across them. All right, so this is how it looks. You have an OpenStack platform, a general-purpose OpenStack platform, powering the compute, and you then have shared object storage using Ceph. Everybody knows Ceph has been pretty popular in the OpenStack world for the last five years; it's one of the most-used storage back ends for OpenStack. So Ceph can be reused: beyond providing Cinder volumes or Glance images to your OpenStack environment, you can grow your Ceph cluster, create an object storage pool, and use it as big data lake storage.
And of course, you get the flexibility to provision multiple analytics clusters, as I mentioned before, without wasting capacity and money copying the same data set multiple times. Think about it: if you have a few terabytes of data, or even a few hundred terabytes, copying might be okay; you can copy 500 terabytes five times over and it's not terribly expensive. But if it's on the order of tens or hundreds of petabytes, copying the same data set for three different teams or three different clusters within the organization is going to be really expensive, and not only for smaller companies: even the larger companies with big budgets find it expensive. With this architecture, you can give agility to your data science teams. At a given point, one data science team needs some kind of compute and analytics platform to run its analytics jobs. Suppose another team wants to run on the latest version of the Spark libraries. With a monolithic architecture, they can't, because they're locked into the version built into the monolithic stack. But with this approach, you can launch, on demand, multiple clusters of different versions with different tools, all sharing data through a single object store. So, as I mentioned, it could be a multi-purpose analytics platform. You could use Kafka for data ingestion in one cluster backed by OpenStack compute, then run ETL jobs using Hive on MapReduce, or Spark SQL on Spark. You could have another cluster dedicated to interactive queries, for faster query responses, using Presto and Impala.
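To make the economics of that point concrete, here is a back-of-the-envelope sketch in Python. The per-terabyte cost, team count, and data set size are made-up illustrative numbers, not Red Hat figures; the point is only that duplicated silos multiply the raw-capacity bill, while a shared store does not.

```python
# Rough cost comparison: duplicated HDFS silos vs. one shared object store.
# All figures below are illustrative assumptions, not real pricing.

def raw_capacity_tb(dataset_tb, copies, replication):
    """Raw storage needed: logical data size * number of copies * replication factor."""
    return dataset_tb * copies * replication

dataset_tb = 10_000          # a 10 PB logical data set
teams = 3                    # three teams, each with its own silo today
cost_per_tb = 50             # assumed $/TB of raw capacity

# Siloed: every team keeps a full copy, with HDFS's default 3x replication.
siloed_tb = raw_capacity_tb(dataset_tb, copies=teams, replication=3)

# Shared: one copy in a Ceph pool (assume 3x replication here as well;
# erasure coding would shrink this further).
shared_tb = raw_capacity_tb(dataset_tb, copies=1, replication=3)

print(f"siloed:  {siloed_tb:,} TB raw -> ${siloed_tb * cost_per_tb:,}")
print(f"shared:  {shared_tb:,} TB raw -> ${shared_tb * cost_per_tb:,}")
print(f"savings: {siloed_tb - shared_tb:,} TB of raw capacity avoided")
```

Even under these rough assumptions, the siloed layout needs three times the raw capacity of the shared one, which is exactly the "tens of petabytes instead of hundreds" argument made earlier in the talk.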
Batch jobs could also go through MapReduce or Spark, long-running reporting jobs for example. So in a nutshell, you have a common layer providing compute to different kinds of analytics environments, and you can rest assured you don't need to copy the data set: it all comes from a single big pool of shared storage. Another interesting aspect of this design is that you can consolidate your infrastructure. Today you may have a siloed environment of compute and storage just for analytics workloads, alongside general-purpose compute for your web applications, your databases, or the private cloud you have built in-house. You can combine those two silos by moving the analytics part into your in-house OpenStack-based private cloud. You could definitely gain OpEx advantages: in particular, instead of a dedicated person managing only your data infrastructure, that role can be part of running your OpenStack-based private cloud. As we've discussed, on the left-hand side you have the monolithic design, where it's inflexible to scale the cluster and the storage and you basically have to copy the data set; on the right-hand side, things look much nicer. You have shared object storage, and compute and storage are decoupled from each other, so you can scale each layer at its own pace as you need, solving the problem of multiple data copies. And your data scientists can enjoy a shared-data experience: the output of a job completed on cluster one can be consumed by another cluster instead of being copied across. Thanks go to the AWS S3 and Hadoop communities for working on this part. Amazon has been running this architecture in its public cloud for three to five years now.
Both these communities worked together and evolved the S3 connector into S3A, the Hadoop file system adapter that is the glue making this happen. S3A is the third generation of this adapter: the first was called S3, the second, if I remember correctly, was S3N, and then they settled on S3A. S3A is now natively supported by the standard out-of-the-box Hadoop libraries, so any tool built on the Hadoop JARs can simply speak S3A out of the box. All you need to provide is an S3 endpoint. In Amazon's case, that's Amazon S3; in a private cloud environment, it would be a Ceph RADOS Gateway endpoint. You then configure it with the access keys in your Hadoop core-site.xml file, and tools like Spark, Kafka, and MapReduce can simply go and talk to your Ceph cluster the same way they do in the public cloud. As I mentioned regarding public cloud offerings, Amazon, and Google Cloud Platform and Azure as well, use this kind of architecture in their public clouds. Think of Netflix or Airbnb: they have been using Amazon's cloud for their big data workloads, using the EC2 provisioning layer for analytics, disaggregated from storage, with persistent storage in Amazon S3 so the analytics runs against a shared data set. This architecture is pretty popular, and Amazon EMR does exactly this. We want to bring the same experience on premises using OpenStack and Ceph: OpenStack is your provisioning layer, giving you the compute power needed for these clusters, and Ceph provides a single shared home for your data. There are multiple ways you could provision clusters.
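As a concrete illustration of the core-site.xml configuration just described, here is a minimal sketch. The property names are the standard Hadoop S3A settings; the endpoint URL and credentials are placeholders you would replace with your own RADOS Gateway address and keys.

```xml
<!-- core-site.xml: point Hadoop's S3A connector at a Ceph RADOS Gateway.
     The endpoint and credentials below are placeholders. -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://rgw.example.com:8080</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <property>
    <!-- RGW endpoints are typically addressed path-style, not via DNS-style buckets -->
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
```

With this in place, a Spark or MapReduce job can read an `s3a://bucket/path` URI directly from the Ceph cluster, with no copy into HDFS first.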
There's no need for me to explain every way you can provision clusters in OpenStack, but the Nova APIs are a great example: you can use your favorite tools to launch a cluster through the Nova APIs, or OpenStack Sahara can provision big data clusters for you. So we are extending the same model from the public cloud into the private cloud. We have already covered most of what's on this slide. What do you get from this, what are the benefits? You will have multiple analytics clusters, which lets multiple teams access the same data set without copying it. You definitely get a lower price, because you're not copying the data set, and faster provisioning of analytics clusters, or I would say on-demand provisioning of analytics clusters as you need them. You can also make it more flexible with multiple OpenStack flavors: this is my flavor for high-RAM, high-memory workloads, so I'll spin up that flavor; and this is the flavor for production, which includes my Spark 2.0 release, and so on. You have the flexibility to choose what you need. Now we'll talk a little about what this looks like with the different kinds of data analytics pipelines out there. A modern pipeline looks something like this. You begin with a data generation stage, where data comes in from multiple sources: it could come from a stream of events, and you could also ingest data into the platform using bulk ingestion tools. I'll talk about the tools on the next slide. Once the data is persisted to the storage layer, you use tools like MapReduce and Spark for ETL, doing the transformation, joining multiple data sets together, and creating a data repository for your use.
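The flavor idea above can be sketched in a few lines of Python. The flavor names and sizes here are hypothetical examples, not Red Hat or OpenStack defaults; the point is just that each workload profile maps to a pre-defined compute shape that teams spin up on demand.

```python
# Sketch: mapping analytics workload profiles to OpenStack flavors.
# Flavor names and sizes are hypothetical examples, not real defaults.

FLAVORS = {
    # high-memory flavor for in-memory engines like Spark
    "analytics.highmem": {"vcpus": 16, "ram_gb": 256, "disk_gb": 100},
    # balanced flavor for MapReduce-style batch jobs
    "analytics.batch":   {"vcpus": 16, "ram_gb": 64,  "disk_gb": 200},
    # small flavor for interactive/dev clusters
    "analytics.dev":     {"vcpus": 4,  "ram_gb": 16,  "disk_gb": 40},
}

def pick_flavor(workload: str) -> str:
    """Choose a flavor name for a given workload profile."""
    mapping = {
        "spark": "analytics.highmem",   # in-memory processing wants RAM
        "mapreduce": "analytics.batch",
        "dev": "analytics.dev",
    }
    return mapping[workload]

print(pick_flavor("spark"))
```

In practice the chosen flavor name would be passed to the Nova API (or to Sahara) when booting the cluster nodes.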
You can also then convert or transform those data sets into different formats, columnar formats like ORC or Parquet, re-serialize the data, and of course join data sets as well. And for querying the data set... wait a minute. Yeah, this one. Once the data is transformed, you can run analytics queries and jobs for business reporting, trying to make sense of the latest trends in your company and how your product is behaving. And of course, the same cleansed data can be used by data scientists or data engineers for data exploration work, figuring out what the product should focus on next. And finally, the growing area: machine learning. Once you have clean data, you can train models on it and do prediction with things like TensorFlow. The idea is that everything can be done against a shared object store underneath, and that is what this enables in the long run. So, mapping some of the tools to these stages: for data generation, which I talked about, Kafka could be used for streaming, and other Apache streaming tools could be used here too. For bulk ingestion, you could look at Apache NiFi and Kafka. Long-running ETL jobs can be handled by MapReduce, or by Spark, since Spark does in-memory processing and is pretty fast. Once the data is transformed, you can again use Spark and MapReduce for reporting and batch jobs. And for interactive queries, which are pretty important for data scientists because they need fast responses, you could use tools like Presto, Impala, and Spark SQL.
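The stage-to-tool mapping just walked through can be summarized as a simple table. This Python sketch mirrors the slide; the tool choices are the examples named in the talk, not an exhaustive list, and every stage reads from and writes to the same shared object store.

```python
# Pipeline stages mapped to example tools, as described in the talk.
# All stages share one object store (Ceph via S3A) instead of per-cluster HDFS.

PIPELINE = [
    ("generation",        ["event streams", "bulk sources"]),
    ("ingestion",         ["Kafka", "Apache NiFi"]),
    ("etl",               ["MapReduce", "Spark"]),
    ("reporting/batch",   ["Spark", "MapReduce"]),
    ("interactive query", ["Presto", "Impala", "Spark SQL"]),
    ("machine learning",  ["TensorFlow"]),
]

for stage, tools in PIPELINE:
    print(f"{stage:18} -> {', '.join(tools)}")
```

Because each stage targets the same `s3a://` bucket, a cluster running one stage can be decommissioned while the next stage's cluster picks up its output in place.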
And TensorFlow for machine learning, or other tools not written here. So what does it look like with Ceph, and how much have we tested with Ceph? This is a very key slide. We have tried to test quite a few of these combinations, if not all, together with our partners and our customers; I'll talk about how we engage with customers on the next slide. We have done a lot of testing over the last year and a half or two years. For data generation, we chose TPC-DS, an industry-standard benchmarking toolkit for Hadoop stacks; most data analytics people use these kinds of tools to generate a data set. You can generate a data set of whatever size you need: we have tested with 1 terabyte and 10 terabytes, and in some cases with 100 terabytes of generated data. Once the data set is generated in the test environment, we do all sorts of work on the data. Another interesting tool, which one of our partners helped us with, is log-synth, where we join structured data with clickstream data to produce a data set that combines structured and unstructured elements. Once the data is ingested and transformed, we used tools like Spark SQL to run the queries defined in the TPC-DS benchmarking toolkit. There are somewhere around 100 queries that are expensive in terms of compute as well as data, and from those we chose around 50 that are specifically storage-intensive, because here we want to see how the storage holds up. So we chose some 50-plus IO-intensive queries and ran them through Spark SQL and Hive on MapReduce. We also ran the same queries in Presto and Impala, just to see how performance increases or even decreases when we change the analytics engine on top.
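The query-selection step just described, picking the storage-intensive subset out of the TPC-DS catalogue, can be sketched like this. The query numbers and their classifications are invented placeholders, since the talk doesn't list which queries were actually chosen.

```python
# Sketch: filter a benchmark query catalogue down to the IO-intensive subset.
# Query numbers and their classifications here are invented placeholders.

QUERIES = {
    "q3":  {"cpu_bound": True,  "io_bound": False},
    "q7":  {"cpu_bound": False, "io_bound": True},
    "q19": {"cpu_bound": False, "io_bound": True},
    "q42": {"cpu_bound": True,  "io_bound": True},
}

# Keep only queries tagged as IO-bound, since the goal is to stress storage.
io_intensive = sorted(q for q, tags in QUERIES.items() if tags["io_bound"])
print(io_intensive)
```

The real selection would be driven by profiling each query's read volume against the object store, but the filtering idea is the same.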
So Hive on MapReduce, Spark SQL on Spark, and then Presto and Impala are what we have tried. As for machine learning, it was just a prototype, not a full-fledged benchmarking test. But we have tried training a model with the data set coming from Ceph object storage. Once the training completes, the output goes back to Ceph and the model is stored there, and you can then use Ceph to distribute the model to the application. That works out of the box, but this kind of machine learning flow, with TensorFlow and similar tools, still needs more architecture-level benchmarking. One of the offerings we provide is close collaboration in the form of POCs and pilots. First we qualify your use case: before we begin, we ask whether it makes sense for you, whether we should go forward with this kind of architectural shift in your current data platform. We sit with your business stakeholders, your data engineers, and your infrastructure engineers in a closed room for a day or two of architectural workshops, involving people from Red Hat, people from my team in fact, together with Red Hat Consulting. We brainstorm and ask: what are the key questions you would like to answer, what are the key problems you are currently facing in your infrastructure with respect to data analytics? We jot down those questions and formulate a test plan, converting those exercises into work we can reproduce using your own data set, or with TPC-DS as a synthetic data generator.
Once things are in place and we have identified the business case, we move on to this new architecture of decoupled compute and storage, and if all goes well, we compare the results with your previous testing, if you've done any, on monolithic stacks like Hadoop on HDFS. Once everything checks out, we help you with a phased rollout of this design into your production infrastructure, and of course Red Hat Consulting is there for all the support you need. So that's the kind of offering we provide our customers: a paid POC and pilot project, so you can test this new architecture before committing everything to it. All right, Daniel, could you please give a quick summary? Thank you, Karan. A quick summary and next steps. First, I've already told you what I was going to talk about, and Karan went into more detail, so let me summarize what we just told you. The problems that some of our customers, and maybe you, have been facing are summarized here: they are simply missing their SLAs. They're not able to get the data, and the value from the data, in time, and this is probably happening to folks in this room. There are too many people trying to access the data, not enough tools, and no way to do it without duplicating all the data and spending a lot of money. Instead of spending an inordinate amount of money and excess OpEx and CapEx, there's an alternative approach; it has worked for our customers, and it might be right for you. The questions on the right are a first way of qualifying whether this is an opportunity for you: whether you have on-premises analytics activity, and whether you are running with multiple petabytes. This is certainly not for a little bit of data; this is for a lot of data. Do you have multiple clusters, whether Spark or Hadoop?
And do they need a shared data set? And do you also have non-Spark and non-Hadoop tools that need to access the same data sets? If most of your answers are positive, then it's likely this type of environment might benefit you. So here is one unsolicited account from a government customer, and because it's a foreign government customer, it's completely anonymous. What this account told us about was really unlocking value. This environment, this architecture, has helped them unlock value by releasing the lock on data: before, access was limited to certain people; now it is open, and the process is opening up to more and more people and more and more types of analytic tools. Prior to this environment, there was also a lock on compute, meaning they had a certain number of tools with compute and storage locked together; now they're able to spin up and decommission compute according to customer needs at the right time. And finally, innovation: this is allowing people to innovate more than they did before. As he says, it allows anyone to try to build something new without the fear of messing things up; it can tolerate mistakes at all levels, and by doing so our developers can be much more daring. It gives more power to the data scientists to access this data. A lot of this has been documented by teams inside Red Hat, including Karan's team. There's a set of reference architectures and also some blogs posted on the Red Hat Storage blog page; the ones in blue are current, and the ones in black are planned. So there's a lot of investment here in documenting these best practices and putting them on paper. And here are some pointers for getting more information from us on social media. The blog page I'm referencing is right here, and we're also very active on Facebook, Twitter, and the web.
So I would like to thank all of you, and by all means, if you have any questions, please feel free to ask right now or catch up with us afterwards. Any questions? Hi, thank you for the presentation. On our teams, people actually like the local storage and local compute; in particular, they keep them in lockstep for performance reasons. Your architecture slides are very nice, but how do you deal with questions of performance? It's very hard to hear up here, but is the question around the performance of... Yes, people like to keep their compute and their storage on the same boxes for performance reasons. Yeah, that's true, but think about when Google originally published the MapReduce paper, about ten years back: you needed to colocate compute and storage because the network bandwidth was not big enough to push the data through. Things have changed quite a lot since then. We are deploying Ceph clusters these days on 25-gig, 50-gig, or even 100-gig network bandwidth. It also depends on the use case you have: if you have a high-throughput kind of requirement, this can handle it, and we have seen that in testing. But if you have really active querying jobs, then you probably need to think twice before moving to this; still, Spark, for instance, helps you with in-memory processing, and there are a lot of things you can tune in this architecture. One other thing we have been testing in our team is exactly this, and we have seen really great results. It's not that you will get a 200 or 300 percent performance improvement compared to HDFS, but we are actually at parity with HDFS. So the customer we are working with said: okay,
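To put the bandwidth part of that answer in numbers, here is a rough Python sketch of how long it takes to push data over links of different speeds (the link speeds are the ones named in the answer; the data size is an arbitrary example, and this ignores protocol overhead, parallel streams, and disk limits).

```python
# How long does it take to move 1 TB over various network links?
# Ignores protocol overhead, parallelism, and disk limits -- a rough sketch only.

def transfer_seconds(data_bytes, link_gbps):
    """Seconds to push data_bytes over a link of link_gbps gigabits/second."""
    bits = data_bytes * 8
    return bits / (link_gbps * 1e9)

one_tb = 1e12  # 1 terabyte in bytes

for gbps in (10, 25, 50, 100):
    print(f"{gbps:3d} Gb/s: {transfer_seconds(one_tb, gbps):7.1f} s per TB")
```

At 100 Gb/s a terabyte crosses the wire roughly ten times faster than on the 10 Gb/s networks of the MapReduce era, which is the core of the argument that data locality matters much less than it used to.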
if I'm getting parity on performance, plus the agility of launching on-demand compute and analytics clusters, I'm fine with this; I'd be good even with a 10 percent performance loss. And one other thing we are working on, which is in draft right now, is a paper we are going to publish around Christmas time, or maybe early next year. It should cover in detail the testing methodology we used, how the performance compared with HDFS, and what this is good for and what it is not. Okay, can we have the next question? Have you also been looking at a hyperconverged setup, running your storage and compute on the same physical machines? Not for this kind of workload. For a data analytics workload, we probably would not go hyperconverged, for a number of reasons related to performance. The idea is that if you want to build a data-lake kind of environment, it's best to keep your compute and storage separate so you can scale each without paying extra for the other; otherwise it would be the same as a monolithic design, with compute and storage on the same box. Okay, thank you. We're past the closing time, so thank you very much for coming.