So my name is Oscar Martinez. I lead the Big Data, Cloud and Advanced Analytics team at Clipix, a data consultancy firm specialized in traditional BI and, for the last few years, also in Big Data, machine learning, AI, and cloud technologies. Today I'm here to tell the story of a big blue elephant and a big green elephant. I don't know if you have ever seen two elephants mating, but it sure is an interesting event. This is the story of those two elephants, which are now creating an orange elephant. I divided the talk into three areas. First, I will start with an introduction, what I call the beginning of the "pacification" of Cloudera. Then I'll jump into some lessons learned from a bunch of projects that we've done with the "old" Cloudera — I'm air-quoting, for those who cannot see it because you're far away. Then I'll talk about the future, even though technically speaking it's kind of the present, because they have already released the new platform, the CDP, the Cloudera Data Platform; but it will really start hitting everyone around the middle of next year. And then some bullet points, the key takeaways that I want you to take with you when you leave this room. OK, let's start. This is a bit of the journey of Cloudera, at a very high level. In 2004, as the big data geeks like me may know, Doug Cutting and Mike Cafarella, based on a couple of papers released by Google, created Hadoop, which quickly became a success. A lot of the web companies — Facebook, Yahoo, and so on — started using it right away; it was the whole open-source thing. In the end, base Hadoop was a bunch of different services, software that you had to install on all the nodes of a cluster, and we may be talking about hundreds or thousands of nodes. It's not only one piece of software, but a bunch of them. So managing this whole thing was quite tedious.
The early guys could do it because they had basically created the product. Cutting, for example, was at Yahoo for a few years; he created it, so he was probably the best guy to make these deployments. But a few years later, in 2008, some guys from Oracle, Yahoo and Facebook said: OK, what if we try to make this Hadoop thingy a more enterprise-ready product? And that was the beginning of Cloudera. First they released Cloudera's Distribution Including Hadoop; that was the first name, later rebranded to CDH, the Cloudera Distribution of Hadoop. That was 2008. Then I put three or four bullet points there: that's when first YARN, then Impala, Spark and all these other projects appeared. I could focus on that period, but I prefer to focus on something that happened a bit later, around 2017. I think that's really the foundation of the three points that now drive Cloudera from being, initially, basically a Hadoop big data company, to changing their message — and actually their whole strategy — to become the enterprise data cloud, basically a platform for digital transformation. They are opening up the scope of what they attempt to do. In 2017, three things were incorporated into the Cloudera stack. The first one is the Cloudera Data Science Workbench. Until that moment, Cloudera, or Hadoop, was something you would install on virtual machines or on premise. Data Science Workbench was the first Cloudera product that ran on Kubernetes — it still runs on Kubernetes — which was already a shift from software installed directly on virtual machines to software running in containers on Kubernetes. The second thing that I want to highlight is Cloudera Altus. Until that moment — still today, really — you would basically put Cloudera on premise; it is designed for on premise. But at that point Cloudera saw that they are called Cloudera, yet they were not really on the cloud.
So they started to have their first cloud offering; this was the first Cloudera PaaS offering. They started to offer Spark as a service on the cloud, Impala as a service on the cloud. And together with that, they also released the Shared Data Experience, SDX, which is basically a common data context that allows you to separate compute and storage. You have the storage, you create a metadata layer that manages all the data governance and security access, and then on top of it you can create the different compute engines. That's a formula that is already being used by other companies, some of them outside the Hadoop world, very successfully. Us included, because I have used these technologies. OK, so those are the three points that, for me, really laid the foundation of the direction of today's Cloudera. And then, of course, there is the mating thingy that I mentioned: one year later — last year — came the announcement of the merger with Hortonworks. As I said, these were two competitors, rivals, fighting and challenging each other constantly, and now they are becoming the same company. Hortonworks, in my opinion, was better in terms of security, the way they managed security, and they were also better in the streaming and real-time services that they shipped. Cloudera, on the other side, was better — in my opinion, again — at how you manage a cluster, and also at the BI services. So by combining these two you can choose the best of both; that's actually what they did, and they end up with a very, very strong offering. After the Hortonworks merger, the very first thing that Cloudera incorporated was Hortonworks DataFlow, rebranded into Cloudera DataFlow, which is basically NiFi in a cluster, integrated with the whole stack. And now, a few weeks ago, came the official release of the Cloudera Data Platform, the new platform of Cloudera that I will cover later on — an umbrella term for a few sub-products.
But I'll talk about that later on. Cloudera is everywhere, and now even more so with the Hortonworks merger. They are in the top companies in banking, telcos, governments, automotive, pharma, technology — you name it. They are basically everywhere, but on the old, traditional platform, and they want those customers to make the jump and keep that same position on the new Cloudera platform. Why is Cloudera so popular and widely used? I think the main reason is that it is multi-function: a single platform. Inside the platform there are multiple services, but you govern everything from a single point, and it can act as multiple things. It can be a data lake. It can be a data-engineering platform. It can host different types of analytical stores: data-warehousing-type stores; a search engine — basically Solr; real-time stores, with either Kudu or HBase depending on the use case; operational databases, also with HBase. You can have data flow, streaming applications and pipelines with Kafka, NiFi, Storm or Spark Streaming. And with the Data Science Workbench, and also with Spark, you can have data science applications. So it basically covers all of it. The only thing I actually miss here would be data visualization, and you may have heard that Cloudera, a month or two ago, acquired Arcadia Data. So I think that in a month from now we will also see the BI platform, the visualization services, in there. Visualization is maybe the only thing missing for a really complete offering, and they just acquired Arcadia Data, so they are heading that way. OK. As I said, the traditional Cloudera was something that you would install on on-premise clusters, and I've been in that spot — actually, in a talk that we will be giving tomorrow about a success story with the Andorra government.
Okay, and basically, you arrive there, they want a big data platform, and the choice was: if it's on-prem, then forget about AWS, Azure or Databricks. On-prem it was either Cloudera or Hortonworks; you had to choose one or the other. Cloudera is basically a bunch of services — you have probably seen this blue picture many times. It had all the services to fulfill these workloads: a bunch of SQL services, streaming services, security, and so on. And the same with Hortonworks. Hortonworks also had different services — some of them in common, some of them slightly different versions — but it pretty much covered most of the workloads too. Later on, HDP didn't include NiFi, so they released it as a separate product, which was HDF, Hortonworks DataFlow: basically NiFi, plus some things to ease how you work with NiFi, and Kafka, another component included in HDF. OK, so, to wrap up this section: the Cloudera Data Science Workbench, with the introduction of Kubernetes into the Hadoop world; Cloudera Altus, which was Cloudera's first step into PaaS, into really detaching storage and compute in the cloud; and SDX, which enabled that detachment. Those three pillars are the ones on top of which Cloudera has now built the Cloudera Data Platform, which I will review after this second section, where I'll be talking about some lessons learned when building data analytics platforms with Cloudera. You have probably seen different approaches to this graph in one way or another. It basically depicts the whole end-to-end journey of a data analytics pipeline: from the source systems, you ingest the sources, either with batch processing tools or with streaming processing tools, into a raw data area. That raw data area can be a data lake.
If you were using traditional BI, it was basically a staging area on a database; or you would use a streaming platform like Kafka to ingest your events in real time and host those events. Then, whether you use a data lake, a staging area or a streaming platform, in real time or in batch, you will need a bunch of services that transform this data into a format — or a structure, or a store — that you can actually use for analytical purposes. Be that a data warehouse, or a search engine, or a real-time store for really quick analytics, or an operational store holding information used by a web application or any operational activity. And then, at the end, for the typical analytical operations, you will be doing some dashboarding, reporting or data science projects, okay? Ideally, you would love to have a layer on top that governs and secures the whole thing across as many steps as possible. Okay, so, something that may look obvious: in traditional BI, in the traditional data warehouse, we had the staging area with the tables kind of raw, and then we had the final area — the data warehouse itself — with the final tables, the facts and dimensions, okay? When you translate this into the big data version of the data warehouse, you pretty much have the same. In the raw data area, the data lake, you have all of it: data from the sources — structured, unstructured and semi-structured. And with this data, I've seen — we've been in projects where this happens — customers asking: I have my data lake full of data, how do I use it? Because it's text files, unformatted. I mean, how the hell do I use my data lake? There are tools to use it on top, but in the end, in the end, you'd always need to curate that data, okay?
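To make that raw-versus-curated distinction concrete, here is a minimal sketch of the kind of curation step just described. The delimiter, field names and quality rules are invented for illustration — this is plain Python, not any specific Cloudera tool, but the logic (parse, type, validate, reject) is what a real curation job does:

```python
import csv
import io
from datetime import date

# Raw zone: unformatted text dropped into the data lake, exactly as the
# source system emitted it -- including a malformed line and a bad record.
raw_lines = """2019-11-19;42;  ACME Corp ;1999.90
2019-11-19;43;;-1
bad line that the source system emitted
2019-11-20;44;Initech;250.00"""

def curate(raw_text):
    """Turn raw delimited text into clean, typed records (the curated layer).

    Rows that cannot be parsed, or that fail basic quality rules, are
    rejected, so downstream analytics only ever see well-formed data."""
    curated, rejected = [], []
    for row in csv.reader(io.StringIO(raw_text), delimiter=";"):
        try:
            record = {
                "date": date.fromisoformat(row[0]),
                "id": int(row[1]),
                "customer": row[2].strip(),
                "amount": float(row[3]),
            }
            if not record["customer"] or record["amount"] < 0:
                raise ValueError("failed quality check")
            curated.append(record)
        except (ValueError, IndexError):
            rejected.append(row)
    return curated, rejected

curated, rejected = curate(raw_lines)
print(len(curated), "curated,", len(rejected), "rejected")
```

In a real platform this step would typically be a Spark or NiFi job writing Parquet into the curated zone; the shape of the work, though, is the same.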
Because the data in the data lake — the concept of the data lake is great, but to use it you are most probably going to end up creating curated data sets. They may be stored in the same data lake, but then you could call that a data warehouse, if the data is properly formatted, structured, partitioned and so on. You may also store them in a specialized data-warehousing solution like Snowflake, or in a search engine to enable Google-like queries, or in a real-time store, or in an operational store. And then, for specific functional areas or specific applications — maybe a machine learning pipeline — you are going to end up also creating enriched data sets: joined, aggregated data sets that basically feed that application, okay? So those are the three layers: you have the raw data in the data lake, the curated data sets, and finally some enriched data sets. When I talk to customers about building this kind of big data analytics platform, it usually boils down to these three logical layers — so probably one more layer than we used to have. I also wanted to spend a couple of minutes on lessons learned in sizing these multi-function, multi-workload Cloudera clusters. Traditionally, I would go to a customer and say: OK, you are in this situation, you need a Cloudera cluster, let's size it. Tell me the data you have — one petabyte, okay? To store it in Hadoop — Hadoop replicates everything three times — you basically multiply the amount of data by three, and that's the size of the HDFS cluster. That was fine for a first iteration, but nowadays it's slightly more complicated, because HDFS fault tolerance is not only done with replication; it can also be done with erasure coding.
And it's not only HDFS: there are other types of stores, like Kudu or Solr, and soon also Ozone — Apache Ozone will enter preview in the next Cloudera Data Platform. So you will have three, four, five different types of stores for your data; it's not only HDFS anymore. Actually, I heard from Cloudera technical guys that the future looks more like Ozone than HDFS, so it seems Ozone will be a thing going forward. And even if we were to stick with HDFS, it's not only CSV: you also have Parquet, ORC, SequenceFiles and a lot of other formats, and depending on the use case one fits better than another. So it's not as simple as multiplying by three. We have a process that we have already applied a few times — as we will show tomorrow — basically an Excel file in which we list the data sources, and then for every data source: tell me how much data you have right now, the historical data; how much data you generate per day, per year; which fields of this data you will most probably query — usually time, client and so on; and what the retention period is — do you want to keep the data, and have access to it, for ten years, for five? And then, remember what I mentioned before: you have these three logical layers — the raw, the curated and the enriched. Which ones do you want to store? There are some clients that want to store all the raw data, which may be quite heavy; most customers are happy storing just the curated, but you find both. Then, for every data set they want to store, you have to find which is the best store for it, and which is the best format. If it's on the data-warehousing side, how you partition and how you cluster the data — usually by the variables that you query. Or, if it's HBase, which data model you want, how you build the row key in HBase.
Or what the Solr indexes are, or, next, with Ozone, how you split the data — for whatever store, how you best leverage it. So we have this Excel; it may seem rudimentary, but it actually works. We fill it in — the type of source, the retention, all these things — and that gives you the total size that you will need, per store and per stored data set of that source. Then you define the node specifications. Cloudera usually says there are three types of nodes: the IO-intensive, the data-intensive and the average ones. Depending on the overall cluster usage, you are most probably going to go for the average, but in some cases the data- or IO-intensive ones. Anyway, you say: OK, here is my node, because I have an agreement with this hardware vendor — 80 terabytes of disk per node, 256 gigs of RAM, 16 or 64 cores, whatever. Then you basically do the division and, OK, you'll need 15 nodes. That usually satisfies most customers. But some customers have also come to us saying: that's fine, but I want a number of nodes that guarantees certain performance SLAs — I want that query to run in under 10 seconds. And that is really hard to estimate. We did our best in this Excel, in this application, making some assumptions on disk speed and on how the data is partitioned and clustered, to be able to give estimations of the performance that you would eventually get in a future cluster. But that is a very hard thing to do. So, some lessons learned for the different workloads. In data warehousing, you would usually store the curated data sets in either Parquet, if you were on Cloudera and using Impala, or ORC, if you were on Hortonworks and using Hive LLAP — Live Long and Process, or Low-Latency Analytical Processing, depending on which expansion of the name you prefer.
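As a sketch of that Excel-based sizing exercise: the source names, sizes and the 1.5x erasure-coding factor below are invented for illustration, and real sizing would also account for compression, file formats and temporary space, but the per-source arithmetic boils down to this:

```python
# Back-of-the-envelope cluster sizing, per the process described above:
# for each source, take historical data + daily growth over the retention
# window, apply the fault-tolerance overhead of the chosen store, then
# divide the total by the per-node capacity.

REPLICATION = 3.0      # classic HDFS 3x replication
ERASURE_CODING = 1.5   # e.g. Reed-Solomon 6+3 erasure coding overhead

sources = [
    # (name, historical TB, TB generated per day, retention days, overhead)
    ("clickstream",  200.0, 0.50, 365 * 2,  ERASURE_CODING),
    ("billing",       30.0, 0.02, 365 * 10, REPLICATION),
    ("sensor_events", 80.0, 0.30, 365 * 5,  REPLICATION),
]

def required_storage_tb(historical, per_day, retention_days, overhead):
    # Logical data kept within the retention window, times the storage
    # overhead of the chosen fault-tolerance scheme.
    return (historical + per_day * retention_days) * overhead

total_tb = sum(required_storage_tb(h, d, r, o) for _, h, d, r, o in sources)

NODE_CAPACITY_TB = 80.0                    # usable disk per node, from the vendor
nodes = -(-total_tb // NODE_CAPACITY_TB)   # ceiling division

print(f"total: {total_tb:.0f} TB -> {nodes:.0f} nodes")
```

The same spreadsheet logic extends naturally with a different overhead factor per store (Kudu, Solr, Ozone) instead of a single HDFS multiplier — which is exactly why "multiply by three" stopped being enough.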
And as I mentioned before, you would partition by the most commonly queried columns — usually by date, which is what I see most in practice. And then clustering: within every Parquet or ORC file you would cluster, or bucket, the data — different applications call it one way or the other — maybe by customer; perhaps not the customer ID, but some other attribute that you may also filter by. Okay, on the search engine side, something we found out: if you read the documentation, it says that to put data into Solr — the search engine you have in Cloudera and also in Hortonworks — you need to do it with Morphlines and MapReduce for batch, and Flume for real time. In practice, we've been doing it with NiFi. You will hear about NiFi a lot if you attend this kind of event; NiFi is a very hot thing, also for feeding data into search engines. Is Solr better than Elasticsearch? It depends on the workload — in many cases maybe not, in some cases yes. The cool thing about Solr, and about having it inside Cloudera, is of course the integration with the rest of the analytical stores. Solr actually stores its data on HDFS, so you can imagine that this enables better integration than having it in a separate store outside the cluster. In real time, you are basically going to use these four things: NiFi for ingestion, to bring data from outside into the platform; Kafka as the big bucket in which you store the events; and then NiFi again to pick up those events and store them in Kudu or in HBase, okay? Is this the only option you have? No, but using this stack within the cluster will, similar to what I said about Solr, give you better integration. Okay, now let's jump to the future of Cloudera: the Cloudera Data Platform.
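A minimal sketch of that partition-by-date, bucket-by-customer layout for the warehousing layer. The `date=...` directory convention mimics Hive-style partitioning; the table name, record fields and bucket count are invented, and the hash function is just one stable choice:

```python
import zlib

NUM_BUCKETS = 4  # illustrative; real bucket counts depend on data volume

def hash_bucket(key, buckets=NUM_BUCKETS):
    # Stable hash, so the same customer always lands in the same bucket
    # file and engines can prune files when filtering by customer.
    return zlib.crc32(key.encode()) % buckets

def target_path(record):
    # Partition directory by the most-queried column (date), then bucket
    # the file within the partition by a secondary filter column (customer).
    return (f"sales/date={record['date']}"
            f"/bucket_{hash_bucket(record['customer'])}.parquet")

records = [
    {"date": "2019-11-19", "customer": "acme",    "amount": 10.0},
    {"date": "2019-11-19", "customer": "initech", "amount": 7.5},
    {"date": "2019-11-20", "customer": "acme",    "amount": 3.0},
]

for r in records:
    print(target_path(r))
```

A query filtering on date skips whole directories; one also filtering on customer then only opens the matching bucket file inside each partition — that is the pruning the speaker is describing.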
So as I mentioned, Cloudera has started this trip from being a big data tech company to being the enterprise data cloud, the digital transformation platform, okay? And those are the four pillars they are building on top of. First, it's a hybrid cloud — that is something you can basically only have with Cloudera, because you can have a Cloudera platform with some things on Azure, some things on Amazon, some things on Google Cloud and some things on-prem, and have a single pane of glass to control all these different workloads. They also have a Replication Manager and a Workload Manager: when it detects that workload X is running slow in one environment, you can replicate it to another environment to relieve the pain when you have a burst in usage, or things like that. Second, multi-function: that remains as it was, only better — the types of workloads you can tackle keep getting larger. Third, secure and governed: being a single platform, you can have a nice data context with everything governed, which is also quite unique. And fourth — a change for Cloudera — it now becomes completely open. Before, 96 or 97% of Cloudera was open source and 3% was not; now everything will become open. Okay, this is the Cloudera Data Platform. I'm going to start at the top. It is a hybrid and multi-cloud platform. It has services to run on the edge — basically MiNiFi — and you can deploy parts of CDP on multiple clouds: Azure, Google and Amazon eventually, right now only Amazon. You can have a private cloud, you can have it on-prem, and you can actually combine all of these in a hybrid platform. The second part, starting from the top, is the Cloudera SDX: the layer that you put in every environment to have governed access and a governed data layer, okay?
So, basically, for every environment you define a kind of data lake where the data lives, and on top of the data lake there is metadata management for its security and governance. Then you build everything on top — all the compute, basically. Now I'm going to go to the bottom of it, which is the one view: a single pane of glass to access all the different environments that a CDP platform may comprise. Later on I'm going to give an example, and you'll probably understand it better. Under the hood, what you have is Cloudera Runtime. That's the new name of the distribution, and it's basically the merge of CDH — the Cloudera software from before — and HDP, taking the best of both worlds, as I said before. Inside CDP you can create data hub clusters, which are basically traditional clusters, more or less, that run on virtual machines or on-prem, okay? And then the new thing here is what they call the analytical services: services that run on top of Kubernetes. Right now — and I'm graying out the stuff that is not available yet — you can create these Kubernetes-based services for data warehousing and for machine learning. Okay, so what is this Cloudera Runtime, the new open-source Hadoop distribution that Cloudera is releasing? It basically adopts the superior technology from each side. Before, in an HDP cluster, you had Ambari as the tool to manage the installation of all the services in the cluster, and Cloudera Manager was its counterpart. In my opinion, Cloudera Manager was better — and in Cloudera's opinion too, because now Cloudera Manager will become the only management tool per data hub cluster. Then Sentry and Ranger: as I said before, for me Hortonworks did a better job by adopting Ranger.
Cloudera thinks the same, because Ranger will become the default tool for authorization and authentication, to the detriment of Sentry, the tool that Cloudera was using before. And so on with Cloudera Director and Cloudbreak, and with Hive on Spark and Hive on Tez: Hive on Tez will become the new flagship ETL version of Hive. Then some technologies that were overlapping got merged, like Hue and DAS, or BDR — the replication tool in Cloudera — with DLM; they are merging all these things. Navigator is combined into Atlas: it will still be called Atlas, basically an Atlas 2.0 that absorbs Navigator. And some technologies they are actually keeping both of. I was curious to know what would happen with Impala and Hive LLAP, and they are actually keeping both; we'll need to see which one you would use in which situation — Impala being more suitable for traditional BI queries, and Hive LLAP being very good at caching queries. It's difficult to say which is which. Same with Parquet and ORC: they're actually keeping both. Kudu will be there. Also Hive 3.0, which brings ACID — a long-awaited thing for the community, having ACID on top of Hive. Druid will also be there, okay? The Cloudera Data Science Workbench is there, with Zeppelin; NiFi, of course; Phoenix; Knox; Livy — all these things are there. A new thing is Virtual Private Clusters, which were added in Cloudera 6.2, and that's basically what allows you, even on-prem, to separate storage from compute. You will have private clusters: the very basic cluster you need is the storage cluster — like in the cloud, the so-called data lake cluster — and then you create the compute engines on top. And entering preview in the first release of CDP will be Ozone, an S3-like object store in Cloudera, even on-prem, which is quite a breakthrough.
Okay, and Atlas I already mentioned. These are the places where you will be able to run Cloudera — the form factors, the actual products that you can install. There are three, and you can actually combine them, okay? The first one, already released — although to get access you need to talk to Cloudera — is CDP Public Cloud, as of now on Amazon. That is the form factor for the cloud. You also have CDP Data Center, released in principle on the 15th of November — again, to get access you need to talk to Cloudera — which is basically the version that runs on bare metal. And an evolution of the Data Center will be the Private Cloud, to be released in the middle of next year, okay? So let me more or less explain each of them to give you an idea of what they do. This is the CDP version for the public cloud. Basically, you have a management console to govern everything. For the data, you rely on S3, or Azure Data Lake, or the Google equivalent, to basically play the role of HDFS. The moment you create an environment, you need to define where the data is stored — S3 on Amazon, Data Lake on Azure, or Google Cloud — and then a small cluster is automatically created, with at least two nodes, that manages this storage and holds all the metadata and security stuff: it has Atlas, it has Ranger; all those things live in this small management cluster, which is the data lake cluster. Once you have this base cluster, you are able to create data hub clusters, for which there are templates. For example, there is a data hub cluster template for data engineering with Spark — you just click, click, click and it creates it. But you can also build a traditional custom cluster, where you choose which services to include and which not, and how you distribute everything. So there will be templates, pre-built clusters, but you can also do your own.
And those are the traditional clusters that run on virtual machines, as we know them today. The new things are the analytical services, which are meant for ephemeral workloads. So when your HR department, or the sales department, needs to analyze data only at the end of the month, you can create a Kubernetes-based data warehouse — basically an Impala cluster — that reads the data for just one day, and then you turn it off. And the same for machine learning. Later on, Cloudera will release Data Flow, Data Engineering and the rest of the workloads as analytical services as well. In the Data Center version you don't have these Kubernetes-based services, but the rest is pretty much the same. Instead of using S3 or EC2 instances, you use the physical or virtual hardware of your on-prem deployment. On top of this hardware you create a base cluster — the equivalent of the data lake cluster, which on-prem they call the base cluster — and this provides the data context: basically HDFS, the Hive Metastore, Atlas and Ranger, the basics. Then on top of it, as in the cloud, you can create the different data hub clusters for the different workloads, or have just one with everything, and so on. The Private Cloud is the evolution of the Data Center that mimics what you can do on the cloud. You will need your traditional on-prem hardware — your big data servers — plus a Kubernetes deployment; Cloudera will actually release their own Kubernetes distribution, still to come. And with that you will be able to create these ephemeral workloads on-prem. The very cool thing about CDP is that you can create multiple data centers, multiple private clouds, multiple public clouds, and have everything governed under the same CDP umbrella.
And then Cloudera offers, as a SaaS, this management console where you control the different environments. And as part of this SaaS service there are three services — the Workload Manager, the Replication Manager and the Data Catalog — that monitor how these environments are doing. When the Workload Manager detects that some workload is struggling in one environment, you can use the Replication Manager to move data and workloads semi-automatically to a different environment that you just created — maybe a data hub or an ephemeral cluster, okay? And of course the Data Catalog, on top of everything, helps you see all the different SDXs, all the different data contexts in the different environments, from a single view. This is pretty much the unique thing about Cloudera. Okay, so to wrap up the talk, there are a few bullet points that I want to raise. Multi-workload, multi-function big data platforms are now a thing, okay? Storage and compute detachment is the new black — at least for Cloudera; it already was for a few others — and they are really building on that. And while that is the new black, the new white is Kubernetes: being able to leverage containerization even for traditional big data tech like Cloudera. On top of that: minimize, centralize and simplify all the admin and security tasks, with this single view, with SDX and so on, okay? It's able to run multi-cloud — AWS, Azure, GCP, or even a private cloud — so you are no longer attached to one cloud. You know the story: you do the whole deployment on Amazon or Azure, then the company decides to go to another provider, and you need to reshape and re-engineer everything. Actually, in my team I have guys that have spent a year on a project because of these types of decisions — we were on Azure, now we go to Amazon, so let's re-engineer the whole big data thing for the other cloud. These things actually happen.
And you can actually combine all of them together: something in one cloud, something in another cloud, something in the private cloud or on-prem, all combined in this hybrid environment. There is a new Hadoop distribution that is the evolution and the improvement of the two top distributions that we had out there. You can combine data lakes — base clusters, basically, in the on-prem equivalent — with permanent compute clusters and with ephemeral clusters powered by Kubernetes, all with fully open-source software. Thank you a lot for your attention. Now we have five minutes left for questions — not that I'm seeing many questions in the previous talks. If you want to ask them, welcome; if not, I'll be around here or outside if you want to chat with me for a while. Thanks all.