It's so nice to see everyone here in Chicago. I'm Sunil Govindan from Cloudera. I lead the compute platform team and the YuniKorn scheduling team at Cloudera. I've been in the big data space for over a decade, and I've contributed, in various roles, to the YuniKorn scheduler, a native batch scheduler for Kubernetes. In today's session, I'll be covering Cloudera's big data migration journey to Kubernetes.

Let's take the example of an enterprise focused on adopting AI. It will look across various categories and try to assess the impact of AI adoption. Adopting AI could increase the productivity of R&D, generate content faster for the marketing team, or even save calls for the HR team. The common underlying factor behind such big improvements is data. The data itself is the differentiating factor here, and it is what powers AI adoption.

So I want to take a quick peek at the data itself and its relevance. The first point is the data explosion: around 75% of data is unstructured. Cloudera manages 25 exabytes of data, which is roughly 20% of the overall data, and S3 held around 200 trillion objects as of 2023. Everyone needs access to this data, and that access is expected for every type of application, AI or non-AI; we are seeing around a 50x increase on the consumer side of access. But the question is: is storing and managing the data enough? I'd say no, because you need to get the best value proposition out of the data; otherwise it doesn't matter. To do that, you need powerful engines that can extract those insights.

Here is a quick overview of the data lifecycle and the use cases Cloudera usually solves with big data engines. The first use case is collect, which is data ingestion. It enables customers to stream data into their data product, with capabilities such as analyzing the streaming data for complex patterns and gaining intelligence from it; examples include risk analysis or fraud detection. The second use case, data engineering, provides a tool set for ETL processes and covers a large set of users. The third is the warehouse, which enables business intelligence use cases and makes sure you can do the right reporting and analytics on the data. The fourth is the operational database, which delivers scalable, real-time access and keeps your structured data colocated with the unstructured data so you can still run analysis on top of it. And finally, machine learning itself: it empowers organizations to build and deploy machine learning models and add the capabilities their businesses are looking for.

Let's look at the engines I just mentioned, but before that, I want to cover one more use case. From the data engineer's or data scientist's perspective, they will usually build pipelines: take data from a source, run some analysis, and write it to a final store. One more challenge is that whatever changes they make at the source have to be reflected in the target systems as well, in real time. I'm taking a small example from one of our customers: they take data from multiple sources.
Of course, the scale will be huge. They apply the updates, or the delta, and then apply the final updates in data processing; usually that involves ACID transactions, or multiple operations on top. Then they take periodic backups for auditing and compliance purposes and run historic queries, where time travel is very helpful. And finally come the prediction services themselves: with all of that, they can build good ML models and draw data insights.

If this is the typical use case we see, let's look at which engines help achieve it. We have Apache Iceberg as the table layer, and engines like NiFi, Spark, Flink, and Impala all help make sure the use cases I mentioned can scale and deliver what you are really looking for.

I've already mentioned a few of those engines. In our data platform we have more than 30 open source projects, and many of them are core data engines. Because of that, the compute itself becomes very complex. Looked at through the lens of deployment architecture, some of these deployments are long-running stateful services, and some are batch workloads that are burstable, going up to 20,000 or 30,000 containers and shrinking back. A classic example is Apache Spark itself. These engines, with the use cases I mentioned, are pivotal for the generic use case as well. This is how our platform was designed, and with all these engines and use cases, we want to make sure that in the new era we can achieve the same thing while addressing some of the gaps.

I want to take one more look at the timeline: the data processing methods and how they have evolved. Just in the last decade, a great many engines have arrived; most recently, LLMs are a new addition, and the pace of change is increasing day by day. Thanks to open source innovation, it will continue to grow this way. The change applies to the platform as well: we had YARN in the past, and now Kubernetes is there, helping us scale even better in a cloud native way. The key takeaway is that data is the constant: engines change, the platform changes, but the data, at the end of the day, stays. New engines and platforms need to be adaptable and flexible to ensure they can keep processing that same data. The action for us here is to embrace this change: new engines will keep coming, existing engines need to evolve, and the platform needs to be flexible as well. It's all about adapting to this change. And one key thing we look for is the open source element; most of these engines, as you can see, are open source, and it's essential that we avoid vendor lock-in.
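To make the time-travel point concrete, here is a minimal sketch of a historic query against an Iceberg table from Spark; the table name and timestamp are hypothetical, and this assumes a Spark session already configured with an Iceberg catalog:

```python
from pyspark.sql import SparkSession

# Assumes the session is already wired up with an Iceberg catalog.
spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# Read the table as it existed at an earlier point in time, e.g. for an
# audit or compliance query. "db.events" and the timestamp are hypothetical.
historic = (
    spark.read
    .option("as-of-timestamp", "1700000000000")  # milliseconds since epoch
    .format("iceberg")
    .load("db.events")
)
historic.show()
```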
Now, the same use case I mentioned earlier, seen through a platform architecture view of the company, looks something like this. You have your compute at the bottom, then the storage layer, then the resource management layer (that was YARN in the past), then your metadata catalog and security elements, and then your applications or engines sitting right on top of that. Finally, the business data, or the business code, sits on top of all of this, because it makes use of all these engines to derive meaningful insights.

I want to give one more view of the same stack, from a resource point of view. For the same use cases we saw earlier, like streaming and batch, I can structure the overall cluster capacity in this hierarchical manner and assign a resource quota to each part, so those engines execute within the realm of the quota assigned to them. This was very common in the past decade for many big data use cases.

Now, on to the challenges of managing such a big data platform. I want to show four pillars, and through those pillars explain the challenges we saw, how we solved them by adopting Kubernetes, and beyond that, what extra challenges we ran into and how we solved some of them. That is how I want to take the rest of this session.

The first pillar is resource management. Deploying different versions of applications, each with different dependencies, was one of our biggest challenges. It is not easy at all: when you run different versions of the same application, there will be conflicts, and we ran into many issues. Of course, the noisy neighbor issue is a big challenge too; containers running on the same machine can impact performance, SLAs, and so on. And it was very difficult to autoscale such workloads based on workload demand.

The second pillar is scalable clusters. From the cluster point of view, most or all of our clusters had storage and compute colocated. This is the data locality idea many of you might have heard about in the past: you want to send your compute to the machine where your data lives. We will look at how we solved this in the upcoming slides.

The third pillar is operational efficiency. It is important that we adopt the latest and best of the platform without any doubt, and for that our applications and engines need to be easy to maintain, easy to deploy, and easy to provision. That was not the case in the past.

Finally, from the hybrid cloud point of view, most of our applications were hard to port to the public cloud, and we need the same capabilities to be available in the private cloud too. That is not easy at all, and it was another challenge we saw.

These are exactly the challenges that led our journey towards Kubernetes. We felt Kubernetes was the right ecosystem and the right platform on which the next generation data platform could really run. The new data platform in our stack, which we call data services, is completely on Kubernetes, and we see that as the future for hybrid. The cluster form factor we had before was very complex, very labor intensive, and not easy to maintain.
So it had to be replaced with the new stack, which is Kubernetes, and we found it simpler, more capable, and easier for both data practitioners and platform admins. We rearchitected our traditional big data cluster from a monolith to a microservice-based architecture, breaking our overall services into smaller services that are easy to compose and relate to one another. We also leveraged containers, which helped us disaggregate storage from compute; that was another major point for us.

The applications we have also had to be optimized for adapting and creating new applications. One challenge we saw was that it was not easy to create a new application, or to build a new application quickly; for many of our customers it was not easy at all. With the new platform we wanted to ensure they can develop applications on top of the data platform in an easy manner. Finally, maintenance itself: upgrades were always a big problem, so we wanted the platform to be able to upgrade with zero downtime, or at least with high availability.

OK, so the next step in this journey: we adopted Kubernetes, but the question was, did we lose something we had in the past, or did we add new kinds of complexity and challenges to our stack? I'd like to look at a few of these questions. The first is about hierarchical queues: do we have the same flexibility from the platform to manage resources? I'd say not yet. The next was whether our services are multi-tenant: is it easy to onboard maybe 2,000 or 3,000-plus users to the system and have them run queries, workloads, and so on? Can we achieve the same SLAs we had on the previous stack? That was also a major question. Can the service scale the way it used to, say 20,000 or 30,000 containers for a single Spark batch workload? And finally, is it cost effective compared to the previous stack? If some of these are not met, that's a huge problem, because then the new platform won't be helpful for us at all. And the major question was about the data engines themselves, which solve the ton of use cases I mentioned earlier: were they ready to adopt the new platform? Honestly, not quite. Cloudera solved some of these gaps, so let's see how we took the new platform forward.

Resource management is the first vertical. Kubernetes helped our data platform support faster autoscaling, without any doubt, and it was cost effective. We can now run containerized workloads, which means we don't need to worry about noisy neighbors, and we can run any versions of an application with their own dependencies. Managing dependencies became very easy with the containerized form factor, and we were able to solve resource isolation much more simply than we had thought possible in the past.

Now, the challenges we saw with Kubernetes. How do we assign quotas to teams or users based on the use case? How do we submit a bigger batch workload and consider it as one unit? How do we maximize resource utilization in a fixed-size cluster? In the cloud it's much easier: you can autoscale up and come back down. But on-prem your cluster is limited, so you need to make the best of the hardware. So how do we do that?
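For context, the quota primitive plain Kubernetes offers is the namespace-scoped ResourceQuota, enforced when resources are created. A minimal sketch, with hypothetical names and numbers:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota          # hypothetical
  namespace: data-eng       # quotas are per-namespace, not hierarchical
spec:
  hard:
    requests.cpu: "50"      # checked only at pod creation time
    requests.memory: 200Gi
    limits.cpu: "100"
    limits.memory: 400Gi
```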
Some of the limitations we saw with the Kubernetes scheduler: the quotas in Kubernetes are just an add-on, the enforcement is only applied at resource creation time, and the quotas and limits are set at the namespace level. The lack of application-aware preemption was a big challenge, so elasticity was not fully achieved.

As a solution to these resource-related problems, we found there was a gap and open sourced a project called Apache YuniKorn, a batch scheduler for Kubernetes. We open sourced it in 2019, and over the last few years it has become one of the most capable batch schedulers for Kubernetes. There have been deep-dive sessions on the scheduler at previous KubeCons, but I want to touch briefly on what it is. YuniKorn can schedule any type of job: batch, service, long running, it does not matter. It can support huge demand from applications like Spark or Hive. It is very easy to integrate with any new engine because it uses simple annotations and labels, so you don't need to make big changes in the big data engines, or in your own application, to use the scheduler. It supports hierarchical queues, which means you can set quotas at each level, and it supports various scheduling policies such as fair ordering, FIFO, and gang scheduling.

Let me take the same queue hierarchy I showed earlier in the use case: at the root level you have streaming and batch, then the warehouse queue and the ingest queue underneath, and finally data engineering. I'm submitting some jobs, and each usually arrives as pods. On the left, the warehouse job might be a Hive or Impala job; in the data engineering queue it could be a Flink or Spark job. Overall it's around 20 GB in the streaming queue and 30 GB in the batch queue, so around 50 GB in total. If I set my quota at the top level to 50 and then submit another job, say a Kafka ingestion job, it's possible we will not be able to run it. But rather than rejecting or failing that job, we queue it up, so that whenever capacity becomes available at a later stage, we can schedule it. That is one of the powerful aspects of YuniKorn: you can set resource limits at each level. This hierarchy is very similar to the old form factor in YARN; if you are familiar with it, you will easily relate to this.

Now, the second challenge, which is related to preemption. In Kubernetes, the entire cluster is sorted by priority. It's one big queue, and pods are considered for eviction based on priority alone, so opting out is not really possible. That means any of your pods could be eligible for preemption at any point in time, which is a problem for jobs like Spark: if you kill the driver, the originator pod, your entire application fails. That was one of our critical concerns. Another concern was the priority class: it is a cluster-wide object, which means anyone can change the priority to a higher value, so you can end up with rogue users. YuniKorn was able to solve some of these problems.
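As a sketch of that annotation-and-label integration surface, here is a pod handed to YuniKorn with just a scheduler name and a couple of labels; names and image are hypothetical, and the exact label keys can vary across YuniKorn versions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-example        # hypothetical
  labels:
    applicationId: spark-app-0001   # groups pods into one application
    queue: root.batch.data-engineering
spec:
  schedulerName: yunikorn           # hand scheduling to YuniKorn
  containers:
    - name: spark
      image: apache/spark:3.5.0     # hypothetical tag
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
```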
But to talk about preemption in YuniKorn, I need to take a moment to explain how YuniKorn works. Unlike the default scheduler, which uses a FIFO queue, YuniKorn works at the hierarchical level, in a parent-child relationship model. Resource limits, configurations, and access control can almost all be defined at any level in the hierarchy and are inherited by the child queues. Preemption makes use of the fact that YuniKorn allows specifying both a guaranteed amount and a max limit for each queue. When a queue's resource utilization exceeds its guarantee, it can take extra capacity up to the max limit, perhaps because the sibling queues are not using theirs; so it grabs some resources. When actual demand later comes from the underutilized queue, we preempt the applications in the overcommitted queue. But we preempt in such a way that we kill only the executors; the originator pods are kept alive, so that at a later point, when the desired capacity is available, the work can resume and the job can continue. That is the idea. It is also possible to spare certain pods by marking them as not preemptible, which makes it very easy for admins to control preemption in a much more granular way.

Here's a quick demo I wanted to show. This is one of our new platform experiences from Cloudera, CDE, which is data engineering on Spark. We create multiple virtual clusters for running different Spark versions. For each virtual cluster, say a Spark 2 or a Spark 3 cluster, you can specify the guaranteed capacity you want, and you can also specify the max capacity. I created two virtual clusters, VC1 and VC2, and for both I set 10 cores as guaranteed, with the ability to grow up to 20 cores if capacity is unused. This is what the resource utilization dashboard looks like, with the exact hierarchy; you can specify the hierarchy based on your organization's demands.

I run some jobs first and take utilization up to 10 cores, which meets the guaranteed capacity. Once the guaranteed capacity is met, I run one more job, job number two, which takes another four cores, going up to 14. That means I'm now above my guarantee. At this point, if demand comes from the second virtual cluster, the Spark 3 cluster, which is what I'm trying here, we need to grab the resources back. There are two ways this can happen: if we enable preemption, YuniKorn goes and kills the overcommitted job and reclaims the capacity; if not, it waits for the job to complete and then gives the capacity to the second job. Here you can see the second job is now running from the second virtual cluster and its runs are ongoing. I'm showing the logs to confirm the Spark job is making progress. That is good.

The second scenario: I change the max limit of the cluster to 12 cores, which means it cannot grow beyond 12. I start a job in VC2 and the usage goes up to around nine, then I run more jobs, run 93 and run 94, to see what happens when you hit the upper quota. They simply have to wait. You hit the max at 12 cores, so no more executors can run, and we get some notices: Kubernetes gives us the feedback that the pods are in a pending state.
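A sketch of how such guaranteed/max limits might be expressed in a YuniKorn queue configuration; the queue names mirror the demo and are hypothetical, and exact fields and units depend on the YuniKorn version:

```yaml
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: vc1               # e.g. the Spark 2 virtual cluster
            resources:
              guaranteed:
                vcore: 10           # always claimable
                memory: 40Gi
              max:
                vcore: 20           # borrowable while siblings are idle
                memory: 80Gi
          - name: vc2               # e.g. the Spark 3 virtual cluster
            resources:
              guaranteed:
                vcore: 10
                memory: 40Gi
              max:
                vcore: 20
                memory: 80Gi
```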
So let me move on to the performance and scale part. On the previous legacy form factor, you could run Spark, Flink, and almost all the different types of data workloads, and it could scale up to, say, 10,000 nodes if you wanted; that was the scale the previous platform could reach. The new platform, of course, comes with its own challenges, as mentioned, but how are we faring in terms of performance numbers? We ran some TPC-DS queries on Spark. The leftmost result is Spark on YARN, where we got an execution time of around 48 or so, I think. When we ran a similar job, the same job with the same spec, as Spark on Kubernetes, we got almost on-par performance or slightly better. We also ran a Spark 3 job, and it was mostly the same. Same job, same data set, same cluster size: we made sure all those things matched and compared the results. This gave us confidence that we are not losing performance, even though some aspects, such as pod provisioning, are slower in the Kubernetes form factor. It shows that YuniKorn and Kubernetes together actually modernized our data platform, that our new platform is headed in the right direction, and that we can solve the majority of the use cases we are looking at.

Now let's look at the second pillar, which is scaling clusters. One major challenge we were seeing was that storage and compute were colocated. If storage and compute are colocated, you have more trouble scaling the cluster: you may need more compute, but you don't need extra storage hardware like SSDs. It is not required at all, but because it is colocated, you are forced to buy the additional storage as well, and thus your TCO won't be great. So one of the fundamental things we did was disaggregate storage and compute and make sure our engines are capable of accessing data from any location. A question we usually get is: what about data locality? The previous decade was all about data locality, so what happens to it? The point is that this is a new architecture: you have better networks now, and better I/O, which compensates for the loss of data locality, so performance won't be impacted as much.

Then there is specialty hardware: GPUs and SSDs are very scarce resources. How do you make sure the right application gets that dedicated hardware? If someone else uses it, you lose it. When you are a larger organization with a massive cluster, there is a real chance you lose the majority of those scarce resources to work of a lower-priority nature, so we needed to make sure we handle that. That was a crucial problem for us.

And finally, autoscaling itself: are we scaling the way we ought to? If there is enough compute demand, we can scale out; but if there isn't, we should not be wasting a lot of resources, which means we need to shut those clusters down as well.
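On the specialty-hardware point above, one common Kubernetes pattern is to fence GPU nodes with a taint and have only the ML pods tolerate it. A sketch, with hypothetical node label, taint, and image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training                      # hypothetical
spec:
  nodeSelector:
    accelerator: nvidia-gpu              # hypothetical label on GPU nodes
  tolerations:
    - key: dedicated                     # hypothetical taint keeping other pods off
      value: gpu
      effect: NoSchedule
  containers:
    - name: trainer
      image: example/ml-trainer:latest   # hypothetical
      resources:
        limits:
          nvidia.com/gpu: 1              # extended resource from the device plugin
```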
So with Kubernetes, what we got was basically a set of features. The first was the CSI plugin itself. It ensured we could connect to any storage, be it in the public cloud or the private cloud, it doesn't matter; we use a CSI storage plugin so we can easily connect, and our hybrid story becomes much simpler. Kubernetes also helped make sure the right candidate pods land on the specialty hardware like GPUs; for example, if I have an ML job, I can make sure that only ML jobs, and ML users, get that hardware. There is also less administrative overhead for spinning up clusters, and deployment is very fast in Kubernetes, so our spin-up time improved a lot. And determining the right cluster size, which is usually very difficult, became possible to tackle with autoscaling: we could figure out the right cluster size and the optimal usage pattern. So that is what we got from Kubernetes.

We still had some more problems, though; how did we solve those on the Cloudera side? For disaggregated storage and compute, Ozone played a crucial part. Apache Ozone is a scalable, redundant, distributed object store, and it can store billions of objects on-prem. That means we can pack more data per node and go much denser with storage on each node, so you don't need a large pool of storage clusters: you can condense them into a small number of storage nodes and run as many compute nodes alongside as you like. This helps the TCO story, because the cost of maintaining multiple clusters is now much lower. Ozone was a crucial player in this part.

We also put a lot of development effort, from many of our developers, into improving the cloud connectors. Many of the big data engines, used as-is in the cloud, will not give you the performance you are looking for. The cloud connectors team did a lot of work; the S3A connector is one example where a lot of development and research went in, and that enabled a very powerful data access model for our big data engines in the cloud. Finally, this is how the stack looks overall: you have the commodity infrastructure, then storage and compute, but disaggregated, with storage powered by Ozone and compute coming from the Kubernetes stack.

Now let's look at the operational efficiency aspects. The challenges we saw in the big data platform were mostly about how to upgrade your services without downtime, how to keep the services highly available, and whether it is easy to onboard new applications. We also saw some additional challenges from Kubernetes itself. The frequent Kubernetes releases literally hurt us a lot, because they mean we need to upgrade faster, and it is not easy to upgrade a big data platform. It is not easy at all. And the evolution of the APIs can break a lot of things, which means more engineering has to go into making sure the various places in our code base are using the right APIs. Upgrades were always a pain for us.

The way we solved it, or rather are solving it, is with a set of principles. We made sure we have a clean separation of Kubernetes API usage from our core services and apps, and we try to avoid mixing the APIs, the client libraries, or the binary packages themselves into the source code. That abstraction is genuinely helping us make our upgrades less painful.
Also, we made sure CSI storage is used only for ephemeral purposes; if we want any persistent storage, we use cleaner data abstractions. And finally, the monolith-to-microservices architecture itself. The dependencies between the services are crucial. We may have a big data platform with maybe 14 or 15 services; if you split them into, say, 90 or 100 microservices, it is not easy at all. So it was very critical for us to make sure the dependencies were called out correctly and designed correctly. Features like this help us a lot: for example, designing a quiesce-mode option so that a service can be brought down to zero, which may require touching maybe five or eight services. That principle is definitely helping us, but there is still a lot of work to do, even in our engines, so it will continue.

Finally, the hybrid cloud segment. In the previous generation architecture, the data and compute were together, as we touched on. In the second generation, we disaggregated storage and compute in VMs in the public cloud. And in the new generation, we are making sure we use a proper containerized form factor on Kubernetes; that is the gen three we are looking for. The cloud agility and flexibility on-prem really helped a lot, and running the workloads on the right infrastructure is crucial for us, so we want a unified deployment architecture. The lakehouse powered by Apache Iceberg made the data lifecycle very easy to maintain. And onboarding new applications is much simpler now because everything is containerized: new applications can just come on, users can come in, and they can make the best of our platform.

How do we see the new architecture? I want to start with the platform admins. They are one of the pillars, since they operate our platform, and we wanted to make sure their life is easy. Then the citizen developers within the customer org use the platform provided by the admins and develop their applications on top of it. And finally, the business users within that org take those applications and draw further inferences from them, be it a quarterly business report or anything of that kind. Addressing all three user segments was very crucial for us. With that, we were able to design an advanced hybrid platform that can scale to any type of public or private cloud, with a proper governance layer, running different services (these are nothing but the experiences, or services, rather than servers or applications) and offering that as a platform for running your AI workloads. That is the final platform we ended up with. With that, I will close my talk. If you have any questions, please go ahead.

Hi, I was interested in Apache YuniKorn. Did you ever compare its performance to the Spark operator for running Spark batch jobs on Kubernetes?

Yes. In the virtual clusters of the Spark experience I showed, we were using Apache Livy: Livy was the endpoint used to submit the Spark jobs, and YuniKorn was then scheduling them. But we do have integration; the YuniKorn scheduler integrates with the Spark operator, so you can use the Spark operator itself, and with that, the jobs will be scheduled by YuniKorn. We have a few blogs out on how to use that.
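For reference, a sketch of what that operator path can look like: a SparkApplication asking for YuniKorn as its batch scheduler. This assumes a recent Spark operator version with YuniKorn support, and the names and image are hypothetical:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi                  # hypothetical
  namespace: data-eng             # hypothetical
spec:
  type: Scala
  mode: cluster
  image: apache/spark:3.5.0       # hypothetical tag
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  sparkVersion: "3.5.0"
  batchScheduler: yunikorn        # let YuniKorn place driver and executor pods
  driver:
    cores: 1
    memory: 1g
  executor:
    instances: 2
    cores: 1
    memory: 1g
```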
But internally within Cloudera, we were using Livy instead of the Spark operator.

Sounds good, thank you.

Hi, one of the challenges we see around adopting Spark on top of Kubernetes is shuffle space. In a hardware-based model, we had local disks, which allowed us to handle shuffle space effectively. How do you solve that in the decoupled architecture, where the containers might not have the same amount of local storage?

Yeah, that's an excellent question. This was one of our concerns as well when we migrated to the new Spark on Kubernetes; I did not go into detail about the storage layer. If you have enough space in your root disk, it is much easier, because you can utilize that space; but in the majority of deployments, you may not get that much root disk space. So what we did was enable a Spark 3 feature where you can specify which partition, or volume, your shuffle actually uses. We mount a volume, maybe an EFS volume or some high-performance data volume, and use it for the shuffle by making use of that Spark functionality. But it is available only in Spark 3, and that is one of the biggest caveats.

Thank you so much. That's it.
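As a closing sketch of that volume-backed shuffle setup, assuming Spark 3 on Kubernetes: executor volumes whose names start with spark-local-dir- are used by Spark as local (shuffle) directories. The claim name and mount path here are hypothetical:

```python
from pyspark.sql import SparkSession

# Mount a PVC as a shuffle/local directory on each executor. Spark 3 on
# Kubernetes treats volumes named "spark-local-dir-*" as local storage.
prefix = "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1"

spark = (
    SparkSession.builder
    .appName("shuffle-on-volume")
    .config(f"{prefix}.options.claimName", "shuffle-pvc")  # hypothetical PVC
    .config(f"{prefix}.mount.path", "/shuffle")            # hypothetical path
    .config(f"{prefix}.mount.readOnly", "false")
    .getOrCreate()
)
```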