I'm about to tell you the story of three engineers, an application engineer, a DevOps engineer and a data engineer, each with different needs and different views. And why can I tell you their story? Because I had the unique opportunity to play all three of them. Hello, my name is Laurent Bivas and I work for Iguazio. I tend to call myself a joker engineer, and as you can see, I'm really into Legos, but this is as far as they go in this presentation.

For the past decade or so, our big data world has changed dramatically. Since its first release back in April 2006, Hadoop really defined the big data era. Those were the days when networks were slow and memory was expensive, and we needed something to process our ever-growing data. We believed that by bringing the compute closer to the data, we would solve all of our problems. We remodeled our applications into MapReduce pipelines, and it actually worked, and it was fairly easy to deploy these scenarios.

Then our big data world grew. Our software needed a lot more than just MapReduce, and our users demanded a lot more analytics. Memory became slightly cheaper, so we had the idea of loading everything into memory and computing as much as possible in there. We worshipped the RDD that Spark brought us. We even started storing some of our data in non-Hadoop, NoSQL data stores like Redis or Cassandra, or maybe Elasticsearch. But we needed a lot more. We placed Kafka and RabbitMQ right next to our APIs. Our application flow was flooded with a growing amount of data, and we needed a method to process all those incoming events. We built event-processing pipelines with spouts and bolts, we added Flink, and we even turned back to Spark and got our notion of streaming with micro-batching. Overall, all those frameworks were basically a way of bringing more and more data into our big data world. Now that we were so good at collecting data, we needed to figure out what it all means. Presto was added to the mixture.
With these solutions in place, we turned to machine learning, and later adopted deep learning. And these days, it seems like many companies maintain an ever-growing technology map. Deploying this infrastructure became really hard work; a lot of companies actually base their offering on how well they can deploy this technology map into your organization. Even where I work, at Iguazio, we provide Hadoop-compliant APIs, because we deal with companies that require big data solutions.

The so-called cloud era didn't change much of what we knew. Even the major cloud providers had some alignment with the Hadoop ecosystem: most of them have Hadoop integrated with their services, and most have an implementation of Hadoop-compliant APIs. Take, for example, Amazon, the leading cloud provider. We have S3 working as a Hadoop-compliant file system; EMR, Amazon Elastic MapReduce, runs Spark, HBase and Yarn, and later on they added other capabilities as well. And if you use TensorFlow, Amazon has a pre-configured AMI running for you on AWS. And we started streaming stuff through Kinesis. But as you can see, nothing really changed. The hard work of deploying that tangled technology map was replaced by the flexibility of choosing which serverless data services you're going to use, and by trying to figure out what your invoice at the end of the month is going to be. And no one actually can.

The cloud brought us to the serverless era. Our data was in the cloud; we didn't have to worry where it was. It was simply there, or occasionally it wasn't. And the real major leap was application-less code. This is when Lambda was introduced. We were promised a simple way to process incoming events with just code. Well, if you're familiar with Lambda, that was the promise; we actually got something else. So let's do a short recap. Our data flow has obviously evolved.
We added so many big data frameworks to our toolbox. These frameworks ran under different schedulers: Spark, for example, was using Yarn, some of them were using Mesos, and the applications sometimes used something else entirely. And now, with the introduction of new architectures and new analytic tools, we need to rethink our ecosystem.

Looking back at our three engineers, we need to re-understand their current needs. Application engineers want what every application engineer wants: agile development. They want to release as frequently as possible and get user feedback as soon as possible. Data engineers are pretty simple: they just want stuff to keep working. They don't care how, but keep it working and keep their data available. And the DevOps engineers want their tools to keep working, but with less maintenance. They want an easier way to manage all those services, applications and frameworks, and they will not manage multiple clusters with multiple schedulers. We must consolidate.

Thanks to the container revolution, we have an easier way to deploy these scenarios, and a new ecosystem to work with. Our combined toolbox should run on a single scheduler, and of course, since we're at KubeCon, I'm suggesting Kubernetes.

So let's first review our data. We at Iguazio look at our data as unstructured object store, structured store, and streaming. This is basically how we look at data everywhere, whether it's in the cloud or on-prem. This is your data. Decoupling the data from the rest of the system requires a shift in the current mindset: our ecosystem should move away from the Hadoop mindset of distributing the data itself, toward distributing access to it. Running in the cloud lets you access your data in a distributed fashion. Think of S3 or DynamoDB or Kinesis: you don't have to worry where the data is, you simply access it from anywhere.
And even if you don't run in the cloud, you can access objects using services like GlusterFS. If you keep your data in Cassandra, you can access it without much problem, and your Kafka cluster can be accessed from anywhere. This is what people actually mean by cloud-native: resilient and always-accessible data services.

In between, we have the orchestration. Kubernetes will schedule our applications, analytics tools and frameworks, and manage our entire configuration. When we align everything to a single unified orchestration, we need to adapt the application layer. We require the topmost layer to be upgraded; we can't simply run anything on top of our orchestrator. These frameworks and applications must be cloud-native. They have to use every tool Kubernetes has to offer. And besides our applications and big data frameworks, we can now leverage serverless frameworks. We can have our own function service, our very own Lambda.

Our big data pipeline must evolve as well, and actually, it already has. It's no longer a pipeline; it's a living system. Our systems and services are constantly processing and analyzing data, accessing the data simultaneously, running functions, microservices and analytic tools. We have data coming in or being pulled out by IoT devices, external sources, dashboards and many more. The entire layer around our data runs on our unified orchestration, and data is accessed from anywhere at any time.

Another evolution we see when moving to Kubernetes is using serverless frameworks, like I mentioned before: Kubeless, OpenFaaS, Nuclio. This is, of course, a blessed move. We no longer need to worry about how to build a Docker image, how to deploy it, how to run it on Kubernetes. We simply write code, and the function framework deploys it for us. It compiles the code, and everything simply runs on our infrastructure.
Another good benefit of functions is that you're not bound to any specific language. Your entire stack can be polyglot, and it already is. Using serverless frameworks is the way forward, but it comes at a great price: usually slow performance, a slow development cycle, and being mostly limited to HTTP endpoints. Not really our very own Lambda. Look at all those serverless faults.

We at Iguazio decided to build a real-time serverless platform, Nuclio. A Nuclio function can have any event source, not just HTTP, and the sources can of course be combined: you can listen simultaneously to Kafka, to HTTP, to Kinesis, with the same function code that you wrote. This gives you a better debugging, testing and execution cycle. Since it also runs everywhere, not just on Kubernetes, you can deploy and test anywhere. Your sole focus is on writing the code, and the rest is taken care of. We even provide you with built-in metrics and logging. This was open-sourced recently, and you should definitely check it out. We're also running a hackathon, which you should definitely check out; the prize is a very high-end drone.

So now you're probably saying: OK, I listened to your talk and everything is containerized. I'm using functions, I'm using microservices, my Spark is containerized. Great. But now I have thousands of containers running on my orchestrator, and each one of them is going to open a connection to my data service. It's going to create a huge load on the system. And you're actually right. When I said we need cloud-native frameworks and applications, I meant that some frameworks need to evolve as well. At Iguazio, we deal with very large clusters and needed a way to optimize data access. We built a solution around shared memory, which brings the data directly into the application's memory. The solution works closely with Kubernetes to allow shared connections, fast data access, and faster connection initialization. Let's take, for example, Spark.
We created a DataFrame implementation that reads from shared memory populated by a v3io daemon running on each of the nodes. This daemon is the sole owner of the outgoing TCP or RDMA connections to the data service. Now, if you're not running something we support natively, like Nuclio, Spark, Hadoop and others, we also have the same solution available as a FUSE mount, which you can consume through a FlexVolume in Kubernetes and read directly from your application. Just like Nuclio, this entire work was open-sourced, and you can check out the solution and work out ways to leverage it with your own data services. We do hope that other data services will offer such acceleration in the near future.

We've talked a lot about how we need to look at our data, our applications, our frameworks, and which new frameworks we need to add. Now let's look at our deployments. I'll put on the DevOps hat for a minute: remember how hard it was to deploy and manage complete clusters? Look at the Spark example. Every aspect of the system is managed by Kubernetes: deployments, services, and even the configuration itself. And a ConfigMap is not just for flat configuration; start-up scripts are easily managed in ConfigMaps too. I will later demonstrate how you can leverage a lot of the tooling Kubernetes has to offer to manage your entire deployment. And how much simpler is Helm compared to Chef or Puppet? You don't have to use another language: YAML, which you already have to use because you're using Kubernetes, is what Helm uses to describe your deployments.

Another thing I'll demonstrate is what our current pipeline looks like. Like I said, the current pipeline is a living system; the data is simultaneously accessed from multiple locations at once. So I'm taking a real example from one of our customers, an IoT automobile company.
Their cars constantly send information: where the driver is, some metrics of the car, and so on. Everything is processed in real time and ingested into our data services. Simultaneously, there are dashboards showing where each driver is and what the alerts for that driver are, and there are also data engineers trying to run analytics on the same data that just came in. Everything is processed simultaneously. Let's jump into the demo.

So what I have here is a completely new Kubernetes cluster running in one of our data centers. I hope you can see it properly. The first thing I'm going to do is create a new namespace for this new customer. And since I can't see what I'm typing, I'm of course going to make a mistake. Great. So we have a new namespace to host this new customer. Now let's create something that allows that new customer to access its new namespace. I'm going to do something that you should not do: I'm going to grant it cluster-admin. Usually you should use a fine-grained binding, but just for the sake of the demo, we create a new role binding. And now we'll do a Helm install of our v3io daemon. So what I'm about to install, in the kubecon namespace, is the v3io daemon chart, which is available for you to use, pointing at one of our data services. Once I hit enter, you immediately see that it's being deployed, and in a few seconds it's operational.

Now I want to show you what I meant when I said ConfigMaps are not just simple config maps; they can do a lot more. I'm simply going to describe one of the ConfigMaps that I'm using. I forgot to update my context. OK, I'll move so I can see as well. What we have here is raw configuration, a JSON file, saved directly as a ConfigMap.
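For reference, storing a whole JSON file, and even a start-up script, in a single ConfigMap looks roughly like this. The names and values here are hypothetical, not the actual demo configuration:

```yaml
# Hypothetical example: a ConfigMap carrying a raw JSON config file
# and an init script, rather than flat key-value pairs.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: kubecon
data:
  config.json: |
    {
      "dataService": "data-service:8081",
      "maxConnections": 4
    }
  init.sh: |
    #!/bin/sh
    # Parameterize start-up here instead of overriding command/args in YAML
    exec /opt/app/start --config /etc/app/config.json
```

Mounted as a volume, both keys appear as files under the mount path (for example /etc/app/config.json), so the container reads them exactly as it would read any file on disk.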
That's not the usual way to use a ConfigMap, because people usually use it as a map of key-value pairs. Here we simply place it as a file and then map it into the container. Another thing that I really love doing is the initialization script, which I also place in a ConfigMap. It lets me control the initialization parameters in one place, instead of overriding all kinds of YAML files.

Now let's add Spark. Again, like with the v3io daemon, it will simply be running in a matter of seconds. Very simple. Now what I'm about to show you is Nuclio, our function service. We have a playground for you to deploy functions in; I can deploy any pre-loaded function or, of course, provide my own. Hit deploy, and it simply ships to the cloud, in this case our Kubernetes cluster. But this is not how you usually want to do things. You don't want to open another IDE or another tool; you want to use kubectl or other command-line tools to do the automation for you. So, of course, we have an automation tool for you as well. Now I'm going to deploy a function, just like you saw in the UI, using our nuctl command-line tool. It takes the code, builds it into a container, runs all kinds of checks on it, ships it to a registry that we define, and then it's running in our Kubernetes cluster.

Now I'll also deploy a UI for our demo. While we watch that deploying: there's a simple map. Currently there's no data, because we didn't stream anything in yet. We just launched the application, and the dashboard is waiting for data to be streamed in. So let's stream some data. It's just simulated streaming, but the real point here is that the function I just deployed starts to receive the events and populates all the real-time data on the map. This is, of course, with a lot of drivers being hammered into the system.
But as you can see, everything is constantly live during the presentation. Now I have two ways of accessing the data: one is the function, the other is the dashboard. I also mentioned that data engineers might want to use, let's say, Zeppelin to run Spark jobs. So let's use Zeppelin. We'll create a simple notebook, let's call it KubeCon, and run a simple Spark job, just to show you how I can access the data simultaneously as it's coming in from the function. I have a very simple job that runs some analytics and shows the results. This is, of course, a cold start of Spark, so it might take a few more seconds. But still, the data keeps coming in: it's constantly processed by the function, constantly read by the dashboard, and now also by the Spark job. This is what I meant by a living system; everything is constantly being accessed. And here we have the output: a single driver that fits the criteria.

Now, during the talk I said it was going to be easier with Kubernetes, that we'd have a much easier deployment. And no one actually stopped me and said: this is not easier, you keep running kubectl, you're using nuctl, and you're still adding YAMLs and init scripts and things like that. Too bad no one interrupted, because actually, yes, this is not the way to do it. This is not the way to deploy; it's mimicking the old, bad way of doing deployments. So what should you be doing? Let's kill the current application. I'll use Helm to remove everything: Spark, our daemon, the function, everything we just installed, and then I'll show you how to really do it. When I said we should leverage the tools in our toolbox, I meant that Helm is not just for showing how easy a single install is; it's for installing a complete application. So as the cluster is shutting down, what I want to show is that Nuclio provides you with a YAML.
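As a hedged sketch of what that function YAML might contain, here is a Nuclio-style function resource with more than one trigger. The field names and values are assumptions from memory; check the Nuclio documentation for the real schema:

```yaml
# Sketch of a Nuclio-style function resource; names and fields are
# illustrative assumptions, not the exact demo manifest.
apiVersion: "nuclio.io/v1"
kind: Function
metadata:
  name: driver-events
  namespace: kubecon
spec:
  runtime: python
  handler: main:handler
  triggers:
    web:
      kind: http
      maxWorkers: 4
    drivers:
      kind: kafka
      url: kafka-broker:9092       # hypothetical broker address
      attributes:
        topic: driver-telemetry    # hypothetical topic name
```

Because it's plain Kubernetes YAML, it can be templated in a Helm chart and installed together with the rest of the stack.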
That YAML is native to Kubernetes, meaning you can edit it and redeploy the function over and over again. It's taking some time. The point is that you can take that YAML, use it in Helm with all the templating Helm provides, and then launch everything with a single command, instead of all the stuff I just typed in, which, I have to stress, is the wrong way of using Kubernetes. I hope no one took pictures, because this is actually the wrong way of using it. So, the daemon doesn't want to die... OK, now we have a clean cluster. And this is the way to do everything I just showed, including the role binding, the namespace, everything, in a single command. This is the proper way. If you've been around the Helm talks on how to use Kubernetes, you know this: with a single command we deploy our function, our daemon, Spark, Zeppelin, everything I just showed you. What was five minutes of work is now five seconds of work, until everything is running, including the new function, Spark, Zeppelin, and our enhanced daemon.

So, a few tips and lessons learned that we at Iguazio picked up along the way. The first and foremost rule, and I can't stress how important it is: you should really read the manual. I'm neglecting the F, but you should definitely read the manual. I've seen too many hacks people try to pull off with Kubernetes, and the manual is very comprehensive and easy to follow. It's sometimes hard to follow because of its length, but it's not something that makes you say, oh, this is very, very difficult; a lot of the time you simply copy and paste commands. Second is the community. Kubernetes has a great community, and it's not limited to GitHub: you can find the Slack channel, you can find the groups. Everyone in the community is helpful.
But it comes with a special note: like Kubernetes itself, the community is young, so sometimes you might get help that contradicts the manual. Recheck everything you're told. Third, know the tools Kubernetes provides. During the development of many customer scenarios, you don't know how many times I used port forwarding just to check whether a pod was doing what I expected. Do logging: collect everything you can using kubectl. kubectl is a great tool, and you should really understand every option it has. One thing most people miss in kubectl is when to use the -o output flag, because the YAML output often lets you understand what actually happened to a service or a deployment. Note that not all commands accept the -o flag. Fourth, like I showed you, always deploy with Helm, or whatever other solution you choose, but stick to it. Helm has great options and a great understanding of Kubernetes, and as you saw, once you really understand how to use it, your entire application stack can be deployed with a single Helm command. Doing upgrades with Helm is even easier. And like I think Kelsey mentioned: don't ever do kubectl edit, and don't ever SSH in or kubectl exec into your containers to edit them. Simply use Helm.

The fifth tip, for when you have a large cluster with many applications sharing it: do not overuse NodePort. NodePort is great when you're debugging, it truly is, because you know exactly where to connect. But if you stick with static node ports, you'll end up having to manage all of them. Use load balancers; they are a great option within Kubernetes. Now, the sixth item might look trivial, but configuration must go in ConfigMaps.
Don't try to force it into all kinds of other solutions. One I've seen is loading files from a host path, but then someone has to populate that file; everything should go into a ConfigMap, which is very, very easy. And a specific case for ConfigMaps is dealing with YAML. YAML has a very tricky syntax, so when you try to override the command, you sometimes end up with results you didn't expect. So, like I showed you, place init scripts in ConfigMaps and call them, instead of trying to do wizardry with YAML.

And the last one isn't really specific to Kubernetes: for any deployment of large clusters, you should really collect operational data, and not just collect it. If you only collect it, you've done nothing with it; it means nothing, and you could just as well shut it down. You have to collect it and you have to understand it. We are big data engineers, so we of all people need to understand what our data means. And if the data is meaningless, shut it down. Thank you. Do you have any questions?