So, hello everyone. How are you all doing? I know it's been a pretty long day, but I think we can get started just in time. I'm Dilesh Varanupoju, a senior platform engineer here at Comcast. I have 15 years of work experience in total: the first five heavily in design patterns, Java, and the Spring Framework, and later on I moved to cloud-based technologies. I'm also proud to be part of an amazing platform architecture and engineering team. We are a small team, a handful, about 10 engineers, managing multiple platforms based on Cloud Foundry and Kubernetes to support a wide variety of workloads, abstracting multiple clouds under the hood, ranging all the way from private to public. We also have a thriving developer community of about 1,500 Slack users, and about 40,000 application instances (AIs) currently running in production across all our CF sites. So that's all about me and my team.

Just to learn a little about you folks: how many of you are using, or planning to use, any kind of serverless platform? All right, that's a handful, very few. I'm not surprised. Let me ask another question: how many of you are using any kind of PaaS platform, such as Cloud Foundry? Yeah, pretty much everyone. The questions look different, but I actually asked the same question in two different ways. Hold on to that thought; I'll get back to it in a minute.

It almost feels like yesterday that we all started with PaaS, just trying to wrap our heads around it, and then came CaaS and then CFCR. There's a lot going on. And while all this is going on, we hear a lot of buzz around functions, serverless, FaaS platforms. What's going on? Is this another new shiny thing? Unfortunately, serverless FaaS suffers from the new-shiny-thing curse, though it isn't really deserved. If you look at the timeline, FaaS, the serverless platform model, was introduced back in 2014 by the team behind Auth0: they started Webtask, and there was also hook.io. Those were the first serverless, function-as-a-service platforms, and they supported Node.js and JavaScript functions. A few years later, in 2016, there was a sudden explosion of serverless FaaS platforms, both open source and hosted solutions from all the major players. Since then, it's been growing silently and organically, just like PaaS did in 2011. It soaked quietly for a couple of years, then thought leaders and early adopters like Comcast took it from there. You know the rest is history.

Well, all that sounds good, but what exactly does serverless or FaaS mean? There's a lot of confusion around it. In general terms, serverless sounds synonymous with FaaS, function as a service, and vice versa, but to be precise, FaaS is a subset of serverless. Serverless covers an umbrella of many other things; FaaS is just one part of it. Serverless simply means that the end users of the platform, that is the developers, no longer have to think about or manage servers. In other words, if you're a platform operator or a service provider of a serverless FaaS, you're thinking about and managing servers all the time. Well, that sounds like PaaS, doesn't it? How is it different? You could consider PaaS the first iteration of serverless, and FaaS could be serverless 2.0.
Here's one of the quotes from back in 2016, from Dr. Zules, who says: either serverless is different than PaaS, or you and I have misunderstood the terms. So we can safely say that PaaS is serverless 1.0, where developers still have to think about how many AIs they need or what kind of Docker image they need to push onto the platform. In FaaS, which is serverless 2.0, they don't even have to think about how much capacity they need in advance. It's a little more advanced, that's all; just another iteration.

Well, all that sounds good, but why would anyone go through the trouble of managing a platform and building software in a new way when there's an existing way that works fine? The primary driver and motivation is cost: cost optimization, utilization, efficiency, and agility. The same reasons we started with PaaS. Let's look at a scenario that will make things clearer. This is a snapshot taken from one of our production environments, one of our largest sites. The first chart is the traffic volume, which cycles from 9 a.m. to 9 p.m. every day, and the one below is the total number of application instances, AIs, running on the platform. Just to put things in perspective, there are about 4,000 AIs; that's roughly 400 microservices with 10 AIs each. This is just an approximation to keep the math easy; not all apps have the same number of AIs, some have more, some have less. But what's the first thing that comes to your mind when you see a scenario like this? With the corresponding dip in traffic, did you notice any corresponding drop in the number of AIs? No. So that means there's clearly an opportunity for resource optimization. And we're not talking about 10 or 100 instances; we're talking about 4,000 AIs, for roughly 12 hours at a time, 9 p.m. to 9 a.m. That's a lot, so what do we do with it?

It seems like a no-brainer: we could use an autoscaler that detects the change in traffic pattern and ramps down the number of AIs. More than half of them, about 2,000 AIs, could be recycled, which releases a lot of resources. So the autoscaler sounds like the answer, but the devil is in the details. What happens the next day? At 9 a.m., the traffic kicks in, the autoscaler detects the change in pattern, and it tries to bring up one AI, one application instance, per service. So you're looking at about 400 AIs trying to start simultaneously on the platform, and that can cause an event storm. We have seen this several times. This event storm affects the response times of some of the apps on the critical path, the ones called by downstream apps and customer-facing apps, and there are other apps on the platform that depend on these apps too. Overall it's havoc, chaos, in terms of increased response times. I don't think that's a great situation to be in, especially during business hours.

But wait a minute: adding an autoscaler causes such issues? That sounds odd, doesn't it? Well, let's pretend for now that the autoscaler has no known bugs at this point, and that it works well when it comes to scale. Just pretend.
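By the way, for anyone who hasn't used one, here's a minimal sketch of what a scaling policy can look like, assuming the open-source CF App Autoscaler and its CLI plugin; the app name, metric, thresholds, and instance counts are all made up for illustration, not a recommendation:

    # Hypothetical policy for the open-source CF App Autoscaler plugin.
    # All numbers here are illustrative only.
    cat > policy.json <<'EOF'
    {
      "instance_min_count": 2,
      "instance_max_count": 10,
      "scaling_rules": [
        { "metric_type": "throughput", "operator": ">=", "threshold": 100, "adjustment": "+1" },
        { "metric_type": "throughput", "operator": "<",  "threshold": 20,  "adjustment": "-1" }
      ]
    }
    EOF
    # Attach the policy to an app (assumes the autoscaler plugin is installed):
    cf attach-autoscaling-policy my-app policy.json

The point of the sketch is just the shape of the thing: a floor, a ceiling, and rules that react to a metric, which is exactly where the ramp-up problem above comes from.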
So, what else is left on the plate? Let's take a look at the apps, at what's going on under the hood. Some of the behavioral challenges we observe start with cf push. Again, this is not a generalization, and none of it is intentional; these things slip through the cracks and make it to production. Some pushes take longer than three minutes, which is the default limit set by the platform, and somehow the apps get pushed successfully by increasing the timeouts. Then there are apps running on the platform that take longer than a second to pass the health check, so they continuously crash while the platform keeps trying to bring them up. And then there are apps that don't want to go through this ramp-down, ramp-up cycle, so they set their thresholds way high, like 30 or 50 instances, in order to take the production load. And some apps go to the other extreme: just one or two instances in production. So we've seen all kinds of behavioral challenges with the apps, though none of it is intentional. The other part, as I just mentioned, is that the autoscaler has some bugs; we noticed them, reported them, and they're being worked on with the respective teams. So, in a nutshell, the autoscaler is definitely a great place to start, and I'd recommend using it, but it's not the end of all problems. That's the point I'm trying to make.

Well, now, the thing that is in our control is the app, so let's look at what we can do with the apps to address some of the challenges we discussed a minute ago. We're all already familiar with this; it feels like it started just yesterday: hey, let's break down the monoliths into microservices. Now we're saying: hey, let's break these microservices down into something even smaller. We call them functions; some refer to them as event-driven microservices. You can use the terms interchangeably. Basically, we're breaking the apps into much smaller pieces so we don't run into the issues of CF pushes taking longer and health checks taking longer, and so that scaling with the autoscaler becomes a lot easier, especially when the app starts within a few milliseconds. In fact, one of the requirements we see across all the established FaaS platforms is that a function should start within about 20 milliseconds. This helps mitigate the issues we saw before. And the beauty is that microservices and functions, or event-driven microservices, can coexist, and you can chain functions together, just like the Unix-style piping or chaining we're used to. There are very important applications of this that we can leverage.

So, let's look at some of the serverless FaaS benefits. Obviously, less is more is what we're talking about: breaking these microservices down into much smaller, more manageable pieces. The whole idea of breaking things down is so that we can develop, deploy, and scale them independently and much faster. It's also easier to focus on working on one function than on a whole monolith or service.
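And to make that Unix piping analogy concrete, here's the classic kind of chaining I mean; the log file is hypothetical, but each stage is a small single-purpose program, exactly the way a chain of functions composes:

    # Each stage does one small job; the pipeline composes them.
    # (access.log is a hypothetical web server log in common log format.)
    grep ' 500 ' access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -5
    # -> the five URLs that error out most often, built entirely from tiny tools

None of the stages knows about the others; that independence is what we're after when we split services into functions.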
Here's another example, and it's not perfect, but bear with me. Take a RESTful API, say /accounts. A RESTful API typically contains five functions, or methods: get all with filters (GET /accounts), get by ID, create an account (POST), update an account (PUT), and delete an account (DELETE). If you observe, at least from my experience, for the most part the GETs are called the heaviest, 50 to 60 percent of the time. The POSTs are somewhere in the middle, and DELETE is at the rock bottom, called less than five percent of the time. But still, we package all these functions together and deploy them together, and when it comes to scale, we scale all of them together, even though only the GETs are used heavily. What if we could deploy more instances of the GET function and only a few instances of DELETE? That would optimize resources. That's one example. I know it's not a perfect example; there are arguments against it, like how do you manage the database connections and everything like that. But it's an example to start with.

The most important reason is cost and efficiency. We just talked about 2,000 AIs running on the platform whether there's traffic or not. Reducing that waste translates into a lot of cost savings, which is the new currency, and it could save CAPEX and OPEX in a big way for any enterprise. The other important thing is the way pricing is done. In the PaaS world, pricing is based on the number of AIs that are running. In the FaaS world, you're essentially paying for the actual execution time. Even with the autoscaler, what do you say your AI count is: 4,000 or 2,000? During the day you'll be running 4,000, and during off-hours, say, half that capacity, so the calculation gets tricky. With FaaS, the pricing model is clear: you're just paying for execution time.

But does it really make a huge difference? I went ahead and tried one of the AWS calculators. I tried an AWS Lambda function behind AWS API Gateway and compared it with an instance-based setup, for the same number of calls made over a month. Can you see the cost difference? A dollar versus eight thousand. Isn't that massive? It might not seem like a big deal at the beginning, but look at the cost difference; it's massive. So I think this is the most important factor driving the move towards serverless FaaS, which is not something new; it's what we're already doing, we're just a step behind.
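Just to show the shape of that calculator math, here's a back-of-the-envelope sketch. The workload numbers are invented, and the rates are the publicly listed Lambda rates at the time ($0.20 per million requests, about $0.0000166667 per GB-second), so treat it as illustrative only:

    # Illustrative Lambda cost math; workload numbers are assumptions.
    requests=10000000   # 10M requests per month (assumed)
    seconds=0.2         # 200 ms average execution (assumed)
    gb=0.5              # 512 MB memory (assumed)
    awk -v r="$requests" -v s="$seconds" -v g="$gb" 'BEGIN {
      compute = r * s * g * 0.0000166667     # GB-seconds times the rate
      reqs    = (r / 1000000) * 0.20         # per-million request charge
      printf "compute ~$%.2f + requests ~$%.2f = ~$%.2f/month\n", compute, reqs, compute + reqs
    }'
    # Compare that with a fleet of instances sized for peak and running 24/7.

With these assumed numbers it comes out to roughly $19 a month, and you pay nothing when no requests arrive, which is exactly the point.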
Well, like every other approach, serverless has its pros and cons; it's not the perfect solution. Just because you can break even a RESTful API into functions doesn't mean you should. There are only certain types of applications and workloads that should run on a serverless platform. One of the existing challenges, at least for now, is complexity. Again, this is not something new to FaaS platforms; it's an existing challenge in the PaaS world as well. When you break a monolith into microservices, or into functions or event-driven microservices, how do you orchestrate them? How do they talk to each other? It's always a challenge. But luckily, there are a few open-source projects currently in progress; this is an evolving space. Istio and Envoy, you might have seen those talks, are trying to solve some of these problems of service discovery and sidecars.

The second challenge is the cold start. This is an interesting problem: when the FaaS platform decides there are no requests coming to your container, it tears it down, and whenever a new request comes in, it spins up a new container. What if there's a lag of a couple of minutes? That translates into latency while the container comes back up to serve the request. This was one of the problems when FaaS was just beginning, but I don't think it's a huge problem now. There are different ways to keep functions warm. One is to send requests manually, though I don't think that's the best way. The newer platforms let you configure the interval, how long to keep a function warm before tearing it down. So I think there are ways to work with cold starts.

The biggest challenge is tooling. As I mentioned, this is still an evolving space. How do we debug these functions, given they're so short-lived? How do we monitor them? Audits, logs, integration testing: these are all somewhat elusive at the moment and definitely need some maturity and tooling around them. So those are some of the main challenges when it comes to serverless.

So, is serverless actually being used anywhere, or is it just theory? It is being used; there are already use cases in place, and a lot of companies are using it, at least in our line of business. Comcast is in cable, networking, media, entertainment, and also home security and home automation. Some of the use cases relevant to Comcast's line of business, in terms of IoT integration: we do a lot of media and image analysis so that we can provide recommendations. On the customer service side, we need virtual assistant chatbots and serverless mobile backends, and there are also machine learning workloads. These are some of the relevant use cases. Because when we talk about FaaS, for the most part everybody says, oh, I can use this for batch processing and scheduling tasks and jobs. But that's not the whole picture; you can apply it in many more ways, like the IoT integrations I just mentioned. The other potential use case is home automation and security: smart devices, plugs, bulbs, and smart sensors. We collect all the device telemetry data, analyze it in real time, and build smart homes. You might have seen the Comcast Technology Center; it's one of a kind, and this technology definitely enables smart buildings.

So by now, a quick summary: we started with the existing platform and some of the challenges we're facing, then what's next and how we can address those challenges, and FaaS seems to be a good fit for some of the workloads I just talked about. By now you might be thinking: serverless, right? Why not? Let's give it a shot. Well, there's no shortage of options, as I mentioned. In fact, it's overwhelming to go through the vetting process to find the one that best fits your ecosystem. There are so many options, both hosted and open-source platforms. For the scope of this talk, because it's an open-source conference, I'd like to limit this to the open-source platforms.

The vetting process, the key selection criteria I went through, is nothing new. The first thing I looked at is trending: how many GitHub commits are there, how actively were they made, how many contributors are there, and how many GitHub stars?
I also looked at the number of posts on Stack Overflow, to see whether the community is helping each other and growing. The next thing I looked at is the tools, because right now Docker is synonymous with containers, Kubernetes is synonymous with container orchestration, and Kafka is synonymous with messaging, though they can serve many more use cases. I just wanted to make sure the platform uses a set of standard tools, so that it's easy to learn. The other thing is: while building and managing the platform, is it easy to get started with the existing documentation? So I looked at the online documentation. Is it clear enough to get started? And while managing the platform or running workloads, just in case I have questions, is there at least Slack or email support? That's another thing I looked at. The other thing is: what is the platform written in? Is it Golang or Scala, and what other languages did they use? Last but not least, for all these open-source projects, which companies with successful track records are backing or supporting them? This is the bare-minimum criteria; you can add more, but this is how I started.

Here are the numbers, pretty much all of my study in one place. It looks like just a table, but this was a lot of work. As you can see, OpenWhisk has the highest number of GitHub contributors, and most of them are from IBM. In fact, OpenWhisk is the core of IBM Cloud Functions, and it's an Apache project, fully open source. The other thing you'd notice is that pretty much all of these open-source platforms started somewhere around 2016. And pretty much all of them are written in Go, except OpenWhisk, which is written in Scala. Pretty much all of them use Docker and Kubernetes for packaging and deploying functions, and you can use Helm charts to install almost any of them, except Fission and IronFunctions. Most of these serverless platforms are supported by the Serverless Framework. The Serverless Framework is a fantastic thing: say you wrote your cloud functions for AWS Lambda, and now you want to run them on Azure, or IBM, or even an open-source platform like OpenFaaS. The Serverless Framework, as the name indicates, is a framework, not a platform; it lets you write once and deploy to any of these FaaS platforms. And as you can see, even the number of Git commits made per individual contributor is higher for OpenWhisk. On the technology side, pretty much all of them use Prometheus for collecting metrics; it looks like a pretty standard tool stack across the board.

The other thing you'd notice is Knative, which started sometime last year. It sounds like a newbie, but it has caught up really fast, and it's now comparable with its contemporaries. The reason is who's backing and sponsoring this project: Google, Pivotal, IBM, and Red Hat, pretty much all organizations with successful track records, and it's in full swing. But again, it's still an evolving space. OpenFaaS is also pretty good and comparable; it's backed by VMware. The CLI is easy to use and intuitive, and OpenFaaS has the concept of an API gateway; you can use the CLI to invoke functions synchronously or asynchronously.
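Since I brought up the OpenFaaS CLI, here's roughly what that workflow looks like; the function name and gateway address are made up for illustration, but the sync and async invocation paths through the gateway are the part I want to show:

    # Hypothetical OpenFaaS workflow; names and gateway URL are illustrative.
    faas-cli new echo-fn --lang node          # scaffold a Node.js function
    faas-cli build -f echo-fn.yml             # build the Docker image
    faas-cli deploy -f echo-fn.yml --gateway http://127.0.0.1:8080

    # Synchronous call through the gateway:
    curl -d 'hello' http://127.0.0.1:8080/function/echo-fn
    # Asynchronous call (queued, returns immediately):
    curl -d 'hello' http://127.0.0.1:8080/async-function/echo-fn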
Pretty much all of them use the NATS messaging bus for asynchronous communication. OpenWhisk, I think, has put a lot of focus on scalability and resiliency, but the only thing is that it requires developers or operators to have working knowledge of ZooKeeper, CouchDB, Prometheus, and things like that. And for the most part, I see some duplication, an overlap of functionality, in using CouchDB or ZooKeeper when you're already running on Kubernetes. The other interesting thing I found was Kubeless. Kubeless is built purely on Kubernetes; it does not require any custom CLI. You can just use kubectl to deploy your functions, and it uses custom resource definitions, CRDs. That's one pretty cool thing I found, and it's also the approach Knative takes: Knative uses CRDs as well, and uses Istio and Envoy for service mesh and discovery.

So overall, this is the summary. Luckily there's nothing red here; it's pretty much all yellows and greens. The most influential factor here is the ecosystem. If you're already part of an ecosystem, such as already working on open-source Cloud Foundry, it will be a lot easier to leverage the tools around it. Similarly, if you're already working with AWS tools, it will be a lot easier to use AWS Lambda. So the ecosystem definitely plays a critical role in deciding exactly what you want to go with, and it also helps with the learning curve. When it comes to the languages these platforms support, pretty much all of them use Docker, which means you can package a function written in pretty much any language into a Docker container and push it. So in terms of language support, I think it's okay across the board. Reliability is what I just talked about: are the docs clear enough, is there support via Slack, are these projects backed by organizations with successful track records, and is the GitHub activity coming from an active community? That's what helped me judge the reliability factor. So this is not a recommendation; these are observations. You can start from here when you're trying to choose your FaaS platform.

The other thing I tried was Google Trends. I compared each of the existing open-source platforms, and this is what I found: as I mentioned, Knative is catching up really fast and has been at the top of the charts pretty consistently compared with all its peers and contemporaries. That's pretty interesting.

So with all this study, I'd like to present a demo. I chose Knative for the demo, but the full demo takes at least a couple of hours, so I recorded it and squeezed it into about 10 minutes. Let's walk through it. I chose Minikube, a local Kubernetes, to try this Knative offering and see how simple it is end to end: installation, updates, deploying workloads, testing them, and what kinds of use cases it supports. So at this moment I'm installing Minikube and then checking all the versions with kubectl. Minikube is up and running, all the nodes are up, and there's the cluster status. I'm using kail, the Kubernetes tail, for logs; you can use kubectl logs as well.
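If you want to follow along, the local setup is roughly this; the resource sizes are my assumption, since Istio and Knative are fairly hungry on a laptop:

    # Rough local setup; memory/CPU sizes are assumptions, adjust for your machine.
    minikube start --memory=8192 --cpus=4
    kubectl get nodes            # the node should report Ready
    kubectl cluster-info         # confirm the API server is reachable
    kail -n demo &               # optional: tail logs (kail is a separate tool)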
Now the most important part: going through the Knative install. Knative is installed through custom resource definitions. You can directly download those YAML files and install them, or there's something cool called riff, the project riff by Pivotal; you can just run riff system install and it will do the job for you. The other thing I did, because I'm running locally, is use a NodePort as the load balancer; that's one way to get the Knative ingress when you're using a NodePort service type. When you install Knative, a successful installation means you have the build, serving, and eventing namespaces installed; that tells you the installation succeeded.

Knative uses these concepts called serving, build, and eventing, and they're really cool. With serving, you can run even your stateless microservices, in addition to functions. Eventing is eventing: you can use functions to generate and consume events. And build is another cool thing: they've started supporting the buildpacks that already exist on Cloud Foundry, and you can also use your Dockerfiles to build images. You don't have to do anything; Kubernetes detects it using build templates, such as Kaniko, builds the Docker image, and pushes it to the Docker repo, internal or public, whatever you configured during installation.

Next, this is me trying a blue-green deployment, in comparison to how I'd do it on Cloud Foundry. Same thing: I'm going to create a namespace, and I'm watching the Kubernetes pods. At this point I'm building a simple Docker image that takes an HTTP request and sends a response back, and I'm pushing it to the Docker repo. And here's the important thing: all this code is on GitHub. It's going to be open sourced shortly, so it will be available for you to try later. If you see that, that's all the deployment amounts to: all I did was apply config.yaml, the bare minimum required to deploy any microservice. As you can see, all the pods are up and running and the deployment is complete; the pods are coming up three by three. And if you notice, there are these things called revisions, configurations, and routes. The route has a URL you can curl, and that's what I'm doing. Right now I've deployed version one; the code is on GitHub, and my Docker image is pushed to Docker. I didn't do anything, right? It's all the magic.

As you can see, now I've moved completely to version two. If you notice, I'm running this in a loop, and this is what I meant by cold start: usually with these functions, when you don't invoke them and leave them idle for a minute, the platform tears them down, so you can't really make a request; when you make a fresh request, it takes a little while until the container comes up. So for the sake of the demo, I just need to keep these functions warm, and I'm running requests in a loop. And look at that: you can still deploy your stateless microservice, you can still do your blue-green deployment. Not only that, we also tried the weighted routing, splitting the traffic between version one and version two. I found that really cool without doing much work; it's pretty much like the latest CF demo we saw yesterday.
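To give a feel for what that config.yaml and the keep-warm loop might look like, here's a minimal sketch. The service name, namespace, image, domain, and the ingress NodePort are all assumptions, and the manifest uses the v1alpha1 runLatest shape Knative Serving had around that time:

    # Hypothetical config.yaml for a Knative Service (v1alpha1-era schema).
    cat > config.yaml <<'EOF'
    apiVersion: serving.knative.dev/v1alpha1
    kind: Service
    metadata:
      name: helloworld
      namespace: demo
    spec:
      runLatest:
        configuration:
          revisionTemplate:
            spec:
              container:
                image: docker.io/example/helloworld:v1   # assumed image
    EOF
    kubectl apply -f config.yaml

    # Keep-warm loop: curl through the NodePort ingress with the Host header
    # Knative routes on (IP, port, and domain are assumptions for a local setup).
    INGRESS="$(minikube ip):32380"
    while true; do
      curl -s -H 'Host: helloworld.demo.example.com' "http://$INGRESS/"
      sleep 10
    done

One apply gives you the deployment, the revision, the configuration, and the route; that's the "I didn't do anything" part of the demo.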
So the next thing I want to talk about is how all this magic happens: the buildpacks and build templates that Knative supports. Knative uses the Kaniko build template, for now, to build Docker images, and it can also use any of the existing Cloud Foundry buildpacks. You just push the code; Knative determines from the code whether it's Java or Node, looks for the corresponding Cloud Foundry buildpack, and uses it to build. That's another thing I found really cool. So I'm watching for the pods. As you notice, there's just a Dockerfile; that's all I provide. And that's the custom resource definition I have to apply one time, called kaniko.yaml. The most important thing you'll notice here is that I'm creating secrets and service accounts so that I can connect to the Docker repo; I'm leveraging the existing Kubernetes secrets and service accounts. Then I get the Knative ingress the same way, because I'm using NodePort. The code is on GitHub, and as you can see, the image is pushed, and now I can curl it using host headers.

And here you can see the buildpack path: now I'm using a CF buildpack instead of Kaniko. As you can see, there's no Dockerfile here; this is a Node.js app. Let's see whether it builds it or not. Same thing: I create secrets and service accounts, create a different namespace, and set the Knative ingress. There you go, you can see the pods coming up. So now the entire application was built using the Node.js buildpack; the resulting image is pushed to the Docker repo, and the code is here. Basically, all it's doing is pulling the source code from the GitHub repo, checking whether it's a Node.js or Java app, building the Docker image, and deploying it.

This last part uses riff, and it's the most powerful use case I tried. I created a function that takes messages from a queue, encodes them, and puts them on another queue called encoded messages, and another function that decodes those messages and displays them. This is a powerful abstraction of the kind of use case we can leverage with Knative. As you can see, I pushed Java code, so I'm able to encode the message I'm sending. And you can see riff detecting the Cloud Foundry buildpack and building it. You can already see the pods for the message encoder; now the decoder is coming up. The other interesting thing to notice is the message channels: plain messages and encoded messages, those are the queues, or channels. All right, so I'm creating a channel, then a subscription, and as you can see, all those pods are available. You have a messaging channel, you have a subscriber, and you have functions you can attach to them. And this is the thing I was talking about, cold start; I keep these warm too. All the code is on GitHub; these are Spring Boot apps, a decoder and an encoder, and all the Docker images were pushed automatically using the secrets and service account. Now, for the sake of time, all I did was connect to one of the pods within the cluster and post messages on the plain messages queue. As you can see, the encoder reads a message, encodes it, and puts it back on encoded messages, and the decoder reads that and decodes it back into the plain message. That's it. So thanks for attending, and thanks for watching the demo. Thank you.