Hello and welcome. Today we'll be talking about running machine learning workloads on a service mesh. My name is Harsimran Singh Maan. I work on the machine learning engineering team here at Splunk, and I'm based out of Vancouver, Canada. My interests include distributed systems, particularly their scalability aspects, and I have some experience with multi-cloud as well as security in the cloud. I try to contribute to open source as and when I can, and I use open source components in my day-to-day job to solve everyday problems as well. You can find me on GitHub, LinkedIn, Twitter, or Medium under my username, harsimranmaan.

Today we'll start by going over some background on the kind of problems we are trying to solve here at Splunk and how a service mesh helps us achieve some of our goals better and at scale. We'll also talk about the journey we went through, the transition we made adopting a service mesh into our workflows, and the benefits we got as a result. We'll also go over some of the obstacles and challenges that we ran into during the implementation. I'm sure many of you will have run into similar problems, so we'll look at some of the techniques we used, and I hope they'll be helpful in other scenarios when you try to move existing workloads onto a service mesh.

To give you some background, about a year ago we decided to re-envision our machine learning offering here at Splunk. Splunk has grown massively over the last few years, and customer demand for more intelligent systems has grown as well. To cater to that demand, we re-envisioned our machine learning offering with a few clear goals for the project. First, we wanted to run at Splunk scale. Those of you who use Splunk systems know the kind of scale Splunk operates at, so we wanted our machine learning offering to be on par and able to handle the data that is thrown at Splunk systems. Second, we wanted to build a system that lets data scientists come in, experiment with the data, and build their machine learning models, and we also wanted to help them operationalize those models, that is, take the models and deploy them in production. Third, we wanted a system with a low barrier to adoption. We did not want to start from scratch and create yet another machine learning system; we wanted to see what people are comfortable with and what tools they're already using, so we could build upon those tools. While doing that research, we looked at a few options in the open source world and narrowed our focus to Jupyter notebooks. They came across as a very convenient way for data scientists to experiment with machine learning algorithms, to try out a few things, and to do the different kinds of data manipulation and preparation that machine learning requires. So we decided to take that as the base and build our offering around that project.

Before we jump into where we are here at Splunk, let's start by going over a simple definition of machine learning. I stole this from Wikipedia: it is the study of computer algorithms that learn through experience. What it really translates into is this: we start by defining a problem and then source some data for the problem we're trying to solve. We massage this data a bit and then run some algorithms on it.
That's the modeling part, and once we get a model as a result, we do some evaluations on it. At this point, we might decide to go back, rework the data, and retrain the model. Once we are satisfied, we take this model and deploy it to production, where it can now do really smart things with data it has not seen before.

So what does it really look like in practice? Here at Splunk, we ended up with a polyglot environment when we built our machine learning platform. We started building it around Jupyter notebooks, but soon we were writing services in Scala. We had components in R, a lot of Go services, and Python is heavily used for the experimentation side as well as for data preparation. We use Apache Flink to parallelize our data processing jobs. We run on top of Kubernetes using an Istio service mesh, and all of this is closely tied to Splunk, which is our central data repository. So we have data flowing between all of these different components, and we run our machine learning models on top of this data.

But it wasn't always like this. We started off with a very simple JupyterHub deployment. JupyterHub is a project that helps run Jupyter notebooks for multiple users; we run it on Kubernetes. It has two main components: a proxy and the hub. The hub takes care of things like authentication and handling user requests. The proxy takes care of routing traffic to the hub or to the user notebooks, depending on where the request needs to go.

To better understand JupyterHub, let's do a deployment on a Kubernetes cluster. I have a Kubernetes cluster running that we can use to deploy JupyterHub. Let's look at the different components of JupyterHub that we'll be deploying today. This is a project that helps deploy JupyterHub; you can find it on my GitHub, and links are shared in the slides. Like you saw in the diagram, we have two components, a hub and a proxy. We have an additional component here called namespace that creates the Kubernetes namespace for us. Let's look at all the components that are going to be deployed to the cluster. We have a config map, a deployment, a secret, and some RBAC-related components for the hub. For the proxy, we have a couple of services, one called the public service and the other called the API, and a deployment component.

Let's do the deployment. As you can see, all these components are being created in the cluster. Let's take a look at all the new components that we just created. Here you see we have three services, two pods, a couple of deployments, and their replica sets. Let's see what it looks like in the browser. For that, I'll need to port-forward to the proxy public service so that we can access it in the browser. If I go to the browser and open up this URL now, I should see a login page. I can log in here, and you'll see that a new pod is being spawned and we get a Jupyter notebook. Let's see what it looks like in the backend. Let me list all the resources once again. As you see, there's an additional user pod. This is what is responsible for serving that notebook UI in the browser. All the requests that you see in the browser are being routed through the proxy public service. When I did the login, the request went to the hub service, and once the user's notebook was created, all the requests started going directly to the user's pod. I hope the demo makes understanding this diagram a little easier, but you all must be wondering by now, what about service mesh?
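To make the port-forwarding step concrete, here is a minimal sketch of what the proxy public Service could look like; the names, labels, and ports are assumptions for illustration, not the exact manifests from the linked repository.

```yaml
# Hypothetical sketch of the proxy "public" Service used in the demo.
# Names, labels, and ports are assumptions, not the repository's exact manifests.
apiVersion: v1
kind: Service
metadata:
  name: proxy-public
  namespace: jupyterhub
spec:
  selector:
    app: jupyterhub
    component: proxy
  ports:
    - name: http
      port: 80          # browser traffic enters here
      targetPort: 8000  # configurable-http-proxy listens on 8000 by default
```

With a Service like this, something along the lines of `kubectl port-forward -n jupyterhub svc/proxy-public 8080:80` exposes the login page at `http://localhost:8080` for the demo.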
That's the exact question the Splunk security team asked us when we tried to move this service to production. All cloud services in Splunk are required to ensure end-to-end network traffic encryption, and for us, this was the foray into service mesh. A service mesh, as described by the Istio docs here, is used to understand the network of microservices, how they interact amongst themselves, and how traffic flows between one service and another. When we got this requirement, the first thing we did was go out to the community and see what options we had, and this is what we found: there was an open request in the JupyterHub Kubernetes community to support securing connections between the different JupyterHub components using Istio. That's exactly the kind of problem we had on our hands. I did some reading and thought I was ready to enable Istio on the namespace and things should just magically work. But our life as engineers is not so easy. Everything broke when I enabled Istio on the namespace, so I had to learn how to troubleshoot the service mesh. Every time I solved one problem, I would run into the next roadblock, but slowly all the pieces came together into a working application.

This is what the application evolved towards. There were a few issues the moment we enabled Istio on the namespace. The first issue was the proxy itself. It couldn't route traffic to the hub or to the user pods anymore, because there was a header mismatch on the outgoing requests that the Istio sidecars didn't like; they would simply drop the request, thinking there was no destination route for it. I fixed this by rewriting the proxy to override the host headers appropriately, and things seemed to work. The next issue showed up the moment a user would log in and try to spawn a new notebook pod for themselves: it wouldn't work. The reason was that the hub was not able to connect to the pod to check whether it was alive. This caused it to think that pods were not being spawned, and it got stuck in an infinite loop. To mitigate this, we introduced a Kubernetes service to front the user pods. This required patching the hub itself, so that all the traffic from the hub to the user pods would flow via Kubernetes services.

We had a working version, we deployed it, and we gave access to our users. Everything worked fine for the first few days, but as the number of users grew and usage went up, users started complaining about connectivity issues with the service, and we were getting more and more reports of issues like these. When I did the troubleshooting, I found that the root cause for all of them was the proxy itself. It was not able to handle so many user requests and was failing to process them. And because of all the hacks we had done to get the system to work, it was not possible to add more proxy pods to handle more traffic. So we had to go back to the design board once again, and this time, when doing the analysis, we looked at the service mesh for our scalability needs. We ended up removing the proxy altogether and replacing it with one simple API. All the traffic would now come in through an Istio gateway and be routed to the user pods or to the hub itself based on the kind of request. We introduced a new service, compliant with the hub's proxy API, to configure Istio. And all of this was done in a couple of days.
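As a rough illustration of the per-user Service we introduced to front the notebook pods, a generated manifest could look like the sketch below for a hypothetical user "alice"; the name, selector label, and port are assumptions, not the exact objects the hub creates.

```yaml
# Illustrative per-user Service fronting a notebook pod (hypothetical user "alice").
# The naming scheme and selector label are assumptions for this sketch.
apiVersion: v1
kind: Service
metadata:
  name: jupyter-alice
  namespace: jupyterhub
spec:
  selector:
    hub.jupyter.org/username: alice   # assumed label on the spawned notebook pod
  ports:
    - name: http
      port: 8888          # default Jupyter notebook server port
      targetPort: 8888
```

Routing hub-to-user traffic through a stable Service name such as `jupyter-alice.jupyterhub.svc.cluster.local` gives the Istio sidecars a known destination to match against, instead of a bare pod IP.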
I don't think this would have been possible had we not relied on what the community has built in terms of service mesh and all the cool features it ships with. So this is what it really came down to in the end. When the hub starts, it registers itself as one of the routes on the Istio gateway to handle all the traffic coming to JupyterHub. Once this route is ready, users can come in and start requesting their server pods. When a new request comes in to register a server pod, the hub creates the user pod and also creates the corresponding Kubernetes service. Once it knows that it can talk to this user pod via that service, it goes to the proxy API, which is the JupyterHub Istio proxy in this case, to register a new route. Once this route is created and ready to use, it is returned to the hub, which then redirects the user to the user pod. So the architecture is very similar to what it was before, except that we replaced a few components and got the system working on the service mesh. With that, we got the mutual TLS and the scale that we needed.

Let's see all of this in action. We are going to go back to the application that we deployed earlier and enable Istio on the namespace. Before we do that, let's take a look at all the changes that we'll have to make in order to get this going. kubectl diff allows us to see all the changes before we deploy them. Here you can see we have a few components that would be changed when we do the deployment. As we talked about earlier, we no longer need the proxy public component, as the Istio Gateway takes care of routing the traffic now. We have some changes to the ports for the proxy API; this is just to satisfy the port requirements for the new service. We are also updating the hub with some routing information to tell it how traffic is routed to it and how it should behave with respect to incoming requests. On the proxy itself, we are passing in some additional configuration: the name of the new Istio Gateway, the namespace where the new virtual services will be created, and some other configuration that we need. We also update the proxy command to be the entry point of the new application instead of the old proxy that came bundled with JupyterHub originally. We are also able to make some additional changes, like how we restart the pods, which were not possible before we moved our implementation to the new component. We are deploying an Istio Gateway component; this intercepts all the requests coming in to the gateway and routes them appropriately using the virtual services and the destination hosts that we will configure on these services using the new proxy API. We are also deploying a PeerAuthentication component to the namespace to ensure that all components have mutual TLS enabled, and we are using it in strict mode, which means that all components must use mTLS when talking to each other. The last change we are making is to enable Istio injection on the namespace. This means that all the pods that start in the namespace will now have the sidecars injected into them.

Let's go ahead and deploy this change. All these components are now updated. As you can see, we deleted one service, created a few new components, and updated some others. Let's take a look at all the components that we have here. As you can see, our pods are now running with 2/2 containers.
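For reference, here is a minimal sketch of the two Istio resources just described, a Gateway for inbound JupyterHub traffic and a strict PeerAuthentication policy; the resource names, hosts, and port are placeholder assumptions rather than the project's actual configuration.

```yaml
# Minimal sketch of the Gateway and strict mTLS policy discussed above.
# Resource names, hosts, and ports are placeholders for illustration.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: jupyterhub-gateway
  namespace: jupyterhub
spec:
  selector:
    istio: ingressgateway        # bind to Istio's default ingress gateway pods
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"                    # demo-only; restrict hosts in production
---
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: jupyterhub
spec:
  mtls:
    mode: STRICT                 # every workload in the namespace must use mTLS
```

Sidecar injection itself is switched on with a namespace label, typically `kubectl label namespace jupyterhub istio-injection=enabled`.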
What this means is that there are two containers running inside every pod: one is the actual application container, and the other is the Istio sidecar that takes care of all the networking bits for the pod. We also removed one of the services, and there are no additional components as of now. Let's also take a look at any virtual services that might have been created as a result. As we see here, there is one virtual service already created, and if you remember from the diagram, when the hub starts, it registers itself as the destination for all the traffic coming to that route. Let's see what this rule looks like. If we describe this virtual service, we see that the destination is the hub service and we are routing all the paths on this route.

Let's go to the browser and try to use JupyterHub now. We did not need to do any port-forwarding because we are talking directly to the Istio Gateway, which is running on localhost in this case. I can log in, you'll see the server being spawned, and the user gets their notebook once again. Let's go back and take a look at what is going on under the hood. Let's take a look at all the resources that are now running in the namespace. We have the new user pod running. As you can see, this also has two containers, one of them the Istio sidecar and the other the notebook server itself, and we also have a new service to front the user pod. So every time the hub spawns a new user pod, it also spawns a corresponding service. This allows the hub pod to talk to the user pod and ensure it is up and running before it starts configuring the network on the Istio Gateway through the proxy, which is now just an API that listens to the hub and makes calls to the Kubernetes API to create new Istio virtual services. We can also take a look at the virtual services. As we can see here, there is a new Istio virtual service that was created. Let's describe it. This looks very similar to the one we saw earlier; the only difference is the URI prefix and the destination. So when a request comes in for /user/<username>, it routes the traffic to the user service, which is running under jupyter-<username>, as we saw in the list of components.

What this allows us to do is delegate all the network scalability needs to Istio. We use the sidecars for mutual TLS, so whenever one component talks to another, all the traffic between them is encrypted, and we use the Istio Gateway and virtual services configured with appropriate destination hosts to route the traffic. This allows us to run our services at scale by delegating the network scalability needs to the Istio service mesh. Before we go back to the slides, let's quickly verify that our traffic is indeed routed through the sidecars. We are going to tail the logs of the istio-proxy container inside the pod. As you can see here, the sidecar intercepts the outgoing requests from this pod. The pod is continuously sending updates to the hub service, and all those events are logged here in the sidecar. At this point we were really comfortable working with the service mesh. We were now looking at ways of using it to solve some of the other problems that we had in our system, and this is where compliance and audit come in.
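To make that routing rule concrete before moving on, a per-user VirtualService created by the proxy API could look roughly like this sketch for a hypothetical user "alice"; the resource name and the gateway reference are assumptions, not the exact objects generated in the demo.

```yaml
# Rough sketch of the per-user routing rule described above (hypothetical user "alice").
# Names and the gateway reference are assumptions for illustration.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: jupyter-alice
  namespace: jupyterhub
spec:
  hosts:
    - "*"
  gateways:
    - jupyterhub-gateway           # the Gateway handling inbound JupyterHub traffic
  http:
    - match:
        - uri:
            prefix: /user/alice/   # requests for this user's notebook server
      route:
        - destination:
            host: jupyter-alice.jupyterhub.svc.cluster.local
            port:
              number: 8888
```

The hub's own catch-all route works the same way, just with a broader path prefix whose destination is the hub service.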
So, once we had the service running for a while and we had it on mutual TLS, we wanted to make sure that we do not regress on these things; we needed to put checks in place to ensure that our services are compliant and stay compliant. This is a Prometheus metric for Istio that we continuously collect from the system, which tells us the state of mutual TLS. Here you can see the connection security policy is set to mutual TLS, and we can set up alerts on this to ensure that the policy is enforced all the time (a sketch of such an alert follows below). We can look at the services that are involved and also where this metric is being collected. It is important to know that the connection security policy is reported correctly on the reporter type "destination". For every request there is a source and a destination. If you look at a request reported from the source, where the reporter is equal to "source", you might not see the correct connection policy, because at that point it is not known whether the connection will be a mutual TLS connection or not. So it is always better to look at the destination reporter.

We had started this project with a very small team, but as the project scope expanded and more and more engineers joined, we needed to be able to iterate fast on the product. This meant that we needed to deploy, and deploy fast, and allow engineers to test their features in a production-like environment, and also to demo these features to the stakeholders. We again used features from the service mesh to build out these capabilities. We extended what we had built around Istio virtual services and destination hosts to allow selective routing per deployment. This allowed every engineer to deploy their component and test it in a production-like setting.

A service mesh also gives you additional benefits around things like rate limiting. If you recall the diagram we shared earlier on the polyglot of services that we have, it is really hard to go and implement features like rate limiting and retries on each and every service. With a service mesh, you can retry failed network requests, you can put rate limits in place so that one service does not abuse the system, and you can also do interesting things like fault injection, which lets you inject faults into the system and see how it behaves. All of these combined make for a really pleasant engineering experience, because you are focused on delivering features and not worried about building a lot of infrastructure that you can get from a service mesh. Another important gain from using a service mesh is around telemetry. As all the traffic now flows through the service mesh, we can easily see what is going on in the system: which service is talking to which other service, how many requests a service is making, how many requests are failing, how much data we are transferring, and things like that.

While moving to a service mesh wasn't our first choice, it turned out to be a blessing in disguise in the end. In conclusion, I would like to say that a polyglot environment needs a service mesh. The learnings you have from putting one service on a service mesh are transferable to all the other services you are running in the system. Although troubleshooting can be a little tricky and you can run into some really challenging scenarios with the network, once you are past that, operating your services at scale in a secure manner becomes really, really easy.
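For readers who want a concrete starting point for the compliance check mentioned above, the sketch below shows one way to alert on non-mTLS traffic using the standard istio_requests_total metric; it assumes the Prometheus Operator's PrometheusRule resource, and the rule name, namespace, and threshold are assumptions.

```yaml
# Hedged sketch of an mTLS compliance alert; assumes the Prometheus Operator is installed.
# Rule name, namespace, and threshold are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-mtls-compliance
  namespace: monitoring
spec:
  groups:
    - name: istio-mtls
      rules:
        - alert: PlaintextTrafficInMesh
          # Count requests seen at the destination sidecar that were not mutual TLS.
          expr: |
            sum(rate(istio_requests_total{reporter="destination",
              connection_security_policy!="mutual_tls"}[5m])) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Non-mTLS traffic observed inside the mesh"
```

An alert like this firing would signal a regression from the strict mTLS posture reported by the connection_security_policy label at the destination.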
If you are not running a service mesh in your Kubernetes cluster, I strongly encourage you to give it a try. It can help ensure that your engineers are focused on building the features your customers want, that you have a very good understanding of your network security, and that you are able to scale your network. If you are interested in learning more about the nitty-gritty of how to troubleshoot an Istio service mesh, I encourage you to check out this repository on GitHub, which has links to a blog detailing all the steps we went through while troubleshooting and the challenges we ran into. I am sure you will be able to use some of these learnings to move your own workloads onto a service mesh. Thank you.