Hello, everyone. Today I will be speaking on dynamic and context-aware security policies for cloud using a service mesh. This is work done by me and my teammate Julian Stephen. We both work at IBM Research, New York, USA, and Julian will be joining us through a video recording as well.

Containers are the new normal, and WebAssembly is the new future. As organizations shift from monolithic to microservice architectures, containers are increasing in number. And as containers increase in number, so do the security standards for those containers. As security standards become more demanding, so does the need for more complex policies to protect the cloud.

There are four Cs of cloud native security: we approach cloud security through the four dimensions of code, container, cluster, and cloud. At the code level, we ensure that security has been practiced at the implementation level. At the container level, we ensure security in terms of how images are loaded during deployment. At the cluster level, we focus mainly on Kubernetes, and we talk about different aspects of security such as network policies, pod security standards, inter-service authorization and authentication, and more. In this talk, we will focus mainly on inter-service authorization, but the solution we present could be generalized to other aspects such as network policies and pod security standards. At the cloud level, security is ensured at the infrastructure level; for example, AWS or IBM Cloud would ensure the security of the infrastructure.

Why do we need inter-service policies? The major reason is that vulnerabilities and attack patterns keep increasing. Every day we adopt new third-party libraries, which leads to more vulnerabilities and changing attack patterns, and the attacks evolve every day. What inter-service authorization helps with is preventing lateral movement of data; for example, we can protect sensitive data. In addition, we can limit the damage done during container break-ins, and we can apply rate limits on services, for example on how much sensitive data one service can access through another service.

But if these policies are so useful, why don't we see them everywhere? What is the challenge in applying inter-service policies? The major reason is complex manual intervention. Since we are talking about microservices at a very large scale, policies need to be implemented at each service level, and doing that by hand is very cumbersome for a chief security officer. Apart from that, with the dynamism of the cloud, policy parameters keep changing over time, and relying on manual intervention becomes a major problem. It is also very difficult to identify thresholds for these policies, and if some security standard changes within the organization, the policies again have to be updated manually, which is very cumbersome. In addition, we need very fast in-path evaluation: whenever an access is made, it goes through a cycle of policy evaluation, and if that is slow, the whole process becomes non-performant.
So far we have discussed the motivation behind inter-service policies. Next, let us talk about the solution we are looking for. We have two major goals: easy policy specification and practical policy evaluation. For easy policy specification, we want policies to be implemented at the service level, and we want an automatic way to implement them. In addition, as the cloud changes, we want our policies to adjust to the changing cloud context, and we want our solution to be generic across clusters. On the practical side, we aim for centralized administration, so that we tweak one parameter and it is reflected in policies across the whole cloud. We also aim for fast policy evaluation, so that these policies are highly performant, and the policies should provide defense against the real-time threats that arrive every day.

The way we approach this solution is through the Istio service mesh, Open Policy Agent, metrics, and machine learning. Before we dig deeper into the solution, let's first talk about the Istio service mesh. Istio is a platform to manage, connect, and secure microservices across distributed platforms. It is widely popular, with more than 12,000 users, mainly because it can be applied in diverse cloud environments. How does it work? It is divided into two parts: a data plane and a control plane. Within the data plane, each service is deployed with a sidecar. These sidecars are Envoy proxies: highly performant, intelligent C++ proxies that are responsible for controlling and mediating traffic between services, and for collecting and reporting telemetry logs. The control plane is the controller of these sidecars, responsible for discovery, configuration, and certificates for the Envoy proxies. So through the Istio service mesh we get the advantages of telemetry and traffic management. We also get extensibility and security: by extensibility, we mean we can extend the functionality of the service mesh, that is, create custom plugins and extend the policy enforcement path; by security, we mean identity and authorization services.

Next, Open Policy Agent. OPA is a general-purpose policy engine; it is open source and widely popular, and it unifies policy enforcement across the stack. The major function of OPA is that it decouples policy decisions from policy enforcement. A policy is written in Rego, a high-level declarative language, and the context for the policy is provided through the data section. Whenever a request is received by a service, the service queries the OPA server; the OPA server evaluates the request against the data and the policy and returns a decision. Because OPA and Rego are both domain-agnostic, they are widely popular.
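To make that request and decision flow concrete, here is a minimal sketch of a service querying OPA's HTTP Data API; the OPA address, the policy package name "authz", and the input fields are illustrative assumptions, not details from the talk.

```python
# Minimal sketch: querying an OPA server's Data API for an allow/deny decision.
# The URL, the "authz" package, and the input fields are assumed for illustration.
import requests

OPA_URL = "http://localhost:8181/v1/data/authz/allow"

def is_allowed(source: str, destination: str, method: str) -> bool:
    payload = {"input": {"source": source, "destination": destination, "method": method}}
    resp = requests.post(OPA_URL, json=payload, timeout=1.0)
    resp.raise_for_status()
    # OPA replies with {"result": true} when the policy allows the request;
    # "result" is absent when the rule is undefined, so default to deny.
    return resp.json().get("result", False) is True

print(is_allowed("productpage", "reviews", "GET"))
```

In practice the same query can be issued by a sidecar proxy on the service's behalf, which is the pattern described next.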
Now my teammate Julian will join us to explain the solution architecture. Hi, hello everyone. This is Julian Stephen. Now that we have talked about Istio and OPA, I would like to explain how we can build on that background toward our goal of dynamic, context-aware policies. Towards this, we will look at our high-level architecture, the relevant metrics that can be captured and curated, and the actual policy enforcement itself. Finally, we will look at how we keep these policies current, that is, relevant to the current context of the environment.

Let us build on the Istio basics we looked at, and focus on the shaded region here for a moment. It is now a common and easy pattern to route the service traffic that comes into the workload's sidecar proxy to an authorization server. This authorization server can look at each individual request in detail in order to make an authorization decision. For example, if it is an HTTP request for an API, the auth server can decide whether the request should be allowed based on a host of parameters: some of those parameters could be in the request itself, and some could be based on the context we spoke about. Being a decoupled policy server, OPA can play this part very well. We will go into the details of each of these in the next few slides.

In the beginning, we also spoke about the need for policies to be simple and to adapt to the environment. We achieve this by incorporating what we call a context server, shown in the purple box here. It is the job of the context server to tune the policy according to the current state of the environment. Not including all the messy, fine-grained details in the policy definition also helps keep the policy nice, simple, and clean.

Finally, to the left: what is this context? How do we get it? How do we identify what context is relevant for policies? Fortunately, current cloud systems offer substantial telemetry about events that occur within the cluster, and if we add the additional layer-7 details that we get from the sidecar proxies, we can get a pretty good sense of what is really going on. In my experience, these metrics are rarely exploited beyond creating beautiful Grafana dashboards showing node statuses and the like, but we can do more with them. We can query the metrics required for our policies from the metrics server, Prometheus in this example, and we save what is relevant in a few different forms; we will talk about the details in the next couple of slides. All of this becomes the base, and the context server uses this base to provide the appropriate information to the policies when needed.

Now let us look into the details of each of the boxes that we saw, starting with metrics collection: what kind of metadata can we reasonably assume to have, and how do we save it? By default, without any additional configuration, the Envoy proxies export a standard set of statistics. This typically includes statistics about sources and destinations, the volume of data sent and received, and so on. There are also Envoy access log filters that can give you additional information like user agents, HTTP request and response codes, connection termination details, and more. And remember, all of this comes with no additional overhead to the application developer; it is all taken care of as part of the infrastructure setup. In addition to these standard metrics, depending on the kind of application, we can get metadata with more semantics by using application-specific proxies. There are already proxies for MySQL, Redis, and more. HTTP proxies can give you details of the URL, the request and response, the request's source principals, details of JWT tokens, and so on.
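As one concrete way to pull the standard telemetry just described, here is a minimal sketch that queries Prometheus for Istio's standard istio_requests_total metric, aggregated by source and destination workload. The Prometheus address and the aggregation window are assumptions about the deployment, not details from the talk.

```python
# Minimal sketch: pull inter-service request counts from Prometheus to build context.
# "istio_requests_total" is Istio's standard request metric; the service URL is assumed.
import requests

PROM_URL = "http://prometheus.istio-system:9090/api/v1/query"

def request_counts_by_pair(window="1h"):
    """Return {(source_workload, destination_workload): request count over the window}."""
    query = (
        "sum by (source_workload, destination_workload) "
        f"(increase(istio_requests_total[{window}]))"
    )
    resp = requests.get(PROM_URL, params={"query": query}, timeout=5)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {
        (r["metric"].get("source_workload"), r["metric"].get("destination_workload")):
            float(r["value"][1])
        for r in results
    }
```

Results like these are what the context server curates and summarizes into its own knowledge base, as described next.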
Coming back to the application-specific proxies: you can imagine that if your application is using MySQL as a backend, for example, then the MySQL proxy can give you the actual queries the application is issuing against the database. So if you want to enforce policies specific to a kind of data being used, PII data or financial data, for example, this should take on new significance for you right now: you can have one set of policies for PII data and another set for financial data access.

Now, these metrics can always be pulled from a standard metrics server like Prometheus with an appropriate configuration, but we found that it is easier to pull the specific data we want and store it ourselves, for a variety of reasons. Part of this is performance: being able to pull the relevant data already summarized in our knowledge base lets us query much smaller amounts of data much faster. Another part is practicality: time-series databases typically have retention policies much shorter than what we would like, and if we were to set a longer retention time, performance is often the cost we would have to pay.

Now that we understand what kind of metadata we have access to, let us look at what kind of policies can be enforced and how. We hinted at the solution already in some of the earlier slides, but let's take a deeper look. If you look at the diagram on the right, we have service traffic coming in, denoted by number one here. This request is first intercepted by the sidecar proxies and then forwarded to the authorization server, which in our case is an OPA gRPC service. All of this is pretty standard, and you can deploy these pieces very easily with boilerplate demos. The OPA server then makes a decision whether to allow or deny the request, and the decision is sent back to the proxy; that is number four in the picture. The proxy either replies to the source with a 403, saying the request is not allowed, or, if the policy decision is to allow, the request is forwarded to the workload port for actual servicing.

What we are doing that is a bit more exciting is that when OPA makes this decision, it can rely on current environmental cues. In other words, the decision can be made in the context of the current environment status. This environment status could be anything from past access rates, to application or service behavior patterns, to workload characteristics like CPU and storage. For example, the number of requests serviced in the past hour may have a bearing on allowing or denying the current request. Other interesting policies are based on the context server automatically figuring out common service call patterns and disallowing service calls that do not fit those patterns.

We also talked briefly about making policies simpler; let us take a moment to look at an example of that. Imagine we want to enforce something like a least privilege principle for all service-to-service calls. That is, we want a policy which makes sure that only the services that absolutely need to call one another as part of the design are allowed to do so. If I were to implement such a high-level requirement by hand, I would have to sit down and write policies that look like: my Node.js front-end application can only talk to the backend API server, that is policy one; my backend API server can only talk to my database and nothing else, that is policy two; and so on. This means creating these mappings one by one, which is a tedious and time-consuming process. But we can leverage the metrics we have already collected to create these mappings automatically, and once we have them, we can inject them as dynamic data into our OPA server. This is a good practical example of using dynamic context to our advantage, in this case to enforce least privilege access for all services in the cluster. The Rego policy itself becomes very simple: it just says that only services defined in our mappings can communicate with one another, and the mappings are auto-generated. This keeps the policy simple and easy to write, and it also makes the Rego part very generic: from the perspective of a security officer or CSO, these least-privilege Rego policies can be reused in as many clusters as needed.
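Here is a minimal sketch of that idea: derive a destination-to-allowed-sources map from the traffic pairs we have already observed, and push it into OPA as dynamic data so the generic Rego rule never has to change. The ls_map data path and the use of OPA's HTTP Data API (rather than a gRPC push) are illustrative assumptions.

```python
# Minimal sketch: auto-generate least-privilege mappings from observed traffic
# and inject them into OPA as dynamic data. The /v1/data/ls_map path is an
# assumed location for the mappings, not taken from the talk.
import requests

def build_mappings(pair_counts):
    """pair_counts: {(source_workload, destination_workload): request count},
    e.g. the result of the Prometheus query sketched earlier."""
    mappings = {}
    for (source, destination), count in pair_counts.items():
        if source and destination and count > 0:
            mappings.setdefault(destination, []).append(source)
    return mappings

def push_to_opa(mappings, opa_url="http://localhost:8181/v1/data/ls_map"):
    # PUT to OPA's Data API creates or overwrites the document at that path,
    # making it available to policies as data.ls_map
    resp = requests.put(opa_url, json=mappings, timeout=2.0)
    resp.raise_for_status()

observed = {("istio-ingressgateway", "productpage"): 120,
            ("productpage", "reviews"): 118,
            ("productpage", "details"): 118}
push_to_opa(build_mappings(observed))
```

A generic Rego rule can then simply check that the request's source appears in data.ls_map for the request's destination, independent of which cluster it runs in.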
Finally, there is one last performance trade-off I want to mention. In this example, the OPA server was deployed as an external service. We could also run it as part of the workload sidecar itself. The trade-offs are obvious: we get faster policy response times, but we have to manage our policies more carefully, making sure the policies related to different workloads are located in the correct OPA servers.

Finally, let us look into the details of the context server itself. We have described many of the relevant pieces already, so I will be quick on this slide. We curate context by collecting data from different systems, and based on the kind of analysis we do on that data, we can support different kinds of dynamic policies. Here I will mainly speak about the different kinds of analysis the context server can do. We can start with some simple behavioral analysis, such as the service-to-service mappings we already spoke about; this is denoted by the LP (least privilege) mappings in this figure. We also found some practical quirks we had to work around. All of us know that the cluster IPs or pod IPs of a workload are ephemeral and can change when the application restarts. The service IPs are more static, but often when the proxies intercept the data, they log the cluster IPs and not the appropriate service IPs. So we found that we have to maintain a history of these application-to-IP mappings, keyed by timestamp, to recover some of the behavioral patterns we talked about. This is another example of the kind of context that we maintain and that can be utilized by many policies. Another obvious application is creating dynamic rate limits. An example where we can do more complex, machine-learning-based analysis is automatically identifying the appropriate thresholds for many of these rate limits: we can try to learn the rate limits from past behavior, which we will describe in a little bit. We can also envision adjusting some of these policies based on the perceived threats and risk levels of your organization. This is not something we are doing right now, but it is definitely another interesting direction. That is all for me. Back to you, Sridhi.

So next we talk about dynamic policy threshold estimation. We know that the dynamism and scalability of the cloud make static, manually set thresholds a problem, so what we need is an automatic and intelligent way to do these policy threshold updates.
Our solution should also have memory of past activity, and what could be a better fit for that than an LSTM? LSTMs are long short-term memory networks, a type of recurrent neural network capable of learning long-term dependencies, and they are well suited to time-series prediction. We are using exactly this quality of LSTMs for dynamic threshold prediction.

As the first part of the algorithm, we classify and count the inter-service accesses. By classification and counting, I mean that whenever a request is made, it is classified into a read, write, update, or delete category, and within each category the access counts are tallied. This data is supplied to the LSTM, which works as a forecasting model and predicts the threshold for the next time bin. The same algorithm could also be applied in the case of PII: for example, if you want to count and classify accesses at the granularity of personally identifiable information, say which service is accessing which last-name field and how many times, you can count those accesses, provide them to the LSTM, and the LSTM will forecast what the threshold should be in the next time bin.

Let's see how we are doing it. As Julian mentioned, we take the metrics, pass them to the ingest server, and then into the knowledge base. From the knowledge base, the classifier picks up the data, classifies it into the four categories, pre-processes it, and passes it to the LSTM for training, which gives us a trained model.

Let's look deeper into how the pre-processing and featurization happen. Imagine you have time-series data from 8:00 a.m. to 9:30 a.m. You divide the entire series into bins of five minutes each. Within each time bin, in this case, we calculate the read-access count, so each blue dot in a bin denotes the access count for that particular duration. Then we take a sliding-window approach. For the LSTM we need X and Y labels; the window here is of size three, so we provide the data for one time window as X, and since this is supervised learning, we create a label X4, the count for the next time bin, as Y. We then provide these X and Y pairs to the LSTM for learning.

Once we have a trained model, at the inference stage, when we are doing policy evaluation and policy creation, we pass the recent past data through the trained model, and it gives us a prediction of the threshold for the next time bin. As a request is received by the OPA server, it is evaluated against the threshold counts, and if the access is allowed, the decision is made accordingly.

Here is how we use this predicted threshold in policy enforcement and evaluation. At the inference stage, we take the data for the recent past and pass it through the trained model. As you can see, for 9:40 to 9:45 a.m. we have a predicted count. In real time we also keep counting the number of accesses made, and we compare the predicted count with the real-time accesses. If the error between the predicted and the real-time count is more than what is allowed, the policy produces a deny, as is evident in the time bin from 9:55 to 10:00 a.m.
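A rough sketch of this featurization and forecasting step follows, assuming a small Keras LSTM; the bin counts below are synthetic, the window size of three mirrors the example on the slide, and the model shape and training setup are illustrative assumptions rather than the exact configuration used in this work.

```python
# Minimal sketch: turn per-5-minute read-access counts into sliding windows,
# train a small LSTM forecaster, and predict the threshold for the next bin.
import numpy as np
from tensorflow import keras

def make_windows(counts, window=3):
    """Build (X, y) pairs: X = counts in a window, y = the count in the next bin."""
    X, y = [], []
    for i in range(len(counts) - window):
        X.append(counts[i:i + window])
        y.append(counts[i + window])
    return np.array(X)[..., np.newaxis], np.array(y, dtype=float)

# Synthetic stand-in for read-access counts in 5-minute bins (8:00 to 9:30 a.m.)
counts = np.array([12, 15, 14, 18, 22, 25, 24, 28, 30, 27, 31, 33,
                   35, 34, 38, 40, 39, 42], dtype=float)
X, y = make_windows(counts, window=3)

model = keras.Sequential([
    keras.layers.Input(shape=(3, 1)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=100, verbose=0)

# Forecast the threshold for the next 5-minute bin from the latest window
predicted_threshold = float(model.predict(X[-1:], verbose=0)[0, 0])
print(f"predicted threshold for next bin: {predicted_threshold:.1f}")
```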
We can also look deeper into how the policy is given this dynamic context. The service-access threshold map is simply a map between a service and the allowed access count for that service, and it comes from the LSTM model. Then we have a service-access count map, which is the real-time access count for each service; a small sketch of how these two are compared appears at the end of this demo section.

With this we come to the end of the presentation proper, and now we get to the demo. How is the demo set up? We are not showing the LSTM threshold prediction; in this demo we focus on the dynamic policies created for least-privilege access. The application we use is Bookinfo. In this application, the Bookinfo product page talks to the reviews and details services, and all communication in the application is protected with mTLS. The policy we create is written in Rego, and the dynamic context is provided as a destination-to-source map, which records which sources are allowed to access a particular destination. For example, the product page can only be accessed by the ingress gateway, reviews can only be accessed by the product page, and so on. We evaluate our policy enforcement against a Log4j attack, so let's see whether it works.

In the first part of the demo, we show how the Log4j attack succeeds when no policies are enforced. First we show how the Bookinfo application works: you can see it talks to the details and reviews microservices and returns the reviews of a certain book. Next we start a malicious JNDI server. We look at the pods deployed in the Bookinfo namespace and see that a vulnerable app has been deployed there. As part of the Log4j exploit, this vulnerable app will talk to the JNDI server to execute arbitrary code. In this attack, the vulnerable app tries to talk to the ratings microservice, which should not be allowed, issuing a base64-encoded GET request to the ratings app to pull out its data. Now, when we issue a curl request to the vulnerable app, we include that base64-encoded payload. You can see the curl command was successful and we get a "hello world", and in the JNDI server we get a 200 response. When we check in the vulnerable app whether the file was created and whether the data from the ratings app was stored, we see that a sample file has been created, and looking at its contents, we see that the attack has been successful.

In the next part of the demo we enforce our policies and apply our context server to get the source-destination mapping. Here we start the source-destination recording and provide that dynamic context to the policy. You can see the source-destination recording is up. Then we curl the data context of the OPA server and see that the destination-to-source mapping has been recorded: details can only be accessed by productpage-v1, and productpage-v1 can only be accessed by the Istio ingress gateway. Now we issue the same curl request and see whether the attack succeeds. The curl request appears to go through, but when we check the temp folder, no file has been created, and the Bookinfo application is still working fine. In other words, with our policies enforced, the Bookinfo application works the way it is supposed to, but the Log4j exploit is no longer successful.
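As referenced above, here is a minimal sketch of how the two maps from the dynamic context could be compared at decision time; the map shapes and the deny-by-default behavior for unknown pairs are illustrative assumptions, not the exact logic of our Rego policy.

```python
# Minimal sketch: compare the real-time access count for a service pair against
# the LSTM-predicted threshold for the current time bin.
def allow_request(source, destination, threshold_map, count_map):
    """threshold_map: {(source, destination): predicted allowed count} from the LSTM.
    count_map: {(source, destination): accesses already observed in the current bin}."""
    key = (source, destination)
    if key not in threshold_map:
        return False                     # unknown pair: deny by default
    observed = count_map.get(key, 0)
    return observed + 1 <= threshold_map[key]

# Example: productpage has used 41 of its predicted 42 calls to reviews this bin
thresholds = {("productpage", "reviews"): 42}
counts = {("productpage", "reviews"): 41}
print(allow_request("productpage", "reviews", thresholds, counts))  # True
```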
So this brings me to the last part of the presentation. We can apply this technique to different kinds of policy definitions, covering different aspects of cluster security. The same process could also be applied to application behavior modeling, which we can use for detecting malicious attacks. We also plan to extend this solution to identify whether there are policy needs in other areas where the same threshold prediction can be applied. The major limitation, and the improvement we want to make in the future, is quantifying impact: whenever a prediction is made, there are various external conditions that keep changing, so the forecasting needs to take into account the external factors changing within an organization, for example a change in the risk threshold. In the future we will take these factors into account as well. This brings me to the end of the presentation. Thanks, everyone. Please let me know if you have any questions.

In our first experiments we actually used Istio's predefined authorization policy, but it has certain limitations. By including the OPA server in the policy evaluation path, you can extend the functionality. For example, when you want your threshold counts to be calculated by the LSTM, you can include that as part of the policy evaluation. Whenever a request is made to a service, it gets redirected to the external authorization server, and that server can perform whatever checks you want it to perform. So it is basically extending functionality that you normally don't get with Istio's predefined authorization policies. Any more questions?

It depends. Like every machine learning model, LSTM models suffer from concept drift. So if we see that the error threshold I was talking about varies a lot, we can set up an algorithm that retrains the model, assuming there might be concept drift. Right now we haven't implemented that part, but I am assuming we can do a statistical analysis of the drift and account for it in our evaluations.

We are still working toward open-sourcing it; we haven't open-sourced it yet, we have just filed a patent for this. In the future we are trying to release it as an operator on OpenShift, so we would probably have documentation related to that.