Thank you. OK, so as he mentioned, my name is Alexa Griffith, and I work in the Cloud Native Compute Runtimes organization at Bloomberg. I joined the inference team, which focuses on building infrastructure for serving our AI and ML models, around a year and a couple of months ago. Speaking virtually with me today is Jenny Fu, my colleague. She's in the same organization as me, but on a different team called BPaaS, which is Bloomberg Platform as a Service. That is also a platform for deploying services onto Kubernetes. For those of you who happened to see any of my talks or blogs, you know that I love to include poorly drawn stick figures in my presentations to make things a little more fun, because why not? So prepare yourself for those in the slides. Briefly, I want to introduce Bloomberg to paint a better picture of the traffic we have and the scale we are dealing with. Here's a little bit about Bloomberg. We run on data, but we're a technology company. It was founded in New York City. I personally work in the New York City office, and Jenny works in the California office, which is why you will see some California oranges in the slides soon. We have a lot of consumers, around 35,000 subscribers, and Bloomberg employs around 20,000 people from all over the world. It's known for its product, the Terminal, which is essential to financial workers, but it's also a news company, and it actually employs more reporters than The New York Times, The Washington Post, and the Chicago Tribune combined, which is impressive. There are a lot of engineers at Bloomberg, over 7,000, and a growing chunk of those work on AI, ML, and NLP, like we do. Every day, over 300 billion pieces of real-time market data are ingested at a peak rate of over 15 million events per second. We also have around 2 million news stories published every day, and over 2 billion messages handled daily on our platform.
So per the previous slide, you can imagine that Bloomberg runs thousands of microservices. Let's explore the ways we can leverage the mesh to help us out a bit. As I mentioned, we receive and ingest a high amount of traffic, so traffic control and security are very important and useful features for our platforms. In this talk, we're going to cover two main topics about how we use Istio ingress and egress for our access control. First, we're going to assess our use case with ingress to regulate who or what can access a resource. Afterwards, Jenny is going to touch on how we use egress to check whether something like a principal identity is authorized to access external services or storage. We'll be doing this with the help of our very realistically drawn Ms. Security and some traffic symbols. But before we get into that, I want to briefly discuss what we're working on on my team and the teams closely related to mine. We run our workflows on Kubernetes, as you probably have guessed, and we have a couple of different types of workflows in our AI data science platform. My team focuses on building a platform for inference, which is basically serving models. These are long-running services powered by KServe, the open source project that my team owns and works on. They can consist of multiple pods in a graph and run in a serverless environment using Knative. As part of our AI platform, we also have training jobs, which are short-lived jobs that shut down once they're completed. Additionally, we have Jupyter notebooks where many of our AI and ML engineers run experiments. Each of these has slightly different access patterns, but all of these workflows live within our platform, and they have a unified goal: building a platform for AI and ML engineers to easily deploy their models and manage their model lifecycle. So all of these models have some similar things in common, right?
They have strict auth requirements that we need to abide by, and we need to integrate with our own in-house Bloomberg permissioning setup. We have our own roles and permissions to integrate with, so we need tooling that's flexible and allows for that. Our platform has the three main use cases I mentioned before, but we're going to talk about just two in this talk. We're going to focus on inference services to discuss the ingress features that we have, and second, we're going to discuss how we set up the Istio proxy with Jupyter notebooks; training is the third use case, which we won't cover here. So for now, let's focus solely on how ingress benefited our inference workflows and got us closer to achieving the security that we need. Here I have a general architecture layout of how ingress is used for our inference services. We currently have all of our inference service requests going through our Istio ingress gateway, which has virtual services for each inference service. Inference services can get a little complex. We have what we call inference graphs, where multiple pods compose a whole service. Here we have a very simple, common pattern, which we call a transformer-predictor. We don't really need to go into the details of what a transformer-predictor is; basically, what you need to know is that your request hits the transformer, which is the first pod, which calls pod two, the predictor, which returns the response to pod one, which returns the response to the ingress. For each of these services, we might have replicas like we do in blue, so one blue request will route to a blue replica and a green one will do the same, et cetera. So let's discuss a case study that highlights how using Istio's ingress has helped us with traffic control. Using this silly drawing, let's dive a little into what we observed. A client was sending requests to our mesh and hitting an inference service.
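As a rough illustration of the routing described above, a virtual service bound to the ingress gateway might look like the following sketch; all names, namespaces, and hosts here are hypothetical assumptions, not Bloomberg's actual configuration.

```yaml
# Hypothetical sketch: route inference traffic from the Istio ingress
# gateway to the first pod (the transformer) of an inference graph.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-model                  # hypothetical inference service name
  namespace: inference            # hypothetical namespace
spec:
  hosts:
  - my-model.example.com          # hypothetical external host
  gateways:
  - istio-system/istio-ingressgateway
  http:
  - route:
    - destination:
        host: my-model-transformer.inference.svc.cluster.local
        port:
          number: 80
```

The transformer then calls the predictor pod internally, so only the graph's entry point needs a route on the gateway.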
The client observed really long timeouts and failures, or maybe some dropped requests. They were trying to understand why, so they came to us and asked: hey, what's this latency about? Is it an inference service issue, or is it a network issue in your cluster? What we basically have here is a road with no rules, right? Maybe a car crashed. We don't know if it made it through or not. Maybe it sped through really quickly. There's not a lot of traffic control. So this is a red light, right? Not good. We need to identify the root cause of this issue, and we need a way to track what has crashed or stalled, what made it in and what made it out, and how long that took. So how can we provide some insight into this issue? Luckily, with Istio ingress, we have some out-of-the-box features we can use. Without changing any of our code, we can use the x-request-id header that the Istio ingress uses to label requests. Now, the client was sending requests without this header set, so on their end, they couldn't track requests from the client side. But even without that set up, we could still identify each request within our mesh from our side, from when it entered the ingress to when it exited, because the ingress will set the header itself if it's not already set. One way we can make this better overall is to ask the client to set the header on the request, which the ingress will automatically respect, so that they can fully track and identify the request as the response returns and reaches the client. This would give us a green light, right? We're basically slapping a license plate onto the car, or the request, to identify it and be able to debug issues. So now we have all the requests with the proper headers using Istio ingress, and we can track the progress of each request.
What's nice is that we can also observe the total time a request took in our cluster, from when it entered to when it exited the ingress, via the x-envoy-upstream-service-time response header. This was especially useful here, and there are other headers as well. So now we have a green light. Traffic can be observed more securely. We have rules set up, with tracking via a request ID as a license plate, and with it we can find out information about the request, like latencies. Again, this is really nice because of how little code it requires. It's just out-of-the-box observability within the ingress, and it's easily configurable on the client side as well. So now we have a better way to debug and understand our system and the traffic flowing through it. Now I want to pivot a little and discuss another use case: the Jupyter notebooks. I mentioned previously that our AI and ML engineers use these for experimenting, and the way they access them is through a UI. Before the diagram on the screen was implemented, we handled authorization inside of our job controller directly. This was hard to maintain and clunky, because we shared logic across teams and there was a lot of code duplication. It still worked, right? But I say this to point out that one of the nice things about Istio is that we could change this whole logic and flow of how we set up our authorization without an actual change in the service code. We wanted to move to an external authorization service to have a more centralized policy and management system, so that we can easily manage policies external to Istio and protect our container without changing the rest of the system. So we implemented this authorization as a service, as you see in this diagram, with Open Policy Agent, or OPA. The OPA service that we created is paired with the Istio proxy. OPA is an open source, general-purpose policy engine that unifies policy enforcement.
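To make the two headers concrete, here is a hypothetical request/response exchange through the ingress gateway; the path, host, and values are illustrative only.

```text
# Request into the ingress gateway; the client sets its own x-request-id
# (otherwise the ingress generates one)
POST /v1/models/my-model:predict HTTP/1.1
Host: my-model.example.com
x-request-id: 7f3a2b1c-9d4e-4f8a-b0c1-2d3e4f5a6b7c

# Response back to the client; the same x-request-id comes back, and
# x-envoy-upstream-service-time reports the upstream time in milliseconds
HTTP/1.1 200 OK
x-request-id: 7f3a2b1c-9d4e-4f8a-b0c1-2d3e4f5a6b7c
x-envoy-upstream-service-time: 42
```

Because the client reuses the same ID in its own logs, a slow or dropped request can be correlated end to end.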
So let's take a look. Here, the user signs into a UI and uses their account to create or access a notebook. We send the request to the ingress, which routes to our authentication service. Without going into too much detail here, the authentication service just follows standard OAuth2 and OpenID workflows with SSO. The authentication service will return an OIDC token; theoretically, this could come from any OpenID identity provider. After that, we route to the Jupyter notebook pod, which has an Istio sidecar in it with instructions to route to our external auth service for authorization. We write our policies in custom Rego, which is OPA-specific and helps us support our Envoy input. So we've discussed two different workflows that utilize our Istio functionality, and here I want to highlight that these workflows have two very different access patterns; it's really nice to be able to use the same tool, Istio, for both. On the Jupyter notebook side, we have access patterns where the UI requires an OIDC token from single sign-on and user interaction. It's interactive, right? A username and password are exchanged for credentials. I also want to point out, just for the sake of space, that I didn't fully draw the authorization pod or the containers in the pods on that side, but the architecture for authorization is exactly the same on both sides, and we use OPA as well. On the other side, we have the inference service, which has the access pattern of a long-running service that requires programmatic authentication. The request comes with a token like a JSON Web Token (JWT), and I'll talk more about SVID tokens next. When it hits the Istio proxy, we call our authentication service, and then the Istio proxy routes to our authorization service.
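A minimal sketch of wiring an external OPA authorizer into the sidecar could look like this; the provider name, namespaces, ports, and labels are assumptions for illustration, not Bloomberg's actual setup.

```yaml
# Hypothetical sketch: register OPA as an external authorizer in the
# Istio mesh config, then delegate authorization for notebook pods to it.
# (meshConfig excerpt)
extensionProviders:
- name: opa-ext-authz
  envoyExtAuthzGrpc:
    service: opa-ext-authz.authz.svc.cluster.local   # hypothetical OPA service
    port: 9191
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: notebook-authz
  namespace: notebooks            # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: jupyter-notebook       # hypothetical pod label
  action: CUSTOM                  # defer the decision to the provider
  provider:
    name: opa-ext-authz
  rules:
  - to:
    - operation:
        paths: ["/*"]
```

With this shape, policy changes happen in OPA's Rego, and neither the notebook image nor the controller code needs to change.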
So yeah, both of these access patterns are handled without changing the services' code, which we love. For future integrations, we can use SPIFFE, the Secure Production Identity Framework For Everyone. It's a set of specifications for securely identifying software systems. At Bloomberg, we embrace open standards like SPIFFE to support cloud environments securely, to securely identify each and every process without human intervention, to align with open standards, to avoid long-lived static keys, and to trust and federate across different environments. If you're not aware of how this works, very briefly: a SPIFFE ID is a URI with a spiffe:// scheme that identifies a workload within a trust domain. A SPIFFE Verifiable Identity Document, or SVID, contains a SPIFFE ID and is constructed as either an X.509 certificate or a JWT in the trust domain. The workload uses an SVID as proof of identity. SPIRE, the SPIFFE Runtime Environment, is a production-ready implementation of the SPIFFE specifications. When SPIFFE is used in conjunction with SPIRE, which attests the identity of a workload before issuing the credential, it provides a strong and secure identity for services. Together, SPIFFE and SPIRE provide a robust identity solution that can be used to secure services. So finally, we have reached the end of our discussion on ingress. To review, we looked at a case study that highlighted some of the out-of-the-box benefits of using Istio for more traffic control and observability. We also looked into ways we use the Istio proxy to easily configure our authorization policies and plug in our OPA service to fine-tune our AI workflows with various access patterns. Next, I'm going to turn it over to my colleague, Jenny Fu, to discuss more about how Bloomberg uses Istio's egress. Thank you, Alexa. And hello, everyone.
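For illustration, a SPIFFE ID carried by an SVID might look like this; the trust domain and workload path are hypothetical.

```text
# A SPIFFE ID: a URI under a trust domain that names a specific workload
spiffe://example.org/ns/inference/sa/my-model

# Istio issues SPIFFE-compatible workload identities, which appear in
# authorization policies as cluster.local/ns/<namespace>/sa/<service-account>
```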
Today, we will see how Istio helped the Bloomberg platform team go from zero outbound traffic to fully controlled outbound traffic. The Bloomberg Platform as a Service team, BPaaS for short, provides a platform that helps Bloomberg engineers deploy their services to Kubernetes clusters on managed cloud. We have more than 30 clusters, and every single cluster contains 75 to 200 worker nodes. Every single worker node has its own IP address, which means every cluster contains a list of IP addresses. If we need to recreate a cluster for any reason, those IP addresses might change. In the graph, you can tell that in each cluster we have multiple namespaces, and the pods living in those namespaces will be randomly scheduled and deployed to different nodes. This is not the full picture of our architecture, but it will help us understand the following content. This architecture strikes a balance between Bloomberg's traditional architecture and cloud-native design. Currently, we have already onboarded more than 1,000 users and more than 2,000 applications. However, we receive more and more requests about traffic to the public internet. Unfortunately, we cannot support them easily. Why? Currently, the Bloomberg security model is based on IP addresses. That means any time any user wants to communicate with the public internet, the user needs to go through the security team for review and approval. If any pod in a cluster wants to open a connection to the public internet and the use case has to be approved, it means three things. First, the security team will put a list of IP addresses into their allow list. Second, all other pods in the same cluster will get the same access to the public internet without a security review. And third, every time we recreate a cluster, if the IP addresses have changed, we need to add all the new IP addresses to the allow list again, which will definitely involve another round of security review.
Those are big, big red lights. To make the current limitations easier to understand, let me introduce some great helpers. Please welcome our Ms. Security. Ms. Security helps us review all the security requests and keeps our environment secure all the time. Next, please welcome our Mr. Requests. This is our lovely friend in the neighborhood and our user in the system. Last is the precious permitted California orange, which will only be granted by Ms. Security after a security review, just like those approved network connection accesses. Think about it this way: Mr. Requests wants a California orange, so he asks Ms. Security if he can get one. After reviewing and evaluating the information that Mr. Requests submitted, Ms. Security says: hey, yeah, everything looks good, here you are, and passes the orange to Mr. Requests. However, this permitted California orange should only be used by Mr. Requests, and there is no limitation or restriction to enforce this. So unfortunately, Mr. Requests shares this permitted California orange with all his neighbors by accident, and none of his neighbors were specifically approved by Ms. Security. What happens? When Ms. Security notices this, she's very unhappy to see that people got the permitted orange without her knowledge. So she just takes the oranges back and stops giving anyone any further permitted oranges. Mr. Requests is also unhappy about never getting an orange again. This doesn't look safe at all, so this traffic gets a red light. In the real world, in the current situation, our security team would never approve any request to open a connection between the BPaaS environment and the public internet. Due to this limitation, some users are not able to move all their services to our BPaaS platform. That is a huge loss for us as a platform team. So we need a solution.
As a platform team, we definitely want to keep growing and onboarding more and more users and applications. So we need a safe way to open the connection between our environment and the public internet, and make both our security team and our users happy to see the traffic finally working. Before researching a solution, we needed to clarify the requirements based on our needs. First of all, we should be able to deny everything by default. This is the base of our security model. Secondly, if anyone requests either a California orange or an open connection between the BPaaS environment and the public internet, we should be able to let the security team review the request and grant access based on it. Last but not least, we should be able to grant access safely. For example, when a user asks for a California orange, we should make sure they get one orange, not an apple or a pear, and that only they can have oranges; no one else should get them by accident. After researching, we decided on an Istio-based solution. We make use of Kubernetes network policies and Istio authorization policies to control the traffic. We will dig a little more into each resource listed here later. Kubernetes network policy helps us control traffic at the TCP and UDP level, which is recognized as network layers three and four. We ask our users to provide the outbound hosts and/or IP addresses they are trying to hit, and we add them to the network policy to allow the traffic. Here is an example of a network policy; as you can tell, we add the IP addresses to our allow list. We use the Istio egress gateway to achieve the open traffic. Authorization policy is what we use to control the application-level traffic, which is recognized as network layer seven. We are going to give some examples of a service entry and an authorization policy. We need a service entry to register the host name that the user is trying to access.
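The two resources mentioned above could be sketched roughly as follows; the names, namespace, host, and address are hypothetical (203.0.113.10 is a documentation-range IP).

```yaml
# Hypothetical sketch: allow egress only to one approved external IP
# (network layers 3/4), and register an approved external host in the mesh.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-approved-egress
  namespace: team-a               # hypothetical tenant namespace
spec:
  podSelector: {}                 # applies to all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 203.0.113.10/32     # the approved external IP
---
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: approved-external-api
spec:
  hosts:
  - api.example.com               # the approved external host
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: tls
    protocol: TLS
```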
And we make use of the mTLS that Istio supports natively, so the traffic inside the service mesh is encrypted. We have two types of authorization policy in the system. The first one helps us deny all traffic by default. The second one controls the traffic at the application level. In our case, a namespace selector doesn't work well with our architecture, so finally we decided to create a customized service account for the applications that enable the service mesh, and use this customized service account as the identifier in the authorization policy to take charge of the egress traffic. This solution resolves all the problems that we have. Let's see the new security model that we can provide to our users. Once users notice they have a need to communicate with the public internet, they need to submit a request to the security team. They definitely need to provide as much information as possible about what their applications are doing, why they need to communicate with the public internet, and which specific hosts or IP addresses they need to access. Just like how you convince your boss that you need to attend Istio Day, right? After the request has been submitted, the security team will be able to review all the details about the use case. If the details are not clear, Ms. Security will ask more questions, provide some feedback, and even request changes based on security concerns. That is the reason we highly recommend that users submit the request as early as possible. Finally, the user will get an approval or a rejection from our security team, and we will go to the next step. If the request has been approved, we as a platform team provide the application that can translate, save, and apply those policies safely. We implemented a YAML translator to help us translate the human-readable policies into policies Istio can apply.
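The two kinds of authorization policy described above could be sketched like this; the namespace, service account, and host are hypothetical assumptions.

```yaml
# Hypothetical sketch: an empty-spec AuthorizationPolicy denies all
# traffic to workloads in its namespace by default.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: istio-egress         # hypothetical egress gateway namespace
spec: {}
---
# Allow only workloads running as the approved, customized service account
# to reach the approved external host through the egress gateway.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-approved-egress
  namespace: istio-egress
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/team-a/sa/approved-egress-sa"]
    to:
    - operation:
        hosts: ["api.example.com"]
```

Keying the allow rule on the service-account principal, rather than a namespace selector, is what lets only the reviewed application, and not its neighbors, use the opened connection.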
S3 buckets, which Bloomberg supports, are used by this application to save those policies for reference and disaster recovery. We also implemented a magic deployer to apply those policies periodically, which includes a reconciliation process to keep the traffic control up to date. That's all I want to share for today. Thank you so much for your time. I will hand this over to my colleague, Alexa. Nice, thank you. Yeah, so thank you so much for attending. I really appreciate it. Do we have any questions for Alexa? Yeah, this isn't really an Istio question, but you mentioned that you use SPIRE for your services to auth, using X.509 certs. I was just wondering, have you encountered any issues around renewals or auth token rotations? And if you have encountered these issues, for example token expiry and things like that, how did you address them? Yeah, thanks. So the question was about using SPIFFE/SPIRE and having any token issues. We actually have a plan, and it's an initiative we've been working on, implementing SPIFFE/SPIRE, but as far as my team goes, it's a work in progress right now. So we actually haven't run into those issues, but if you have, that would be really good to talk about, just in case, so that we know as well. Yeah. Another question? Yeah, a quick one. In the previous presentation, we saw the firewall and Istio. Are these tools managed by the security team or by your platform team? Who has ownership of the tools? That's the question, because the firewall is for security. So how are they sure that they can enforce the rules, or if you violate them as a platform team, how do they allow you? Just a general question. Yeah, so the question was about ownership of security things like the firewall versus the platform. And I will say, I will defer the official answer to Jenny.
She will be in the chat, because that is her team's ownership, and I think she can speak on that a little more. But at Bloomberg, we do have a close connection between teams in our org, and we have an SRE team as well that we work closely with. All right, and we have time for one last question here. Thank you very much. My question is: if I understand it correctly, the approvals will be handled inside the cluster. So does that mean the firewall allows all the egress traffic by default, or do they still also have to open it when a user opens a ticket? And by extension, what happens if you recreate the cluster; do they have to move the whole config? Yeah, it's a great question. So it's another question about egress. I hate to do this, but I will defer it to the chat, because Jenny's not here and that is her expertise. She really wanted to be here, but she couldn't for other reasons; she will be in the chat, and I'm sure she'll be really happy to answer that. And in case you're not already part of the chat, that is channel number three, co-located events, on the CNCF Slack. Hope to see all of you there getting more information. A huge round of applause for Alexa and for Jenny. And now I'd like to welcome Zach Butcher to the stage.