Awesome. Welcome to the CNCF End User Lounge, where we explore how cloud native technologies are adopted by end user organizations across different industries and sectors. The CNCF end user community is formed of more than 160 vendor-neutral companies that use open source software to deliver their products. I'm your host from the CNCF, and today with me is a team from Salt Security who will be talking with us about their journey in the cloud native ecosystem. In these live streams, we bring on end user members to showcase how their organizations navigate the cloud native ecosystem to build and distribute their services and products. Join us every fourth Thursday at 9 a.m. Pacific time.

This is an official live stream of the CNCF, and as such it is subject to the CNCF code of conduct. Please do not add anything to the chat, or ask any questions, that would be in violation of the code of conduct. Basically, be respectful to everyone, especially fellow participants and the presenters, when sharing your opinions. If you have any questions, you can add them to the chat of this video; make sure to ask anything related to the topic, or anything else you want to learn from the folks from Salt Security.

Now, before we dive into the questions, Eli, Omri, and Gal, could you briefly introduce yourselves?

Sure. Hi everyone. My name is Eli, and I'm the platform team lead at Salt Security. Here with me are Omri and Gal; I'll let them introduce themselves now. Go ahead, Gal. Take it. It's going to be a short one: my name is Gal, and I'm running the DevOps operations at Salt Security. And I'm Omri. I'm a platform engineer on the platform team at Salt Security.

Nice. It seems we have all the people powering all the machines in the background here. I know Salt is all about security, but can you share more about your company, what you do, and some of the work you do? Yes. So Salt is about five years old.
We started off with Kubernetes from the get-go. Today we have around 40 microservices in production, across multiple clusters — I think Gal will be able to elaborate a little more. We use a lot of tools and products from around the CNCF landscape. We love the CNCF and the community; it's been helping us a lot lately. In the past few years, we've dived into the whole service mesh journey. We've written a great blog post, recently published on the CNCF blog, about our journey with gRPC, service mesh, and load balancing in Kubernetes, and a lot of fun stuff like that. We'll be happy to share it with everyone. But yeah, it's been really great.

Awesome. Okay. So can you walk us through what your infrastructure setup is like, and the resources and gears that run everything in the background? Gal, you want to take that? So we are multi-cloud; we use several cloud providers, though most of our services are in AWS. We use Kubernetes for all of our microservices. We work in a CI/CD deployment topology, using the Codefresh CI/CD platform to deploy all of our manifests. Basically, it's all about Kubernetes. The entire infrastructure is deployed with Terraform, so we are tightly bound to Terraform and Kubernetes.

Oh, awesome. That's interesting. When did you start this journey? It's definitely been a long journey. When did you start, and why? Well, basically, from the early days at Salt Security we've used Kubernetes. The simplicity and the flexibility allow us to manage our services and make sure they are highly available, and we can maintain them and revert if needed, to guarantee our customers the best service possible.

Awesome. And you mentioned the blog post earlier. I went through it, and you describe how you use gRPC extensively.
Can you shed more light on your usage of gRPC, and any other tools in your arsenal that you use alongside it? Yeah. Maybe Omri can tell us more about why we chose gRPC. Sure. So, as is familiar in all microservice architectures, services need to speak to each other, and you have many ways to do so. We were using direct communication between services sometimes, and sometimes more of an async kind of communication on top of message queues — so we used Kafka, and we evaluated a few other things. But in terms of synchronous communication, we were using the Akka framework to speak between services, and we had some issues with backwards compatibility. While we were changing APIs between the services, we sometimes had to cope with breaking API changes. So we started the process of finding alternatives.

One of the first things we knew existed and always wanted to evaluate was gRPC. Since gRPC uses Protobuf to serialize messages, it has a very nice ecosystem for making sure we don't introduce any kind of breaking API issues. This led us to take the step and widen our knowledge of gRPC and Protobuf. But since gRPC works over HTTP/2, we had a problem there. Load balancing is very native to Kubernetes — Kubernetes does it for you using the Service resource, for example — but with HTTP/2, the story is a little different: the connections are sticky. And we realized that if we did not find a solution to this load balancing problem, we might have some scaling issues. So we started researching this one, and I think I'm going to pass it to Eli, because this was a very interesting piece of research he did. Yeah, the journey around load balancing in Kubernetes was actually a very interesting one. We were looking for a way to solve that load balancing problem.
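The sticky-connection problem Omri describes comes down to where the balancing decision is made. A toy Python sketch (pod names and counts are made up for illustration, not Salt's actual setup) of connection-level balancing — what a plain Kubernetes Service gives you — versus per-request balancing, which is what a mesh proxy adds:

```python
import itertools

backends = ["pod-a", "pod-b", "pod-c"]

def connection_level(requests: int) -> list[str]:
    """L4 balancing (kube-proxy/Service): the backend is picked once,
    when the connection is opened. A long-lived HTTP/2 connection then
    pins every request to that single pod."""
    rr = itertools.cycle(backends)
    chosen = next(rr)                      # one connection -> one backend
    return [chosen for _ in range(requests)]

def request_level(requests: int) -> list[str]:
    """L7 balancing (what a mesh proxy does): each request is balanced
    individually, so load spreads across all pods."""
    rr = itertools.cycle(backends)
    return [next(rr) for _ in range(requests)]

print(connection_level(6))  # every request lands on the same pod
print(request_level(6))     # requests alternate across all three pods
```

With gRPC's long-lived HTTP/2 connections, the first function is effectively what happens: one pod absorbs all the load while its replicas sit idle, which is exactly the scaling issue described above.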
There are a couple of tools available for doing that — you can use proxies like Envoy — but we realized there's an easy way to solve it with Linkerd, so Linkerd was one of the tools available to us. We tried it out, and it was trivial. We ended up not only solving our load balancing issue; we also gained a lot. It's very easy to maintain, and we were able to just look at everything going on between the services in a way that was not available to us before. We could see all the traffic between the services; we could see when excessive calls were being made from one service to another. So when one service was bombarding another service with requests, we could see that live.

On top of that is the whole security topic, which we had been completely overlooking. Today we have encryption between our services — specifically mTLS. The communication between our services is encrypted end-to-end, on both sides, also thanks to Linkerd. That really changed the way we work with our services. It integrates very well with Kubernetes. Like Gal mentioned, we have a multi-cloud setup, both on AWS and Azure, and it just works the same on both clouds, which really enables us to be very flexible in how we test and run it.

Another interesting thing is that it's not only for production. We found it's very helpful when you're developing too. Before we reach production, we use Linkerd as a service mesh to find those problems ahead of time: our developers can go in, see all the anomalies in the traffic between the services, and mitigate them even before we reach production. So it's been a really great experience for us so far. Yeah, that's awesome.
Before we continue with the rest of the questions: as an end user community, sharing stories about how you use these technologies — not just hearing from the vendors — is very crucial for the community, because people want to learn from the lessons and the mistakes that others have made. Did you get to attend the last KubeCon, in NA, or the EU one? Unfortunately, we did not get the chance to attend. We are hoping to attend the next one in May, I think, in Spain. Did you get the chance to submit a CFP? Yeah, we did. We actually submitted a CFP, so we're hoping to attend as speakers as well.

Nice. I would definitely be looking forward to some of the lessons you learned. I took my CKS recently, and I passed, so it's always interesting to learn how others are exploring things around the cloud native ecosystem, especially when it comes to security — with all the supply chain issues and everything happening, we definitely need to know more about how we can all secure our things. Now, on the topic of security: what are the trends you are observing in the cloud native industry that you think should be taken seriously when it comes to security?

Well, aside from service mesh alone, which is also gaining a lot of popularity, I think the concept of chaos engineering is most interesting to us. The ability to fail services and infrastructure on purpose, and to do it so easily — a kind of built-in way of configuring things to fail — is great for us. We've heard that Chaos Mesh was recently accepted as an incubating project. Traffic splits too, and OpenTelemetry is something we're very excited about: a unified way of dealing with all kinds of tools that can gather metrics. Omri, maybe there's something else you can think of. Yeah, thanks.
We are taking a look at OPA as well, along with a new feature of Linkerd that was introduced recently regarding policies between workloads in the cluster. As we grow our clusters and have more workloads, we have to keep them organized — what they call Kubernetes today is the OS of the cloud, and when you run so many workloads on Kubernetes, it sometimes tends to be quite a mess to manage. So we are looking into this from the security aspect especially, to make sure that only the workloads that actually need to speak to each other can do so.

In addition to those tools — well, all of us are still a little sore from the recent Log4j problems, right? Yeah, I know you had many long nights. So now we've implemented a very, very cool way of staying ahead of this: next time the next Log4j is published, we will probably know about it before others do. We have a very intense system that scans our containers before they ever leave our dev machines, and we are using some open source utilities, along with some vendors, to help us do so — tools like Grype and Snyk, for example, and other tools from Anchore that we are evaluating as well, which is super important. There are two main vectors here. In addition to not being able to push an infected image, sometimes — like we saw with Log4j — the image was already deployed in production for probably everyone, and then the zero-day, or the vulnerability, was found when the image was already there. So it's very important to note that sometimes you will not introduce the vulnerability when you add something new; many times you will find a vulnerability in something you already have deployed in your customers' production environments. And it's very important to be able to monitor this.

Yeah, Log4j also spoiled lots of holiday plans, so no one wants that to repeat itself. Sorry, Eli — I think Gal also wanted to mention something.
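The scan-before-it-ships gate described above can be sketched as a small policy check over a scanner's JSON report. This is only an illustrative sketch, not Salt's actual pipeline; the report shape loosely mirrors what a scanner like Grype emits, but the field names and severity threshold here are assumptions:

```python
import json

FAIL_ON = {"Critical", "High"}  # severities that should block the build

def gate(report_json: str) -> tuple[bool, list[str]]:
    """Return (passed, offending findings) for a scanner's JSON report."""
    report = json.loads(report_json)
    bad = [
        f"{m['vulnerability']['id']} ({m['vulnerability']['severity']}) in {m['artifact']['name']}"
        for m in report.get("matches", [])
        if m["vulnerability"]["severity"] in FAIL_ON
    ]
    return (not bad, bad)

# Example report: one High finding (think Log4j) and one Low finding.
sample = json.dumps({"matches": [
    {"vulnerability": {"id": "CVE-2021-44228", "severity": "High"},
     "artifact": {"name": "log4j-core"}},
    {"vulnerability": {"id": "CVE-0000-0001", "severity": "Low"},
     "artifact": {"name": "somelib"}},
]})

passed, findings = gate(sample)
print("build passed:", passed)   # build passed: False
for f in findings:
    print("blocked by:", f)
```

The same gate, rerun on a schedule against images already deployed, covers the second vector Omri mentions: a zero-day discovered later in something that already shipped.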
Yeah — as our scale grows, we are looking for the most efficient, and most secure, ways to manage our clusters. So we are getting into GitOps topologies: we are implementing GitOps so that we can securely manage our remote clusters, with a least-privilege way of managing them. That's something we're also implementing. Yeah, awesome.

I think my next question you've already addressed most of, when Omri mentioned Log4j. Also, for the sake of the listeners who might not be aware: Omri mentioned OPA — that means Open Policy Agent, right? That's what you were referring to, Omri. Yeah. So, awesome. My next question — I think you've covered most of it already. The past couple of years were a huge challenge for everyone in the industry. You already mentioned Log4j; there was SolarWinds; there were quite a lot of supply chain security issues and other things happening. As a company — setting aside your clients — what other challenges did 2021 or 2022 throw at you that really shook the company, but that you were able to scale through?

So I think being a security company puts us on a very high bar; we really can't allow any failures there. But aside from security, I think our biggest challenge today is around scale: being able to scale fast, and doing that with a minimal team. Gal has a fairly small team which does pretty much everything. We're looking at adding more SREs, and we're really looking at observability and reliability as the two big things ahead of us. Being able to leverage technologies that would allow us to, first of all, sleep better — to be on top of those things before they happen — these are the kinds of things we're looking into. But also, when we're deploying things, we'd like to be able to test them out before they reach production.
So canary deployments are really the next big thing for us in terms of deployment: integrating that into our GitOps and CI/CD environment, pushing a new version of some deployment and splitting the traffic — trying it a little bit, seeing if it works, and rolling back automatically — would really do a lot for us. I think those are the big topics ahead. Yeah, awesome. Definitely, getting better sleep is a major motivation for almost all of us, especially when the industry always keeps throwing new things at you. When you think you're done with one, another one comes up. Yeah, right. Definitely.

So I think my next question goes to you, Gal. You already mentioned that you do multi-cloud. Do you also do multi-tenancy? How do you distribute the workloads between your clusters, since you have multi-cloud already? And what challenges does using multi-cloud — and multi-tenancy, if you do it — bring to you? We don't do multi-tenancy at the moment; that's definitely something we are heading towards. I think the most challenging thing about running multi-cloud is managing it: not only in terms of deployments, but having the right scale and the right resources. Your monitoring now needs to support multiple clusters, so instead of a single cluster, you now have a bunch of clusters to take care of. But with observability tools like Datadog — and we use Jaeger as well — we find ourselves managing it the right way, instead of running after each cluster, making sure it has the right resources and operates normally. We usually set our monitoring to notify us when something goes bad, and we learn from past incidents, so we can be reactive rather than chasing after each cloud provider. And I think another challenge is cloud provider issues. It's not that rare — cloud providers do have issues, and that can also affect your operations.
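Coming back to the canary deployments Eli mentioned: the shift-watch-rollback loop can be sketched roughly like this. The step weights and error threshold are made-up placeholders; in a real setup the observed error rates would come from mesh or APM metrics, not a list passed in:

```python
def run_canary(observed_error_rates: list[float],
               steps=(0.05, 0.25, 0.50, 1.0),
               max_error_rate=0.01) -> str:
    """Progressively shift traffic to the new version; roll back
    automatically the moment the canary's error rate misbehaves.

    observed_error_rates: the error rate measured at each traffic step
    (in real life, gathered from monitoring while that step is live).
    """
    for weight, err in zip(steps, observed_error_rates):
        print(f"shifting {weight:.0%} of traffic to the new version")
        if err > max_error_rate:
            return "rolled back"   # automatic rollback, no human needed
    return "promoted"              # canary survived all steps -> 100%

print(run_canary([0.0, 0.001, 0.002, 0.0]))  # healthy release
print(run_canary([0.0, 0.15]))               # errors spike at 25% traffic
```

The design point is that each traffic step is a checkpoint: a bad release only ever sees a small slice of users before the split is reverted.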
But you need to be prepared to shift users from one cloud to another, and have all of these automations that allow you to stay highly available and route traffic between clouds, to support the load or any chance of incidents at the cloud provider level. Yeah, definitely. I want to add to this. Sure, go ahead. As a message to the community: everyone has — or at least most people should have — redundancies. If we are mostly deployed, in our example, in AWS, then we have another cloud provider as the resiliency provider in case something goes wrong. Test it. It's not enough to have it ready for the day, right? You need to test it, because if you don't test it, then on the day you need it, it might not work. This resonates with what Eli said before: we are going to introduce chaos engineering in 2022, and this can be one of the things we will want to test as fast as possible, automatically — not only having a way to do it manually, but automatically. It happens a lot: services go down. It's normal. This is the way the cloud acts, right? When things get super complicated, they tend to break sometimes. So yeah — test your emergency protocols. Life itself can sometimes be chaos engineering, you know.

Yeah. I think even if we don't learn anything else from the pandemic, one thing we should learn is that things can go wrong, and they can go really, really wrong. So, based on what Omri said, everything needs to be tested to make sure we are ready; things can go wrong at any time. And as Omri was mentioning about using multi-cloud and making sure there's redundancy — we've seen provider issues ranging from a service provider whose data center went down, to another one that kept having downtime almost regularly. No matter how big the service provider is, things can go wrong.
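Omri's "test your redundancy" advice boils down to routing around an unhealthy provider — and deliberately failing the primary to prove that the backup path actually works. A toy sketch, with provider names and health results stubbed out as assumptions (a real setup would probe each provider and move traffic at the DNS or load-balancer level):

```python
def pick_cloud(health: dict[str, bool],
               preference=("aws", "azure")) -> str:
    """Route traffic to the first healthy provider in preference order."""
    for cloud in preference:
        if health.get(cloud):
            return cloud
    raise RuntimeError("no healthy provider - page someone")

# Normal day: the primary is healthy, so it takes the traffic.
print(pick_cloud({"aws": True, "azure": True}))

# The chaos-engineering drill: fail the primary ON PURPOSE and confirm
# traffic really lands on the backup, before the day you need it to.
print(pick_cloud({"aws": False, "azure": True}))
```

The drill in the last line is the whole point: if the failover branch is never exercised, you only find out it's broken during a real outage.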
So being able to prepare is crucial. As a company, you definitely need to make sure that in case things go wrong, there's always something to keep the service running, because nobody wants to lose money when things go wrong. Yeah. If I can add: sometimes not being hosted on a certain cloud provider doesn't necessarily mean that none of your services rely on that cloud provider. For example, we saw in the AWS incident that the Quay Docker registry had some issues. So be prepared for that as well: if you're relying on a service, keep in mind you'll always have to have a backup plan for that service too. Which means your provider's providers can go down, even when nothing seems wrong with your own provider. Yeah, totally. Awesome. This is really interesting.

My next question, I think, is still for Gal. I know you definitely have a lot of automations and other things happening. How do you manage costs, automations, and things around upgrades, versioning, testing, rolling out new features, and so on? I think it's mostly a cooperation between the DevOps team and the developers. We need to have the right pipelines and the right capabilities to allow developers to add these tests before they reach production. And it's not only in the testing phase: once something is deployed to production, there's a challenge to monitor it properly. Once a feature is released, having the right visibility on that feature will allow you to better understand how it behaves over time. It doesn't mean that the moment you release a feature, it will work properly for good; you need proper visibility and monitoring over time to detect issues and needed fixes. So it's mostly a cooperation between the developers and DevOps to build this platform. Yeah, that's awesome.
But with all of this we've been talking about, what do you think is next in terms of cloud native at Salt Security? Well, so far there's a real abundance of tools and products available. Every day we go on the CNCF landscape and see new tools, and we have a backlog of things we want to go through. There are so many projects that we can't cover them all; we have a lot of plans and not enough time to try things out. The tools available today are filling up our backlog, and we have things to look at — mostly around Kubernetes, things that are supplements to Kubernetes, and monitoring. I can't think of specific projects we haven't already mentioned that we're looking into, but all of those are our plans going forward. So yeah.

Yeah, awesome. Sorry, I clicked on the wrong thing on my system. And I think — Omri, you work on the platform, right? So I think you'll be able to talk more about developer experience. With introducing all these new technologies, and bringing all these upgrades and improvements to your clusters, your developers need to use them. How do they interact with your clusters, and how do you manage the development life cycle, maintenance, and troubleshooting with them?

Yeah. So it's a really, really big thing, where we are trying our best to allow our developers not only to be able to use this stuff, but also to understand this stuff. Because when something breaks — in a company like Salt, we are recruiting all the time, and we are scaling not only in our workloads but also in our human resources — you want to be able to introduce new developers to the system.
And when you introduce a new technology, you want your existing developers to be able to use it properly — but, no less important, like I said, to understand it. Because if they don't understand it, then we become the bottleneck for solving everything, because everyone will come to us. Since our development environment became quite a challenge to run on a local dev computer, we use Kubernetes to test our stuff. If you remember, one of the twelve-factor app recommendations, or standards, is to have the developer work in an environment that is as close as possible to production — and this is how we do it. We use Telepresence to connect to these clusters. Telepresence is a really cool project, and their latest version is very nice; we are using it extensively. This way, when a developer is working on one of our many services, he can run only that service on his local machine and have it talk to the rest of the cluster as if he were sitting inside the cluster. This is a project we adopted a while ago; people are using it and loving it.

It also helps us save on costs, because rather than having one giant cluster that the developers are fighting over, we have many small namespaces, for which we pay less, and the developers are not waiting in line to test their stuff on the cluster. So when a developer wants to test something, he can set up a whole environment just for himself; it will be spun up, and he can test his work on it — and even debug one of those services on his local machine by routing the traffic into his laptop, resolving DNS requests as if, like we said, he were in the cluster. One of the services can talk directly to the pods in Kubernetes, which gives you the feeling that you're inside the cluster and can actually develop inside it. That really solves a lot of problems we had before.
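The Telepresence-style flow Omri describes — one service intercepted to the developer's laptop while everything else resolves in-cluster — can be pictured with a toy resolver. Service names and addresses here are invented for illustration:

```python
def resolve(service: str,
            intercepts: dict[str, str],
            cluster_dns: dict[str, str]) -> str:
    """Route a request: an intercepted service goes to the developer's
    laptop; everything else resolves to its in-cluster address."""
    return intercepts.get(service, cluster_dns[service])

# In-cluster service addresses (what cluster DNS would normally return).
cluster = {"orders": "10.0.0.12:8080", "billing": "10.0.0.31:8080"}

# The developer intercepts just the 'orders' service to their machine.
laptop = {"orders": "localhost:8080"}

print(resolve("orders", laptop, cluster))   # localhost:8080 — hits the laptop
print(resolve("billing", laptop, cluster))  # 10.0.0.31:8080 — stays in cluster
```

This is why it feels like developing "inside" the cluster: only the one service under work is rerouted, and its dependencies keep resolving to real pods.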
That's really awesome. It's been interesting hearing your experiences and some of the insights you've been sharing — I know our community will definitely appreciate this kind of content. As an end user company, the CNCF landscape is like this gold mine of new things: there's always something new coming up, new technology. Sometimes you visit it, and the next time you go back, you're like: ooh, this thing has expanded, there are new things here. So what's your experience like as an end user company in the cloud native ecosystem? Can you share more about your experience, as individual practitioners within the ecosystem and as a company?

Yeah. As you said, the CNCF landscape is an awesome website; we learn a lot by browsing it every few days. And so far the community has been amazing. We never had a project where we needed some community involvement and couldn't find the people, or where the people were not responsive. We really think the CNCF has made a great community, which is supportive and very fun to be involved in. In our experience — for example, the process we had with the Linkerd community, which was very fun — we had some questions and approached them in their Slack, and we developed a really warm and nice relationship, in which we are hosted on their community calls, talking about the features we use and educating other people in the community. So yeah, that was our experience. It was very nice.

Yeah, that's really awesome. One of the great things about the cloud native community is that you're always welcome — you get that welcome feeling even when you come in completely new to the ecosystem. It's easy to learn new things and also to learn from the experiences of others. And I also noticed something in your background, Omri — I guess that was fun, and probably a good one. Yes — one of them was actually made by my girlfriend.
And this gopher is from GopherCon Tel Aviv, right before COVID hit — it was the last moment. So yeah, I love Go. Awesome. Eli, you want to say something? Yeah, I'm just saying — the Go gopher, you know, we love Go, and I love the gopher specifically; I think it's awesome. Nice. Yeah. So I think I've exhausted most of the questions that I have. Is there anything else you want to shed more light on, or share with the community? I think we mentioned it, but the Service Mesh Interface is something that we really, really want to see evolve in the coming months. We really hope it gets pushed forward, with Istio and Linkerd and dashboards like Kiali. That's something we would really like to contribute to and see how it evolves. That's the big thing we're hoping to see. And I think that's most of it.

Yeah, awesome. Thank you very much, Eli, Omri, and Gal. Thank you for joining us today — it's really been an enlightening session. I personally liked it a lot: I've been diving more into Kubernetes security lately, and hearing some of the challenges and lessons you've learned resonates with a lot of things I've been delving into. It's awesome to see people doing most of it in real life. Now, thank you very much, everyone, for joining the latest episode of the Cloud Native End User Lounge. It was great to have the team from Salt Security talking to us about their usage of the cloud native ecosystem, and the security landscape of cloud native as a whole. We also really loved the flow of the questions and some of the awesome new information that was shared with us. We bring you the latest cloud native end user stories on the fourth Thursday of the month at 9 a.m. Pacific time. Don't forget to also join us for KubeCon + CloudNativeCon EU in May; hopefully, you'll hear the team from Salt Security sharing more about their experiences at the conference.
Hopefully, we are able to meet in Spain. Definitely, we're going to Spain. Maybe at last, I guess, we'll travel to a conference in real life. Yeah, exactly. And you'll also hear a lot of the latest things from the cloud native community. If you'd also like to showcase your usage of cloud native tools as an end user, join the end user community — more details at cncf.io slash end user. Thank you very much for joining us today, and see you next time. Thanks for having us. Thank you. It was great.