Hi, welcome to KubeCon and CloudNativeCon. This session is pre-recorded and we are speaking to you from the past, but we will be present, as we will be watching the chat, so please feel free to ask any questions. We will talk about threat hunting at scale by using Kubernetes audit logs and kernel events. On this journey, we are not going to deep dive into each technology that we utilize for our purpose; it's more than that. We want to tell our user story, show the way we did it, and demonstrate how everything works in harmony in our production.

Let's get started. I am Furkan Türkal, and I maintain production-level clusters as a platform engineer. I am also contributing to various CNCF and open source projects. Hello, my name is Emin. I have been working on containers and the technologies around containers for the last two years. I love working on open source projects, and I love watching car videos.

Okay, before we get into it, we want to introduce Trendyol. It's a tech company focused on e-commerce in Turkey that creates a positive impact on our country and the ecosystem. At Trendyol, we are running thousands of Kubernetes workloads, Elasticsearch clusters, SQL databases, and much more. We are highly motivated to include open source projects in our systems and to be part of the open source community. We manage our infrastructure on-prem, distributed across seven different data centers running highly available Kubernetes clusters. This is going to be our very first CNCF presentation as Trendyol. As you see here, this is our infrastructure metrics dashboard; you can monitor our metrics in real time by just visiting informatrix.trendyol.com. Before we get into the presentation, we want to thank our platform team for their great work and send special greetings to our squad for their awesome teamwork.

Okay, who is this session for? If you are curious about runtime security and threat hunting in production and want to know how you can monitor an entire system, this session is definitely for you. We will learn how we detect threats, automate the process of collecting and analyzing all the logs to monitor potential security threats, and then triage these threats with appropriate actions. Moreover, this session will ease investigating suspicious activities in your infrastructure and help you create a baseline for monitoring and alerting. Best of all, it's all done with open source projects.

Okay, in this session we'll talk about threats and our security pipeline. Related to that, we'll explain the audit logs, the runtime security, the log processing, and the monitoring. Besides that, we will reproduce our mindset in a demo at the end of the session. Furthermore, we want to share some ideas for cluster security monitoring.

Okay, what is a threat? COVID-19 is a threat to humans. Of course, everyone knows what a threat is, but before we start, we want to simplify our subject with a well-known example. Nowadays, COVID-19 is hopefully losing its effect, but we are still suffering from it. The sickness is simply a threat to us. When you become ill, you start showing some symptoms like high fever or cough. So basically, the threat comes from outside, and our body monitors itself and starts alerting us to take action when we get ill. We're going to follow the same basics for the threats in our Kubernetes environments. We should focus on our defense to minimize the risk, and we have to know what we are surrounded by.
Here we have a bunch of examples: privileged containers, reading credentials, accessing the file system. There will always be actors trying to intimidate your environment, and you have to be aware of malicious applications and users to protect it. So first of all, you need to define your trust boundaries. Our trust boundaries are basically clusters, nodes, and, of course, containers. The main key point here is to create a focused defense area, for which we should create fine-tuned rules and alert mechanisms. Especially if you work at high scale, you need to label your logs to mark where the data is sourced. If we are able to find replicable patterns to detect threats by parsing logs, we can easily optimize our hunting.

Here is our security pipeline; we basically call it our security pipeline at runtime. We're going to cover our threat hunting topic under four different headlines. Each headline will introduce open source projects, and we will show our configuration. At the end of the session, like we said, we will use those applications in a demo. If you're not so familiar with these technologies, don't worry, we'll give a brief introduction to each one. So here, it's time for Furkan to explain audit logs and Falco.

Thank you for the great introduction. Here are audit logs. Visibility into events in the container and the system gives great insight into what an attacker or malicious user is trying to do. We can define audits as the answer to questions such as: when did it happen, what happened, where was it going, who initiated it, and so on, which are defined very well in the Kubernetes documentation. Kubernetes audit records the actions of users and workloads that call the Kubernetes API and provides an audit trail. Kernel-level auditing allows us to monitor system calls, file accesses, and much more. Eventually, we obtain the information about what's happening in the Kubernetes cluster.

As for Kubernetes audit events: every action we take when interacting with the cluster, whether using kubectl or anything else, goes through the Kubernetes API server. What Kubernetes audit does is provide a chronological set of records that documents all the changes to the cluster. Each audit event is actually a JSON object that includes the event time, response status, who initiated it, IP addresses, user agents, and you name it, all that kind of information. You can either activate it before you set up the cluster, or you can access the master node to set the Kubernetes audit flags, but you have to restart the Kubernetes API server eventually. Kubernetes audit supports two backends: one is the log backend, and the other is the webhook backend, which can simply send events directly to an upstream server, while the log backend stores them in the file system. For our case, we are going to use the log backend, and you will understand why in the next slides.

Okay, what should be recorded? Here you can see a simple policy configuration that has two different rules. Listing the pods, for example, doesn't give you any critical information, whereas getting a Secret, getting a ConfigMap, or reading a pod's status (whether it's ready or not) might. The audit policy feature allows you to minimize verbosity, like changing the log level or eliminating unnecessary logs that are not that critical for different kinds of resources.
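Here is a minimal sketch of such a policy, assuming the rough shape we just described; the exact resources and levels are illustrative, not our production policy:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Rule 1: don't record read-only pod requests; listing pods
  # doesn't give us any critical information.
  - level: None
    verbs: ["get", "list", "watch"]
    resources:
      - group: ""
        resources: ["pods"]
  # Rule 2: record access to Secrets and ConfigMaps at Metadata level,
  # so we see who touched them without logging the sensitive payload.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
```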
Falco is a runtime security engine. It's an open source security tool for continuous risk and threat detection across Kubernetes, containers, and the cloud. Falco detects unexpected behavior, configuration changes, intrusions, and data theft in real time. Falco also has support for Kubernetes audit and kernel events, and it uses different methods to collect the data: the kernel module and the eBPF approach.

Here is the architecture diagram of Falco; the diagram pretty much explains everything itself. When the Falco engine starts, it simply loads the rules. The engine then waits for the events emitted by the libscap and libsinsp components. Basically, libscap is responsible for collecting data, while libsinsp is responsible for enriching the data. Falco can collect system events using either the kernel module or the eBPF probe; you can only have one active at a time. It also supports Kubernetes audit, and all you have to do is enable the web server and send the events to it, which we do using Fluent Bit. In summary, Falco simply takes the events and matches them against the preloaded rules; when an event matches one of the rules, Falco sends it to one of the outputs, such as a file, gRPC, or a webhook, and you can also use falcosidekick, the Falco Go client, and falco-exporter.

Okay, Falco configuration. We are managing our Falco using the Helm chart, where the kernel module driver is enabled by default. To change that, you can simply overwrite this value, and it's also possible to add new rules or overwrite the rule files, as you can already see here; these configurations are just typical Helm values. With the following configuration, we are now ready to scan Kubernetes audit logs.

Falco rules. Falco has the ability to extend the rules with conditions and macros for more flexibility. It also ships with many rules by default, and these rules are split by their own context, for example Postgres, Kubernetes, Elasticsearch, and etcd. You can modify, say, Consul-specific conditions, or you can add your custom logic with macros without changing the rule itself.

Okay, here's an example of what a Falco rule looks like. We are using this use case for detecting a privileged container that launches in the cluster. It's a simple logical expression combined with some conditions. When you first deploy Falco with the default rules, they cover many aspects for a variety of situations; however, this might cause some noise in your environment. This noise might mislead you if you are at high scale and throw many false positives. Falco provides a way to overwrite the rules. In this example, on the left side, a rule applies to privileged containers, but you might have some custom privileged containers, and this could fire false positive alerts. In this case, we are changing two aspects: the first one is the macro, where you can define logic, and this macro references an allowlist for privileged containers. In two steps, we have reduced the false positive situations.

Falco also supports rule exceptions, and exceptions are defined as part of the rule. The fields property contains one or more fields that will extract a value from the events. The comps property contains comparison operators (you can use contains, equals, greater-than, and less-than operators) that align with the items in the fields property. Each item in the values list should align one-to-one with the corresponding field and comparison operator. After we tuned the rules by enabling exceptions for each rule, we finally got rid of a bunch of false positives, as you can already see here.
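Here is a rough sketch of that two-step tuning, assuming the Launch Privileged Container rule and the overridable user_privileged_containers macro from Falco's default ruleset; the list contents and the exception values are illustrative, not our production rules:

```yaml
# Allowlist of images that are expected to run privileged (illustrative).
- list: user_privileged_images
  items: [docker.io/falcosecurity/falco, docker.io/cilium/cilium]

# Step 1: override the macro the default rule already references,
# so the rule condition itself stays untouched.
- macro: user_privileged_containers
  condition: (container.image.repository in (user_privileged_images))

# Step 2: append an exception to the default rule instead of rewriting it.
# fields, comps, and each values row align one-to-one.
- rule: Launch Privileged Container
  exceptions:
    - name: allowed_namespaces
      fields: [k8s.ns.name]
      comps: [=]
      values:
        - [kube-system]
  append: true
```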
Okay, it's your turn, Emin. Thank you, Furkan. So, Furkan already mentioned the Kubernetes audit logs and how we write them into a file, and he already talked about Falco, which collects logs from the kernel using either the kernel module or the eBPF probe. So let's talk about how we collect the logs and how we send our logs to Falco.

Fluent Bit is a general-purpose log processor. It also has metrics collection capabilities. You can run Fluent Bit in any environment, such as VMs, embedded devices, or bare metal, and in our case we're going to run it in Kubernetes clusters. In our scenario, Fluent Bit is deployed as a DaemonSet, since we want to deploy it on each node and collect every log on the node. We decided to go with Fluent Bit because it is designed to run at high scale with low resource usage, and it is one of the most efficient solutions for containerized environments.

Let's see inside the box. This is the data pipeline, as it is called in the documentation. Let's start with input: input collects the data in many different ways. For example, you can use the kernel log plugin to collect kernel logs, and there are other plugins to collect other logs. Eventually, you'll have a lot of data in your hands, but you need to structure that data, so the parser comes into play here. You can convert your unstructured data into structured data; for instance, if your application has a unique log pattern, you can define that pattern in the parser and apply it to any log. Next, we have the filter, which fixes one of our challenges here: a filter basically enriches and modifies the log records. Buffer refers to the ability to store the records in memory, or optionally in the file system, until your output is delivered. You can send the data to multiple destinations, and the tag and match parameters are critical here. The output part is always at the end: you can send the data to a service or write to a file. These are implemented as plugins; for example, you can define Elasticsearch or Loki endpoints, as we'll see in the next slides.

Configuring Fluent Bit based on our needs: as you can see here, this is how we actually collect the Kubernetes API audit logs. We use the tail plugin here. The tag you define here is an internal setting that is used at a later stage by the router to decide which output the record must go through. It is worth mentioning that we use the JSON parser here, and you can always fine-tune the values of the buffer configuration.

In the next slide, the key part is to collect audits from different locations with an identity. In addition to the Kubernetes audit event itself, we add information about where the audit is coming from, like pod name and container name, and you can add other custom labels here. If you have millions of events in the cluster, you want to know who produced each log and where. Ours are set automatically by our CI/CD pipeline, since we store all the Fluent Bit configuration as code. Fortunately, there's a built-in filter plugin named modify that does this job on the collected data. So basically, after we give an identity to each event, the events can be pushed to any remote host, so that finally we can see the audit events and do threat analysis on them. A minimal configuration along these lines is sketched below.
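To make that concrete, here is a minimal sketch of such a Fluent Bit configuration in the classic INI-style format; the paths and hostnames are assumptions, and in our real setup the cluster label is templated in by the CI/CD pipeline rather than hard-coded:

```ini
[INPUT]
    # Tail the Kubernetes audit log the kube-apiserver writes to the
    # host; the tag routes these records through the rest of the pipeline.
    Name    tail
    Path    /var/log/kubernetes/audit/audit.log
    Parser  json
    Tag     kube-audit

[FILTER]
    # Give every event an identity so we know who produced it and where.
    Name    modify
    Match   kube-audit
    Add     cluster_name cluster-1

[OUTPUT]
    # One copy goes to Falco's embedded web server for rule matching
    # (8765 and /k8s-audit are Falco's defaults).
    Name    http
    Match   kube-audit
    Host    falco.falco.svc
    Port    8765
    URI     /k8s-audit
    Format  json

[OUTPUT]
    # Another copy goes to our index storage for threat analysis.
    Name    es
    Match   kube-audit
    Host    elasticsearch.logging.svc
    Port    9200
    Index   kube-audit
```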
Okay, why Fluent Bit with the log backend rather than the webhook backend? You might want to ask that question, and before you do, we want to answer it. As you can see, the log processor is very important, but it's also challenging to configure and make production-ready. By using Fluent Bit, we are able to modify the events and logs according to our business requirements. Furthermore, we can easily reconfigure our output plugins and start a rollout to the entire cluster by clicking a single button in our CI/CD pipeline. With Fluent Bit, we can send our logs anywhere; in this case, we are simply sending our audit logs to two different remotes: one is Falco, and the other one is our index storage.

In the next slides, we'll talk about monitoring: how we increased visibility and how we enabled observing the performance of our security pipeline. Although there are other options around, we'll talk about how we used Loki, due to its similarity to Prometheus and how easily it integrates with Grafana. It's very common, convenient, and cost-efficient to configure, and you can easily create a Grafana panel with it.

Okay, LogQL queries. First we need to organize the Falco logs. With Loki, we are going to create a log pipeline, which is executed left to right for each log line. Now our logs look like this in the Grafana UI, as you see here in the picture. It is really great to have the logs in JSON format; it's going to make our job very easy, as you'll see in the demo and in the next slides. Unfortunately, this output is not usable for panels yet; however, you can easily filter the logs here.

On the right, we are showing how you can organize your logs. On the left-hand side above, as you've seen, we are using the json parser. The json parser basically parses the JSON-formatted log and extracts labels from it, and those labels can be used in metric queries. A little bit below, as you see, the detected fields are just that: detected by Grafana. You can use them in the Grafana UI, but you cannot use them in a metric query. Let's work on the "log" part, whose value is also in JSON format. As you see here in the middle, we're going to use the line_format expression to select that part, and then we'll use the json parser again to extract the labels. On the right-hand side, you see all the labels, and those labels can be used in metric queries. In the next slides we'll apply a simple metric query.

LogQL also has metric queries that can be applied to log queries; you can create panels based on the number of entries per second or the number of bytes per second. So here is an example of how a metric query and a log query are combined: count_over_time shows the number of entries for each log stream within the given range, and in combination with a sum query, we can count log entries grouped by anything. As you see here in the example, we used the rule and the cluster names. The full pipeline looks roughly like the sketch below.
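Here is a sketch of that full pipeline as one query, assuming a fluent-bit job label and a nested log field like ours: each line is parsed as JSON, line_format pulls out the nested "log" payload, a second json stage extracts the Falco labels, and then we count entries per minute grouped by the extracted rule and cluster_name labels (LogQL has no comment syntax, so the explanation lives here):

```logql
sum by (rule, cluster_name) (
  count_over_time(
    {job="fluent-bit"}
      | json
      | line_format "{{.log}}"
      | json
    [1m]
  )
)
```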
Okay, now some time for Furkan to talk about challenges and some more metrics. Thanks, Emin. Actually, everything seems easy at first, right? But especially if you're working at high scale, there might be some challenges worth considering, because things might not work as expected for you. There can be various challenges in the process, such as eliminating false positives or writing the rules per team and per project, and making all of this fine-tuned is really time-consuming. We are generating a tremendous amount of audits every second, so in the meantime we needed to find a way to build an efficient, highly available indexing backend storage, since our data grows every second. We would also like to thank our teams for making the index backends highly available and for surviving such a tremendous amount of data.

What happens in seconds at Trendyol? We are monitoring and collecting audits from thousands of workloads. On normal days, approximately more than 300k audit events are generated each second; that's really huge, right? Per minute, more than 400k Kubernetes audit events are generated, and 8k of these logs are scanned by Falco via the Fluent Bit pipeline. This is just what happens every second in our Kubernetes clusters.

Okay, we created some dashboards to show the overall events; these dashboards update in real time, and we are able to select the time range. This one is specially designed for Falco audits, grouped by rule, namespace, and cluster names. We are also actively monitoring the users who gain access to running Kubernetes pods in the production environment. Most importantly, we also monitor the users who read secure objects, for example Secret resources, or our custom secret resource objects in Kubernetes that store sensitive data used by production services and applications. Here's our weekly report; we also get a weekly digest for every event we are actively watching. This one is a simple digest for pod-exec detection.

Okay, let's put this whole mindset into a demo. Emin, here you go. Thank you, Furkan. We'll put the pieces of the puzzle together here. Before we start, note that we actually recorded the demo for this pre-recorded video, as we didn't want to take too much time with it. To reproduce a similar experience, we created an environment on our workstation with minikube, and we also published the demo on GitHub; you can see the link below. Okay, can you please start the demo?

At the beginning of our demo, we are creating multi-node clusters: above, we have cluster one, and below, we have cluster two. When the clusters are ready, we edit the kube-apiservers to get the audit logs onto the host machine. Since the kube-apiserver is a static pod, it restarts itself once we save the changes. When that's done, let's start collecting the logs with Fluent Bit. Since we already went over the Helm values earlier, we will just look at them quickly here, modify the filter to add a custom cluster label, and install the Helm chart; you will see it's very easy to install all of them.

Next, we'll install Falco. We basically use the default values here; you don't need to configure much, as it configures itself with the default rules. The Falco pods are coming up. The last piece of the puzzle is Loki, for monitoring. We install it along with Grafana, which gets configured by itself since we are using the loki-stack Helm chart, but we are going to install it in only one of the clusters. In the second cluster, we define a Service and an Endpoints object; this way, we can send our logs to Loki easily, since we defined our output host as a service name and port number. A sketch of that Service and Endpoints pair follows.
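Here is a minimal sketch of that cross-cluster trick, assuming Loki's default port and an illustrative address for the first cluster's node; a selector-less Service plus a manual Endpoints object lets Fluent Bit in cluster two simply use "loki" as its output host:

```yaml
# In cluster two: a Service with no pod selector...
apiVersion: v1
kind: Service
metadata:
  name: loki
spec:
  ports:
    - port: 3100
      targetPort: 3100
---
# ...backed by a manual Endpoints object pointing at the Loki
# instance running in cluster one.
apiVersion: v1
kind: Endpoints
metadata:
  name: loki   # must match the Service name
subsets:
  - addresses:
      - ip: 192.168.49.2   # assumed reachable address of cluster one's Loki
    ports:
      - port: 3100
```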
Let's access Grafana and check whether the logs are coming. Of course they are; we already recorded this video. We're going to look at just Falco's logs, and as you see here, we're getting logs from both the first cluster and the second cluster. Now we're going to organize our logs. As you see here, we don't have many labels for now, but the log label is in JSON format, so we're going to select that part, apply the json parser again, and at the end of it we'll have a bunch of labels.

Okay, as you see here, we now have lots of labels. Finally, we're going to use the count_over_time metric query, which is a metric query for Loki. We have a result, but it's messy, since every log line is unique. So let's use the rule label to combine everything together; now everything makes sense. Then we change it to the cluster label, so we see the number of logs per cluster.

Thank you, Emin, I really enjoyed the demo. As you mentioned just before the demo, if you want to reproduce the same mindset, you can clone the repository here. We also added a bonus section just for you, where we drop some ideas such as a response engine, a policy engine, a SOC team, and on-call alerts.

The idea of enabling a response engine is that you can take actions against very specific events, for example handling container drift caused by kubectl exec or any other interactive requests inside the Kubernetes cluster. You can find really good examples on the Falco blog. With a policy engine, you can fine-tune your system by enabling custom, specific policies that prevent the malicious actions detected by the runtime at the very beginning, for example preventing the deployment of privileged containers. You can also direct your audit logs to a security operations center, so the SOC team can analyze and monitor them and take action against the events at all times. Furthermore, you can define specific alerts for emergency on-call; for example, getting or reading a Secret used by production services such as a payment API would be an emergency. And the list goes on like this. One way such alerts could be wired up is sketched below.

Thank you for joining and listening to us in this session. Thank you for your time.
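For instance, here is a minimal sketch of routing Falco events to on-call channels through the Falco Helm chart's falcosidekick integration; the webhook URL, hostnames, and priority thresholds are placeholders, not our production values:

```yaml
# values.yaml fragment for the Falco Helm chart (illustrative).
falcosidekick:
  enabled: true
  config:
    # Route warning-and-above events to a team Slack channel.
    slack:
      webhookurl: "https://hooks.slack.com/services/CHANGE_ME"
      minimumpriority: warning
    # Route critical events (e.g. a Secret read in a payment service)
    # to Alertmanager, which can page the on-call engineer.
    alertmanager:
      hostport: "http://alertmanager.monitoring.svc:9093"
      minimumpriority: critical
```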