Hello everyone. My name is Isu Kim, and I would like to talk about enhancing observability in a cloud-based educational Linux lab environment. Since this is my first time speaking in front of such amazing people, I'm very nervous right now, but I'll just try my best. Before I start, I would like to quickly introduce myself and my lab. As I mentioned, my name is Isu Kim and I'm from Daegu University in South Korea. I'm currently an undergraduate student in the Mobile OS Lab, and my research interests lie in cloud observability. To be more specific, I'm interested in systems monitoring and distributed tracing as well. A quick introduction to our lab: as the name indicates, we used to focus on mobile operating systems such as Android. However, in recent days we have been expanding our research to the following areas. So that was pretty much about me and our lab, so let's get started.

The first topic I would like to discuss is the need for a cloud for education. In this section, I'm going to talk about why we wanted to build a cloud for education ourselves. In our university, we have various coursework, from systems programming, operating systems, and network programming (which is essentially socket programming) to modifying and compiling the Android OS. All of those involve using Linux lab environments. To be more specific, we need lots of physical or virtual computers with Linux environments for students. To provide such a large number of computers, we have tried the methods listed below, and from the next slide I'm going to talk about the pros and cons of each approach. The first approach is using single-board computers such as Raspberry Pis. They are extremely good because we are handing out a physical Linux machine, so students get hands-on experience, from installation to hardware work, without restrictions.
However, in recent days, Raspberry Pi prices have been skyrocketing, and providing extra accessories to students is quite expensive. That was one problem. Keeping track of the physical devices is also quite demanding in terms of human effort and time. So we tried the next method, which is having students install virtual machines on their own computers. The main advantage of using virtual machines is that we do not have to provide anything, meaning there is no cost for us. The second advantage is that students can recover from failures or mistakes easily by using snapshots. However, since the virtual machines depend heavily on the students' computers or laptops, some students cannot afford that luxury. Some students can barely meet the hardware requirements, and that inequality is not ideal for our environment, since we are teaching all students how to use these tools.

The last thing we tried was a public cloud: we rented public cloud instances and let students use them. The good thing about a public cloud is that it provides easy management over lots of machines, for hundreds of students at once. The second thing is that it provides public IP addresses, which means students can host a service or perform network experiments without limitation. However, the problem is that it is a lot pricier than the other solutions, and since we keep paying over time without ever owning anything, it's not a good solution in the long term. So, with that background, we thought about building a cloud for education ourselves. As we set out to build our own cloud system, the basic requirements were the same as for clouds in production environments: extensibility, flexibility, scalability, security, and so on.
However, since we were building our cloud on top of our university's IT infrastructure, we had some more requirements than a production environment would. The first is that we needed a virtualized Linux environment per student. Since we were not able to give each student a physical server or machine, all students have to share a limited amount of hardware. As you can see in the picture — my left, your right — we have just one rack with 10 servers, so that's quite limited. Because those hardware resources are shared, we also need isolation between students. And since we have to manage lots of students, 400-plus at once — registering and deleting all those accounts — we need simple management tools.

The second thing is that we needed controlled entry points into the system. Our university has been assigned only a limited number of public IP addresses, which means we were unable to give each student a public IP for access. Besides that, we had an extra limitation from school policy for security reasons, which specifically meant we were not allowed to open arbitrary ports to students. For example, if we wanted to open a specific port like 25565/TCP, we had to get permission from our IT department. So we required limited and controlled entry points into the cloud system.

Last but not least, we were required to have enhanced observability within the cloud system. Since our resource pool was not big enough, we had to monitor the shared resource utilization closely. This also required us to monitor each student's detailed activities within their virtualized environment. I'm going to discuss this in later sections, because it is the most important part. With those requirements, we created an overview of our cloud infrastructure and architecture.
It consists of an SSH gateway and HTTPS ingresses for student access, management tools for admins to create and delete the virtual machines or containers, and an observability tool for each component, including the physical infrastructure. From now on, I'm going to talk about the first requirement, which was providing a virtualized environment per student. As we were running on self-hosted, bare-metal infrastructure, the first option that came to mind was OpenStack. OpenStack is great, but it has lots of features, and those luxury features were over-spec for our requirements. Supporting them consumes a significant amount of resources, which was one of our concerns. To be more specific, its complex network topologies were quite challenging for us to properly set up and manage. And since OpenStack workloads run inside virtual machines, the observability choices were quite limited as well.

Meanwhile, Kubernetes offered basic and simple building blocks such as pods, deployments, and persistent volumes. It was especially helpful that the whole cluster was very easy to set up and manage, and compared to OpenStack, Kubernetes tends to have lightweight resource usage as well. And since Kubernetes runs workloads in containers, it offers a wide range of observability choices. So for our requirements, Kubernetes was better suited for the job. With those Kubernetes components, as you can see in the picture, we were able to provide a virtualized Linux environment that runs SSH, a Visual Studio Code server, and a Jupyter Notebook server as a service. Also, in Kubernetes, pods can migrate across servers, so to preserve user data inside those containers, such as a user's home directory, we utilized a persistent volume attached to an NFS storage backend.
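To make those building blocks concrete, here is a minimal sketch of what one student's environment could look like as Kubernetes resources. All names, the image, the ports, and the storage class are assumptions for illustration, not our actual manifests:

```yaml
# Hypothetical per-student environment: one Deployment plus a
# PersistentVolumeClaim backed by NFS for the home directory.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: home-alice
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: nfs-client        # assumes an NFS provisioner exists
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: lab-alice
spec:
  replicas: 1
  selector:
    matchLabels: { app: lab-alice }
  template:
    metadata:
      labels: { app: lab-alice }
    spec:
      containers:
      - name: lab
        image: ubuntu:22.04           # placeholder for a lab image with sshd etc.
        ports:
        - containerPort: 22           # SSH
        - containerPort: 8080         # VS Code server in the browser
        - containerPort: 8888         # Jupyter Notebook
        volumeMounts:
        - name: home
          mountPath: /home/alice      # survives pod migration via the PVC
      volumes:
      - name: home
        persistentVolumeClaim:
          claimName: home-alice
```

Because the home directory lives on the claim rather than in the container filesystem, a pod that gets rescheduled onto another node re-attaches the same data.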
Lastly, we created a simple tool that can deploy all the required Kubernetes resources using the Kubernetes API. With this, each student was able to have an SSH environment and some online IDE environments.

The second part is controlled entry points. This part is quite straightforward. The first thing we had to offer was a single SSH gateway that forwards SSH connections to the virtualized environments by username. For example, as seen in the picture, Alice and Bob both connect to the same entry point, dongle.cloud, but with different usernames, and their SSH connections have to be sent to different services within the Kubernetes cluster. For this, we set up a proxy and a load balancer for exposing IP addresses inside our private network, which is our university network. Also, since we got lots of SSH login attempts that were hacking attempts, we required a firewall service as well, which blocks malicious access to the cloud. That was the first part.

The second part is an HTTPS ingress. Since we were going to offer a Visual Studio Code service, which is like GitHub's online editor, and a Jupyter Notebook service, just like Google Colab, we had to have an HTTPS ingress that exposes our services to our students. However, this part is quite conventional and is used in most production environments, so I'm not going to go too deep into it. In short, we were using Nginx reverse proxies for our ingresses. That was pretty much our overall architecture for the environment, and by using those two components, we were able to build a cloud for education. Let's watch a simple demonstration video. As you can see on the left and on the right, those are different usernames: the first one ends with JP00, and the second ends with 01.
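The gateway's forwarding rule can be sketched as a tiny routing function: the SSH username alone selects which in-cluster service receives the connection. The naming scheme and DNS suffix below are hypothetical, just to illustrate the idea — the real gateway is a full SSH proxy, not this function:

```python
# Hypothetical sketch of username-based SSH routing: every student gets a
# Service, and the username maps deterministically to its cluster DNS name.
import re

def route_ssh(username: str) -> str:
    """Map an SSH username like 'alice-jp00' to an in-cluster service address."""
    # Reject anything that is not a plain lowercase account name; the gateway
    # is an entry point, so suspicious names are dropped before forwarding.
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]*", username):
        raise ValueError(f"rejecting suspicious username: {username!r}")
    # Assumed naming scheme: one Service per student in a 'students' namespace.
    return f"lab-{username}.students.svc.cluster.local:22"

print(route_ssh("alice-jp00"))  # lab-alice-jp00.students.svc.cluster.local:22
print(route_ssh("bob-jp01"))    # lab-bob-jp01.students.svc.cluster.local:22
```

Both users dial the same public entry point; only the username decides where the connection lands inside the cluster.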
They will both connect to the cloud, and they will first be forced to change their passwords. That part is quite long, so I'll just skip it. The second user logs in as well, and we offer online web services too: Visual Studio Code, which works like GitHub's online editor — that was the first one — and the second user will use the Jupyter Notebook online service, just like Google Colab. That was pretty much all of the features within our cloud environment.

So the third requirement was enhanced observability, which is a big topic for us. From now on, I'm going to explain why observability within our cloud environment was the most important thing. The first question is: why do we need observability within our cloud environment? The first reason is that we are using limited resources. This is quite straightforward, because production environments also do such monitoring to optimize their hardware and other resources. The second reason, however, is quite characteristic of our setting: since we are mainly targeting our cloud services at students, and students are at a learning stage, they are pretty much amateur administrators. So I'll provide some motivating examples and share some cases that actually happened in our production environment.

The first scenario is students burning up operating system resources. For example, student A was spawning too many processes on the server. We teach multi-processing and multi-threading in classes such as operating systems and network programming. However, students are not quite familiar with the concepts of processes and threads, which they have to program, so they have difficulty managing child processes and threads in particular. These problems eventually lead to students forgetting to kill child processes and creating a lot of zombie processes.
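The zombie problem described above is easy to reproduce. This Linux-only sketch forks a child that exits immediately; until the parent reaps it with waitpid(), the kernel keeps the child's entry in the process table in state 'Z':

```python
# Linux-only demonstration of a zombie process: the child exits right away,
# but its process-table entry lingers until the parent calls waitpid().
import os
import time

pid = os.fork()
if pid == 0:
    os._exit(0)                      # child terminates immediately

time.sleep(0.2)                      # give the child time to exit
with open(f"/proc/{pid}/stat") as f:
    state = f.read().split()[2]      # third field of /proc/<pid>/stat is state
print(f"child {pid} state before reaping: {state}")   # 'Z' means zombie

os.waitpid(pid, 0)                   # reaping removes the zombie entry
```

A student program that forks in a loop and never waits leaves one such entry per child, which is exactly how the process table fills up.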
Due to this, operating system resources get used up. Students use system calls such as fork or clone to create a process or a thread, and this affects the host, since containers share the kernel with the host. Also, only a limited number of processes can be spawned on a machine; that is just one example of the first problem. The picture on the right was taken during our operating systems course, when lots of students failed to manage their child processes effectively: we had hundreds of processes taking almost 97% of our CPU.

That was the first scenario; the second is students burning up hardware resources. There was a student B who tried to find a hash collision by writing random hashes into a file until one collided. He was quite unlucky, and he had to write over 90 gigabytes to find a collision. There are two problems within this scenario. The first is CPU usage: for example, just running an infinite loop without any system calls would take lots of CPU time as well. Combined with the first scenario, imagine what happens if unmanaged multi-threading and multi-processing are connected to an infinite loop. Yes, this can be easily handled using Kubernetes features such as limits and requests. However, the next problem is quite difficult to deal with: I/O resources. Since we were using networked storage, an NFS server, the bandwidth was limited and lots of traffic was shared, meaning the traffic between the NFS server and the physical server running the pod had to share the capacity of the switch as well.
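The limits-and-requests mitigation mentioned above looks roughly like this in a pod spec. The numbers are illustrative, not our actual quotas:

```yaml
# Sketch of capping a student container with Kubernetes requests/limits.
# This bounds CPU and memory, but not the I/O patterns discussed next.
apiVersion: v1
kind: Pod
metadata:
  name: lab-alice
spec:
  containers:
  - name: lab
    image: ubuntu:22.04     # placeholder lab image
    resources:
      requests:
        cpu: "250m"         # guaranteed share used for scheduling
        memory: "512Mi"
      limits:
        cpu: "1"            # hard ceiling: one core, throttled beyond this
        memory: "1Gi"       # exceeding this gets the container OOM-killed
```

An infinite loop inside such a container is throttled at one core instead of starving the node; process-count explosions need a separate mechanism (e.g. a kubelet-level pod PID limit), since CPU/memory limits alone do not cap them.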
Also, we can put a limit on the size that can be written to a persistent volume using the Kubernetes persistent volume spec. However, that does not stop users from taking up lots of I/O resources: someone could just write and delete the same file over and over. That counts as just one file and is not controlled at all.

Security issues were another scenario. The first is that students usually do not know what they are executing; they just copy and paste random shell commands from the internet without verifying them. This can lead to security problems — to be specific, we had a cryptojacking incident, where somebody got into the system and secretly mined cryptocurrency, sending it to their wallet. The second thing is that we get lots of malicious SSH login attempts per day. Due to those three scenarios, we require enhanced observability within our cloud system.

So now we know that we need enhanced observability; the next step is to define the requirements for the observability solution. These narrow down to three main topics. The first is real-time resource usage monitoring. This is a quite conventional requirement for observability: we need cluster-level usage monitoring views, such as CPU, memory, and I/O usage per container. The second is detailed insights within each virtualized environment: we need to monitor users' activities inside containers, including system calls and events on the hosts. This was considered one of the biggest requirements we had. The third is minimized performance overhead. Since we are running low on resources and do not have a dedicated platform for observability, those resources have to be shared with the virtualized environments themselves.
This applies both to collecting system information in the cloud and to visualizing and querying that information. With those three requirements, we looked for possible options and combinations for observability. To be more specific, we required two-level observability using two different combinations of system observability tools: first, an overview of resource usage, and second, detailed insights within each component.

For real-time metric-based monitoring, we deployed the Prometheus and Grafana combination. As lots of people know, this is used a lot in Kubernetes clusters, and since Prometheus is said to be optimized for high-frequency metric collection, it was especially suitable for real-time CPU, memory, and I/O usage across the cluster. On the other hand, for detailed insights within each component, we selected eBPF and the ELK stack, or Elastic Stack. Since the Elastic Stack is optimized for log analysis and troubleshooting, and offers machine learning features for anomaly detection, this was quite a great combination. Also, in terms of performance, the ELK stack is said to be optimized for large volumes of log data. With those two combinations — two-level observability — we were able to achieve a higher level of observability.

The first combination, Prometheus and Grafana, was used for retrieving multiple metrics, such as hardware resource usage on each Kubernetes worker, and for collecting metrics for each container via the kube API. With those two collectors, Prometheus polls the metrics and stores them in a centralized time-series database.
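For illustration, per-container views like these are commonly built from the cAdvisor metrics the kubelet exposes; queries along these lines (not necessarily our exact dashboards) give per-pod CPU and memory:

```promql
# Per-pod CPU usage in cores, averaged over the last 5 minutes
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))

# Per-pod working-set memory in bytes
sum by (namespace, pod) (container_memory_working_set_bytes)
```

Grafana panels and Prometheus alert rules are then just expressions like these with thresholds attached.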
We also set up an alert system, which fires when a specified amount of cluster resources has been used, or when some pod is using up too many resources, and so on. Also, using Grafana, we can visualize that real-time, time-series data, like the graph I'm going to show on a later slide. As you can see on this slide, this was our two-hour programming exam, which took place a few months ago. During that exam session, we had over 50 students actively accessing the whole service. In this scenario, Prometheus and Grafana proved quite great at monitoring and observing real-time resource usage.

The next thing was eBPF and the ELK stack. Before I dive into it, I would like to take a little time to introduce eBPF. eBPF stands for extended Berkeley Packet Filter, and I'm guessing there are some people in this room who have heard of eBPF or have actual experience using it. It basically extends the capabilities of the Linux kernel at runtime, which means you do not have to modify the kernel source code or write Linux kernel modules at all. These days, eBPF is used a lot in production environments, especially for observability collection, tracing and troubleshooting applications, debugging, and enforcing security on a container runtime.

For us to use eBPF in the Kubernetes cluster, we had to have multiple eBPF agents running on each node, collecting in-system logs and sending them to a centralized database. This was easily solved by using Tetragon. Tetragon is an eBPF-based security observability and runtime enforcement tool, and it can collect security events with Kubernetes-aware context, meaning it is capable of identifying a pod and its namespace when eBPF observes a process running on a host.
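As a rough illustration of how Tetragon is told what to watch, a TracingPolicy resource names the kernel functions to hook. This sketch follows the general shape of the upstream examples rather than our actual policy; it attaches a kprobe to the kernel's TCP connect path:

```yaml
# Illustrative Tetragon TracingPolicy: emit a Kubernetes-aware event
# (pod, namespace, binary) whenever a workload opens a TCP connection.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: monitor-tcp-connect
spec:
  kprobes:
  - call: "tcp_connect"   # kernel function, not a syscall
    syscall: false
    args:
    - index: 0
      type: "sock"        # decode the socket to get source/destination
```

Applying a policy like this is what produces the connection events shown in the demo below.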
Also, Tetragon exports those logs in a well-known format (JSON), and using that format we were able to further export the data into our ELK stack or elsewhere. That was a brief introduction to eBPF, so I'll dive into the second part. With eBPF via Tetragon, combined with ELK, we can collect detailed system logs with Tetragon, store them in Elasticsearch, and visualize them with Kibana — which is just the ELK stack. As shown in the picture below, Tetragon emits system information as JSON, and we had to bridge that part; after that, everything was quite straightforward.

I'm going to show you a small demo. There were three scenarios we were trying to demonstrate, but unfortunately eBPF and Tetragon have limited support for monitoring activities in user space, and we were not quite prepared for that, so I'm just going to demonstrate the first and the third. The right side of the terminal is a Tetragon agent — a Tetragon monitor that collects live data from the containers. As you can see, I made a curl request to a specific IP, and as shown on the right, we can see that it is actually making a TCP connection and receiving traffic. For another example, let's create a random file using dd: as that command writes the file to disk, in the Tetragon output we can see that it is actually connecting to the NFS server and writing those files. And since it was quite difficult to read this information in a plain terminal, we exported it to Elasticsearch. As shown in the video — that was too fast — we can see which binaries are being executed, with their arguments and the types of system calls.
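The bridging step can be sketched like this: read Tetragon's line-delimited JSON events and flatten each one into a small document suitable for indexing into Elasticsearch. The event shape below is simplified from Tetragon's real output, and the sample is made up:

```python
# Simplified sketch of bridging Tetragon's JSON event stream into documents
# an Elasticsearch indexer could ingest. Real Tetragon events carry many
# more fields; only a few representative ones are extracted here.
import json

def flatten_event(line):
    """Flatten one Tetragon-style JSON event line, or return None to skip it."""
    event = json.loads(line)
    for kind in ("process_exec", "process_exit", "process_kprobe"):
        if kind in event:
            proc = event[kind].get("process", {})
            pod = proc.get("pod", {})
            return {
                "event_type": kind,
                "binary": proc.get("binary"),
                "arguments": proc.get("arguments"),
                "pod": pod.get("name"),
                "namespace": pod.get("namespace"),
            }
    return None  # event kinds we do not index

# Made-up sample event in the simplified shape above
sample = ('{"process_exec": {"process": {"binary": "/usr/bin/curl", '
          '"arguments": "http://10.0.0.1", '
          '"pod": {"name": "lab-alice", "namespace": "students"}}}}')
print(flatten_event(sample))
```

Each flattened document keeps the Kubernetes-aware context (pod, namespace) next to the process details, which is what makes the Kibana views per-student.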
As you can see on the right, there is sys_write, which is the write system call, along with the context it was requested in. We can also see network information, since networking basically goes through system calls: there are source and destination IP addresses for a specific container. That was the first part. For the second part, as I mentioned earlier, we have lots of processes that are not managed properly, so I'm going to drop a fork bomb using a Bash command, and it will trigger lots of system calls, as shown on the right side. It creates lots and lots of processes — that is another example of monitoring those system calls.

The third thing was failed SSH logins, as we get lots of SSH login attempts per day that are basically trying to sneak into our system. The terminal on the left will log in with my administrator account and run tail -f to observe the SSH logs. On the right, I'm going to connect over SSH with wrong credentials; I'll just do it three times, and that's it. After that, we can visualize it using the ELK stack: as seen here, we can see that the OSS Japan demo account's connection attempts were not properly authenticated. That was the second part. So, using those two combinations, we were able to achieve this level of observability.

For the conclusion, I'm going to talk about the small achievements we have made and our goals for the future. We were able to construct a cloud-based lab using Kubernetes, and we provided access to a virtualized Linux environment per user through controlled entry points into the system, as discussed before. We also built a management tool that can deploy lots of users at once, and we achieved a high level of observability, as shown before. Next, the further challenges ahead.
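The failed-login view boils down to aggregating sshd's standard "Failed password" log lines. A small sketch, with made-up sample lines (real auth.log entries also carry a timestamp and hostname prefix):

```python
# Sketch of the aggregation behind the failed-SSH-login dashboard: count
# "Failed password" sshd log lines per source IP address.
import re
from collections import Counter

FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")

def count_failures(lines):
    """Return a Counter of failed-login attempts keyed by source IP."""
    per_ip = Counter()
    for line in lines:
        m = FAILED.search(line)
        if m:
            _user, ip = m.groups()
            per_ip[ip] += 1
    return per_ip

logs = [
    "sshd[101]: Failed password for invalid user admin from 203.0.113.7 port 4022 ssh2",
    "sshd[102]: Failed password for root from 203.0.113.7 port 4023 ssh2",
    "sshd[103]: Accepted password for alice from 198.51.100.2 port 5000 ssh2",
]
print(count_failures(logs))   # Counter({'203.0.113.7': 2})
```

In the real pipeline the same counting happens inside the ELK stack over the exported logs; this just shows the rule being applied.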
However, there were lots of extra challenges. The first one is storage. Since we keep our user data on network storage, which is basically a NAS with hard disk drives, the performance degradation is quite high. The graph at the top right is for your reference: we ran a simple fio benchmark, writing and reading random files. The blue bars are the native storage our servers run on, and the red bars are writes from inside a pod mounted on the NFS. As you can see, the performance degradation is quite high. Also, as mentioned before, since the network traffic is shared inside a single cluster, the bandwidth we use for NFS eats into the traffic that should be shared among the Kubernetes nodes as well.

We also have too many duplicated files. As I mentioned earlier, we teach courses on the Android operating system, where we modify and compile it, so the full source code has to be available. As you can see in the graph below, we have lots of Android files and a smaller amount of actual user data. We have to deal with those duplicated files, because students modify only a small part and otherwise share lots of identical files. That was one problem.

The second problem is that containers share the kernel, so we were unable to support kernel programming or sysctl commands for each student. And since the kernel is shared between the container and the host, using hardware requires privileged containers and privileged pods, which are not ideal in a Kubernetes environment. So we are currently working on a VM-based Kubernetes extension, which is called Quake Cloud, and we are also working on enhancing observability within virtual machines. That was a brief look at the further challenges.
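One direction for the duplicate-file problem is content-based deduplication. As a toy sketch — an illustration of the idea, not what our storage backend does today — identical files can be grouped by hash, so unmodified Android source files shared by many students could be stored once:

```python
# Toy content-based deduplication: group files under a directory tree by
# SHA-256 digest; any group with more than one path is a set of exact copies.
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Map digest -> list of paths, keeping only digests with 2+ copies."""
    by_digest = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Hash in 1 MiB chunks to avoid loading big files into memory
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

A production dedup layer would hash lazily (group by size first) and replace copies with references; filesystems with native dedup or copy-on-write snapshots achieve the same effect below the application.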
For future work, we are of course going to keep improving the cloud system and mitigating those challenges, along with the VM-based Kubernetes extension. For security, we are trying to create a platform for enhanced security: right now we can only observe, but we cannot take action, so we are trying to perform security enforcement that can be triggered on the fly. We are trying to build real-time, kernel-level container security enforcement, for example using LSM BPF. We are also working on inter-container causality with distributed tracing — meaning that one container triggered another container to take an action, which in turn required another container to do some other job — as well as anomaly detection and intrusion detection using machine learning. Besides those, since we were focused on enhancing observability, we gave privacy a lower priority, including the confidential data students might be holding inside their containers. So we are trying to provide enhanced observability while at the same time preserving privacy. That ends the presentation. I was a bit nervous — thanks, everyone, for listening. I would gladly take questions here or outside this hall as well. Thank you so much.

First of all, I think it was a great presentation — I couldn't tell you were nervous at all. You use observability in more of a reactive approach, and I think that's great. Are you looking to use it as a feedback loop to drive actions, to be more proactive?

Are you asking whether we are trying to build a platform that feeds back into the system and determines what is wrong and right?

For example, you mentioned resource utilization, right?

Yes.

Technically it's a metric. You could use that metric to trigger some kind of automation to either put resource limits in place or kick the pod out of the cluster. Are you looking into that?
For now we are just doing a rule-based approach, but in the future we are trying to do what you describe — that kind of feedback loop. Based on past data, we are trying to improve those rule-based actions, along the lines you mentioned.

Hello, thanks for sharing. You mentioned you use eBPF to minimize the performance impact. Have you done any benchmarks on that impact? How much is it?

Yes, we have internally performed multiple tests using eBPF. I cannot remember the exact numbers, but by using LSM BPF we saw around 9 to 11 percent degradation — we had optimized it poorly, so that is just one number for reference — and by using kprobes or tracepoints we can get around or below that number as well. So that is a level of performance overhead that we can more or less ignore, and it is bearable for us. Thank you. Thank you everyone again, and have a nice day.