Welcome to our KubeCon North America 2021 presentation. It is our pleasure to meet you virtually at this event. In this talk, we will walk you through our serverless architecture for anonymizing personal sensitive data to preserve data privacy.

Here is the agenda for today. First, let's look at the global legislation and landscape on data protection and privacy preservation: what their impact is on our industry and how we should address these challenges. Next, we present the solution architecture through our target use case, which is to anonymize personal and sensitive data from images in a hosted storage environment. The rest of the talk will deep-dive into each of the building blocks. We will show you how we use, adapt, and fine-tune these technologies to meet our scalability and security goals.

Data protection and privacy is an increasingly important issue for global data controllers. Care must be taken to process, exchange, or store sensitive personally identifiable information and to honor privacy preferences. Here we have a diagram from the United Nations dated 2020: about 66% of the countries in the world have data protection laws. For reference, just a couple of years before that, the number was about 50%, so we anticipate that more and more countries will have data protection laws in the future.

The IT industry addresses such requirements using two techniques: pseudonymization and anonymization. The difference is whether the transformed data can be reverted to its original form through another process. For example, data encryption and decryption is a form of pseudonymization. Once encrypted, the data can still be used in computation, transmission, and persistence. In fact, data protection laws such as GDPR recommend pseudonymization because it protects data privacy. But data encryption and decryption often incurs computational and operational, sometimes even financial, overhead, so it is not a solution for all use cases. In many use cases, especially machine learning, the personal privacy information is not used at all, so it can be completely removed from the original data. This is the process called data anonymization. According to GDPR Recital 26, anonymized data does not fall within the scope of the law. In the rest of the presentation, our use case is to anonymize data in hosted storage environments.

Here is the high-level overview of our solution architecture. As you can see, we use mostly CNCF projects, namely Rook and KEDA. Rook is the infrastructure orchestrator. We use Rook to create the Ceph cluster; it creates a Ceph RGW bucket and also enables the bucket notification mechanism, so activities within buckets are pushed to the message queue endpoint. We use KEDA as the serverless framework for our execution engine. KEDA, or Kubernetes Event-Driven Autoscaling, is lightweight and service-provider oriented, which makes it very suitable for our use case. The KEDA serverless functions use machine learning algorithms to detect objects, mostly personal and sensitive information, within the image. As there are so many detection mechanisms around, without loss of generality we use a Haar feature classifier from OpenCV and a deep neural network from TensorFlow.

The whole workflow goes like this. When images are uploaded to the Ceph RGW buckets, the activities are pushed into the message queues. The serverless function that is subscribed to the queues receives the events and downloads the images from the bucket. Using object detection algorithms, it defines the regions of interest and anonymizes the personal and sensitive information. The anonymized images then replace the original images in the bucket; a minimal handler skeleton for this flow is sketched right after this overview.

The whole workflow runs on a MicroShift cluster. MicroShift is a lightweight implementation of OpenShift and Kubernetes. It is very fast to start, usually in seconds versus minutes for regular Kubernetes and OpenShift clusters, and it consumes very few resources: it can run on a small single-node cluster with only two CPU cores and two gigabytes of memory. We will spend more time on these components in the deep-dive sections.
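To make that flow concrete, here is a minimal sketch of such a handler. It assumes the standard S3 event record layout; the endpoint, credentials, and the anonymize() stub are invented for illustration. Since Ceph RGW is S3-compatible, a stock AWS SDK client can talk to it:

```python
import boto3  # Ceph RGW speaks the S3 protocol, so the stock AWS SDK works

# Hypothetical endpoint and credentials for the Rook-managed RGW service.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rook-ceph-rgw-my-store.rook-ceph.svc",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

def anonymize(path):
    # Placeholder: the actual detection and blurring is sketched near
    # the end of this talk.
    pass

def handle(record):
    """Called by the serverless framework for each bucket notification."""
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    local = "/tmp/" + key.replace("/", "_")
    s3.download_file(bucket, key, local)  # fetch the uploaded image
    anonymize(local)                      # blur faces, license plates, etc.
    s3.upload_file(local, bucket, key)    # replace the original in the bucket
```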
So now that we've seen a use case for bucket notifications, let's look deeper into the details. Bucket notifications are built from several building blocks. The first one is the topic. The topic tells us where to send the notification to, and it can aggregate different notifications from different sources to a specific endpoint. The endpoint could be a Kafka message bus, or more precisely a topic inside a Kafka message bus, a topic inside a RabbitMQ server, or an HTTP server, and we're going to add more: AWS SNS, AWS Lambda, AMQP 1.0, and, using some scripting and customization, you can even add a NATS message broker as an endpoint for notifications.

The second building block is the notification itself, which is what ties together the topic with a bucket. It ties them in a way that tells us which notifications we're interested in and which kinds of events: creations, deletions, or both. We can also add filters that determine which objects are or are not sent as notifications to the endpoint. The last building block is the event itself, which is the information we send to the topic based on the configuration in the notification.

So if you feel like setting up this system with the bucket notifications and the endpoint is complex, you're probably right. But here we have something that should help us, and this is Rook. Rook is a storage operator for Kubernetes. It's a graduated CNCF project that provides different storage providers for Kubernetes, covering file, block, and object storage; we're going to focus on the object storage. Now, what do you get with Rook? You get two for the price of one. First of all, your storage solution is actually hosted inside Kubernetes, with all the value of orchestration and management that comes with it. Second, it gives you storage for your Kubernetes applications.

Now I'm going to show how, in five simple steps, using five simple YAMLs, you can actually build the entire system we described before. The first thing you have to do is define the Ceph object store. When you define the Ceph object store YAML, you give it a couple of parameters saying what the size of the replica set is and how replication is done, the different pools that you have, and what your object gateway is. Once you have that set up, you need to define the storage class. The storage class is the type of storage that you have. You can have multiple storage classes in your system, but each bucket that you're using has to belong to a specific storage class. The storage class has parameters like the reclaim policy and others that determine how storage should be handled for the buckets belonging to it. A sketch of those first two YAMLs is shown below.
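Here is roughly what those first two YAMLs could look like; the names, pool sizes, and namespaces are illustrative values, following the Rook documentation's layout:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3          # size of the replica set / how replication is done
  dataPool:
    replicated:
      size: 3
  gateway:             # the object gateway (RGW)
    port: 80
    instances: 1
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-bucket
provisioner: rook-ceph.ceph.rook.io/bucket
reclaimPolicy: Delete   # the reclaim policy mentioned above
parameters:
  objectStoreName: my-store
  objectStoreNamespace: rook-ceph
```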
Now that we have the object store and the storage class, we can define an object bucket claim. An object bucket claim is what eventually generates the buckets. When you define the object bucket claim, you can give it a couple of parameters. A very important one is the storage class that it references. You can give it a name or a name generator. An extra bit of information that we can give here, and this is related to what we are trying to achieve, is that you can define as labels the notifications that we want on this bucket. So the only extra piece of information needed at the bucket level to enable bucket notifications is to reference the notification via a label. Labels are very useful here; as a side note, whenever you want to, say, look for all the buckets that use a specific notification, searches become much, much easier. This is also not very intrusive to the actual YAML of the object bucket claim.

So now that we have the object bucket claim, we actually need to define the topics and notifications. As you probably know, everything here is declarative, so you can define the bucket before the notifications or after; it doesn't matter. Everything will eventually fall into place, and all the information will be available to our operator. So the next step we take here is to define the topic. As we said before, the topic is the endpoint. Here we define the RabbitMQ server as the endpoint, and we give a couple more pieces of information. Some of them are related to the actual topic, like the exchange we're going to use or the acknowledgement level we're expecting. The others are more generic, like whether we're using persistent notifications or whether we can pass opaque data, and so on and so forth.

Now that we have the topic, the last bit that ties everything together is the actual notification. The notification just declares the association between the topic we just defined and the kinds of filters and events that are interesting for our notifications. The association with the bucket doesn't exist in the YAML of the notification; instead, it lives in the label that is part of the object bucket claim, as shown above. In the notification we have to reference the topic and give some extra information about what kinds of objects we're interested in. We can filter on the keys of the objects, on metadata, or on tags; there are many, many options here, and we also say what kinds of events we're interested in. Put together, these YAMLs look roughly like the sketch below.
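A sketch of the remaining three YAMLs, again with illustrative names and values following the Rook documentation's layout:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephBucketTopic
metadata:
  name: my-topic
  namespace: rook-ceph
spec:
  objectStoreName: my-store
  objectStoreNamespace: rook-ceph
  persistent: true            # persistent notifications, as discussed above
  endpoint:
    amqp:
      uri: amqp://my-rabbitmq-service:5672
      exchange: my-exchange   # the exchange we're going to use
      ackLevel: broker        # the acknowledgement level we expect
---
apiVersion: ceph.rook.io/v1
kind: CephBucketNotification
metadata:
  name: my-notification
  namespace: rook-ceph
spec:
  topic: my-topic             # references the topic, not the bucket
  events:
    - s3:ObjectCreated:*      # only creations in this example
  filter:
    keyFilters:
      - name: suffix
        value: .jpg           # filter on object keys
---
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: images
  labels:
    # the notification is attached to the bucket via this label
    bucket-notification-my-notification: my-notification
spec:
  generateBucketName: images  # a name generator, as mentioned above
  storageClassName: rook-ceph-bucket
```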
Now let's see what it actually means to send a notification to the serverless function. The notification, based on a configuration similar to what we described before, is sent outside of the Ceph cluster, in this case into Kafka, the Knative bus. We have a serverless function that is subscribed to the topic we published to in Kafka, and it gets the notification on the other side. This is an end-to-end solution. It is reliable, because Ceph stores the notification in RADOS until it gets the ack from Kafka, and Kafka stores the notification until it is read by the serverless function. The solution is also scalable, because Kafka is scalable, Ceph is scalable, and the serverless framework spins up more functions as more messages come in on the topic they're interested in. So this is the whole solution, which is reliable and scalable.

If we have a solution which is both reliable and scalable, why isn't that enough? Why do we need something else? Well, this works very well if the processing done by the serverless function is simple. Sometimes this is not the case. Those serverless functions can have processing that takes a very long time and may fail midway. They can have processing that involves multiple steps across multiple functions that invoke one another, and each one of them can fail in the middle, go stale, crash, or become unresponsive. For those complex cases, we need something else. We need the ability to know that some piece of work is never going to be processed, even if it was fetched from the queue, and we need a way to make sure that this piece of work is processed later on by some other consumer that is able to process it. And ideally we want to do that without introducing a whole new layer of complexity into the system, such as another message broker or another system that allows that. Given those requirements, for those cases we would like to introduce a native message queue API, where the actual message queue is embedded inside Ceph.

So let's try to summarize the two different modes in which we can operate. In the existing push mode, the messages are sent to an external message broker. The serverless function programming model is based on the queue that stores the messages. Scaling is based on the utilization of the serverless function. The reliability of the producer, which is Ceph, comes from the fact that we store everything in RADOS until it is acked by the message broker, and the reliability of the consumer is based on whatever reliability the message broker gives it. In the new pull mode, the notifications are stored inside RADOS itself, not in an external system. Whoever reads the messages is driven by an autoscaling trigger, which looks at the approximate size of the queue and spins up as many consumers as needed. The reliability of the producer is just RADOS, and the consumer's reliability is based on a mechanism where, if somebody doesn't ack the end of their processing within some given timeout, we make those notifications available for another consumer to consume.

We've chosen to model our message queue on the AWS SQS API. The reason is that it fits very well with our requirements. First of all, it has an at-least-once guarantee: whenever a consumer reads a notification from the queue, the notification becomes invisible, so other consumers can keep on consuming other notifications. This guarantees the scalability of the solution. However, as we said, processing could be complex, and therefore we have a visibility timeout. After this timeout expires, if we didn't get an ack from the consumer telling us that the processing actually ended for this notification, the notification becomes visible again, so any other consumer can pick it up; this provides reliability in the case of client failures. We also have a retention period, which means that stale or obsolete notifications are deleted automatically from the queue, and we make sure they are not consumed by any client. Another reason for selecting the AWS SQS API is that anyone who is already a user of Ceph probably has some kind of client SDK that supports the AWS APIs, because we are AWS compatible for S3 and other things. So by using this API, we're making sure the solution is simple to integrate and doesn't require any special additional SDKs or clients.
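As an illustration of that consumer model, here is a minimal pull-mode loop written against the standard SQS client. This is a sketch under the assumption that an SQS-compatible client can be pointed at the Ceph-embedded queue; the endpoint URL, queue URL, and process() stub are invented:

```python
import boto3

def process(body):
    # Placeholder for the long-running image work described above.
    print("processing", body)

# Hypothetical SQS-style client aimed at the Ceph-embedded queue.
sqs = boto3.client(
    "sqs",
    endpoint_url="http://rook-ceph-rgw-my-store.rook-ceph.svc",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
queue_url = "http://rook-ceph-rgw-my-store.rook-ceph.svc/queues/notifications"

while True:
    # Fetched messages become invisible to other consumers for
    # VisibilityTimeout seconds: the at-least-once behavior described above.
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        VisibilityTimeout=300,  # long enough for complex processing
        WaitTimeSeconds=20,
    )
    for msg in resp.get("Messages", []):
        process(msg["Body"])  # may be slow, may crash midway
        # Deleting the message is the "ack": if we crash before this line,
        # the message becomes visible again after the timeout and another
        # consumer picks it up.
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
```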
We're using RADOS for the scalability and availability of the solution, and we're based on the AWS SQS API, but we're not a general-purpose message queue; we implemented this only for the sake of notifications. However, in the future, we may expose more APIs to make it a more general-purpose message queue solution.

And I know I've said that we don't want to add more complexity to the solution, which is why we want the internal, or native, message queue. But sometimes you would like to use an external one. This could be because the external solution is already there, or for other reasons; for example, your serverless function, or your scaler in the case of KEDA, may already support a certain type of message queue. You can use Kafka for a scalable cloud solution, or RabbitMQ for a more lightweight, compact solution. Both of them come with operators that make deployment in Kubernetes much, much easier: Kafka has Strimzi, and RabbitMQ has its own Kubernetes operator.

So currently there are two CNCF serverless frameworks: Knative and KEDA, Kubernetes Event-Driven Autoscaling. Both support scale-down to zero through autoscaling. The difference between the two is that Knative is more network-oriented and KEDA is more service-provider oriented. Here we have the two diagrams from the architecture picture. On the Knative side, you see that the serverless function is defined as a Knative Service. This manages two different concepts called Routes and Configurations. The Routes split the traffic into different Revisions, and Revisions map to different ingress endpoints implemented in the networking stack. When Knative started, it used Istio as its default networking stack, but it has expanded beyond that and supports multiple plug-in models; currently you can use Istio, Gloo (from Solo), Kourier, and probably even more. The next thing with Knative is that the serverless functions are HTTP triggered. You only need to focus on your business logic: how to decode those CloudEvents and handle them. You do not have to worry about how the events are sent to the serverless functions; the event routing happens in the background.

On the other hand, KEDA is very service-provider oriented. It deals with the service providers, such as Apache Kafka or other queueing mechanisms, by using metrics as heuristics for scaling up and scaling down. It also has its own implementation of scalers, which act as a bridge between queue detection and serverless function invocation; a hypothetical scaler configuration is sketched below. In KEDA, the serverless functions are more complicated: you have to deal with the events yourself, receive and decode them, and in certain cases remove them from the queue once the event has been handled. But KEDA is very lightweight. It does not require a lot of dependencies, and we will see in the next slide that this is very suitable for our use case as a standalone, single-purpose serverless deployment.
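As an illustration of that bridge, here is a hypothetical KEDA ScaledObject that scales a consumer Deployment based on RabbitMQ queue length; the names, environment variable, and threshold are invented for this sketch:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: anonymizer-scaler
spec:
  scaleTargetRef:
    name: anonymizer        # the Deployment running the consumer loop
  minReplicaCount: 0        # scale down to zero when the queue is idle
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata:
        hostFromEnv: RABBITMQ_HOST   # AMQP connection string, e.g. from a Secret
        queueName: bucket-notifications
        mode: QueueLength
        value: "5"                   # target backlog per replica
```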
We chose KEDA as the serverless framework in this solution because of the following features. First of all, the serverless functions that process images have to be long-standing, because as long as images are being uploaded to the Ceph RGW buckets, the serverless function has to stay alive; otherwise, repeated scaling up and down would cause overhead, which is not efficient. Long-standing serverless functions become an issue for autoscalers: they can mess up the autoscaler's calculations and predictions. But this is not an issue for KEDA, because KEDA does not require external autoscalers. Second, KEDA is very lightweight. It does not need any dependent networking components, which makes it very suitable for edge computing or single-purpose applications. Third, because KEDA is not HTTP triggered, the serverless functions do not need external endpoints. External endpoints must always be available, which can be a problem given network connectivity or security issues. And last, KEDA preemptively autoscales serverless functions to meet incoming requests, which makes real-time processing a possibility.

Now let's look at the anonymization serverless function. The serverless function first reads the message from the message queue and finds the location of the Ceph RGW bucket. It downloads the image from the bucket, detects whether there is any personal or sensitive information, such as faces and license plates, blurs the rectangular region of interest, and then replaces the original image in the bucket; a minimal detection sketch follows at the end of this section. As you know, there are many object detection algorithms around, such as Haar cascade classifiers and deep neural networks. All these algorithms have similar traits: they detect well-defined features at different scales and then match them to predefined objects. The Haar feature cascade classifier is very fast, but it can only do one detection at a time: it detects only one class. If you have multiple classes, you have to have multiple classifiers, which is obviously not very convenient. DNN models, on the other hand, can detect multiple classes, but they are more complicated and often require high-performance hardware like GPUs and ASICs. So, without loss of generality, this solution uses the Haar feature cascade classifier from OpenCV to detect faces, because faces have well-defined landmarks, and a pre-trained TensorFlow model for license plate detection, because license plates come in many different forms and not all forms are the same, so a deep learning model is the better choice in that case.

And last, let's look at the cluster orchestration system called MicroShift. MicroShift is a lightweight implementation of OpenShift and Kubernetes. It is optimized for edge computing use cases with small form factor, resource-constrained devices, or for environments that only serve single-purpose workloads, like this one. It provides a minimal yet customizable OpenShift experience. When it boots up, it starts only a minimal set of OpenShift components, but you can add more and customize it with ease. It is a single binary with both the data and the control plane, so you can deploy it either as an RPM or as a container. It runs on Linux, macOS, or even Windows. Currently it supports both AMD64 and ARM64, while the community has already developed binaries to deploy on RISC-V and POWER systems.
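To close, here is the minimal detection sketch promised above, using the Haar cascade bundled with OpenCV. It is an illustrative sketch of the technique rather than our exact production code; the TensorFlow license-plate path would produce bounding boxes to blur in the same way:

```python
import cv2

# OpenCV ships pre-trained Haar cascades; load the frontal-face one.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def anonymize(path):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Scan for face-like features at multiple scales, as described above.
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Blur each rectangular region of interest beyond recognition.
        roi = img[y:y + h, x:x + w]
        img[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)
    cv2.imwrite(path, img)  # overwrite; the handler re-uploads this file
```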