Hello everyone, good afternoon. In this session we are going to talk about eBay's geo-distributed database on Kubernetes. We come from eBay. First, let me introduce myself. My name is Xinlong, and I work in the eBay data platform team. Actually, I'm a Kubernetes user, not a Kubernetes developer. This topic has two parts on the agenda. First, I will introduce the eBay geo-distributed database and share our experience of deploying such a database on Kubernetes. In the second part, my colleague Chenyue, from the eBay Kubernetes team, will do a deeper dive into storage management in Kubernetes at eBay. We built a few extensions for storage on top of Kubernetes, and that is the core thing that supports us in running the database on top of Kubernetes.

So first, let me start with the vision of the eBay geo-distributed database. As you know, eBay's vision is that people can shop for any item on eBay, from any place, at any time, on any device. So we have a requirement to keep the data close to the user, and that is why we developed the geo-distributed database. The vision of this database is to provide a resilient, collaborative, and cost-effective data platform to support eBay's global business. We want it to be always available, highly scalable to support the business growth, and high performance. Since we deploy in our cloud-native environment, we also want operational agility: we want everything to be fully automated. Otherwise it cannot survive; it doesn't work at this scale without that automation and the cloud-native environment. So that is the vision of our geo-distributed database at eBay.

Now let's look at what it looks like. Our geo-distributed database idea is actually quite similar to other distributed databases. We distribute the database because the dataset is too large to fit on one node. "Distributed" actually has two meanings here. First, we split the large dataset into small chunks; we call each chunk a shard, which is a range of the key space. Physically we call it a replica set (a replica set in our database, not the same as the ReplicaSet concept in Kubernetes, though it's quite similar). Each shard contains a small key range, a small piece of the global dataset. Second, we want each piece of data to have a few copies, and those copies are distributed over multiple machines in different zones. So that's the logical view of a distributed database.

Now I have a graph showing how it is physically deployed in a real environment. The database is deployed across multiple zones. The bottom layer is what we call the storage layer, which is just many storage nodes; each node is a bare-metal machine, physical hardware. On each storage node we spawn a few database engine processes. In the graph, the same color means the same data copy, and the number is just the shard ID. So logically a shard, for example shard number two, lives on the first, second, and fourth nodes. This storage layer is actually quite large: at the scale I'm talking about here, there are thousands of these database engines.
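To make that layout concrete, here is a minimal sketch in Go (my own illustration with assumed names, not eBay's actual code) of the structure just described: the key space split into shards, each shard owning a key range and carrying a few replicas placed in different zones.

```go
// Sketch of the logical layout: shards own key ranges, replicas are
// copies of a shard pinned to physical nodes in different zones.
package main

import (
	"fmt"
	"sort"
)

// Replica is one copy of a shard on a physical storage node.
type Replica struct {
	Node string // bare-metal storage node
	Zone string // availability zone / data center
}

// Shard owns the half-open key range [Start, End).
type Shard struct {
	ID       int
	Start    string // inclusive lower bound ("" = -infinity)
	End      string // exclusive upper bound ("" = +infinity)
	Replicas []Replica
}

// ShardMap is the global dataset split into ordered key ranges.
type ShardMap struct {
	shards []Shard // kept sorted by Start
}

// Lookup finds the shard that owns the given key by binary search.
func (m *ShardMap) Lookup(key string) *Shard {
	i := sort.Search(len(m.shards), func(i int) bool {
		s := m.shards[i]
		return s.End == "" || key < s.End
	})
	if i < len(m.shards) && key >= m.shards[i].Start {
		return &m.shards[i]
	}
	return nil
}

func main() {
	m := &ShardMap{shards: []Shard{
		{ID: 1, Start: "", End: "g", Replicas: []Replica{{"node1", "zone-a"}, {"node3", "zone-b"}, {"node5", "zone-c"}}},
		{ID: 2, Start: "g", End: "", Replicas: []Replica{{"node1", "zone-a"}, {"node2", "zone-b"}, {"node4", "zone-c"}}},
	}}
	if s := m.Lookup("item-12345"); s != nil {
		fmt.Printf("key routed to shard %d, replicas %v\n", s.ID, s.Replicas)
	}
}
```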
So how does our application access this data for reads and writes? Since we have thousands of storage engines in the storage layer, we cannot let applications connect to them directly. So we introduce a service layer, which routes application requests to the database engines in the storage layer. This service layer is actually stateless. But how can it do the routing? The service layer holds a topology map, so it knows the layout of the storage layer. But how does it know this layout? On the left side is what we call the orchestration layer, where we have a coordinator. The coordinator coordinates the service layer and the storage layer: it actively monitors, actually it controls, all of the storage engines in the storage layer. It manages their state, manages changes, and keeps its own metadata store for that state. Meanwhile, the coordinator also pushes these changes to the service layer, so the service layer knows the physical layout of the storage layer and can route requests based on this topology map. We also have a monitoring system, because a storage engine can sometimes slow down or have other problems. The monitoring system also watches the database engines in the storage layer, and in case something fails it sends a signal to the coordinator, and the coordinator takes action.
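As a rough illustration of this routing, here is a minimal sketch in Go (assumed names and shapes, not eBay's code) of a stateless router that holds a topology map pushed by the coordinator and picks a replica for a shard, preferring one in its own zone; the zone preference previews a latency optimization I will come back to.

```go
// Sketch of the stateless service layer: a topology map, updated by
// the coordinator, is consulted to route each request to a replica.
package main

import (
	"fmt"
	"sync"
)

// Topology maps shard ID -> zone -> database-engine endpoints.
type Topology struct {
	Replicas map[int]map[string][]string
}

type Router struct {
	mu   sync.RWMutex
	topo *Topology
	zone string // the zone this service-layer instance runs in
}

// ApplyUpdate is called when the coordinator pushes a new layout.
func (r *Router) ApplyUpdate(t *Topology) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.topo = t
}

// Route picks an endpoint for a shard, local zone first.
func (r *Router) Route(shardID int) (string, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	zones, ok := r.topo.Replicas[shardID]
	if !ok {
		return "", fmt.Errorf("unknown shard %d", shardID)
	}
	if local := zones[r.zone]; len(local) > 0 {
		return local[0], nil // closest copy: same zone
	}
	for _, eps := range zones { // fall back to any remote copy
		if len(eps) > 0 {
			return eps[0], nil
		}
	}
	return "", fmt.Errorf("no replica for shard %d", shardID)
}

func main() {
	r := &Router{zone: "zone-a"}
	r.ApplyUpdate(&Topology{Replicas: map[int]map[string][]string{
		2: {"zone-a": {"10.0.1.5:3306"}, "zone-b": {"10.1.2.8:3306"}},
	}})
	ep, _ := r.Route(2)
	fmt.Println("shard 2 ->", ep)
}
```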
So what I have been talking about so far is the overall picture of our geo-distributed database. But how does it relate to Kubernetes? We actually deploy the whole stack, including the storage layer and the orchestration layer, on Kubernetes. The orchestration layer and the service layer are mostly stateless, so deploying them on Kubernetes is not a big problem. The core part is how we manage the storage stack on Kubernetes. We know Kubernetes is great, but is it easy to deploy such a complex stateful workload on Kubernetes? The answer is: it's not easy. So now let me talk about the challenges we found when we deployed such a database on Kubernetes.

The first one: we are talking about a distributed database with geo-distribution, so typically we deploy this stack across multiple clusters. A Kubernetes cluster is typically scoped to one local availability zone or one local data center; it's just one cluster. So the first problem is that we have to manage multiple Kubernetes clusters and distribute our database engines across them. Once we deploy on multiple clusters, we hit the network: Kubernetes has its own internal network with good connectivity inside a cluster, but our database may have, say, three copies across two data centers that need to do real-time replication. So the second issue is how to enable network connectivity between database engines in different clusters.

Then the storage side. We want a high-performance, cost-effective data platform, but on Kubernetes, if we want to persist data we use a persistent volume, and a persistent volume is typically shared remote storage. That cannot fulfill our requirement, because remote typically means slow, maybe low IOPS, and shared means processes running on the same storage can impact each other. Next, when we persist data we also want to control the data placement, because we are running a database: we don't want two copies of the same data on the same machine, which would be a risk. So we want to know the data location and control the data placement. We also want to do some optimizations, for example, when routing a request from the service layer to the storage layer, choosing a close data node to minimize latency. So a database engine is more than a pod; it's a very complex workload that needs much more control over the process. The next one: a database manages data. It's not like a Deployment, where a pod goes away and you just create a new pod to replace it. We not only need to create the pod, we also need to move data from somewhere to restore it. So the next challenge is data lifecycle management, which includes data backup and data movement between database engines. These are the main challenges we faced when we tried to deploy such a workload on Kubernetes.

So now let's look at how we resolved these issues. On the network side, eBay is actually a bit ahead of the community here; I just saw a session on the container network interface, and maybe that can also support requirements like ours. At eBay we assign each pod a global IP, so pods in different Kubernetes clusters have network connectivity to each other: we have a controller that assigns a global IP when a pod is scheduled, and the IP is global across the different clusters. That resolves the network connectivity issue.

Next, we found that if we want very high-performance, very high-IOPS, dedicated volumes, only a local disk, or a disk on the same rack that can be attached like a local one, gives us that: high IOPS and dedication without impact from other neighbors. So, in collaboration with the eBay Kubernetes team, we utilize the local disk as the persistent volume. That means we can request a persistent volume, but what actually gets handed to the pod is a local disk. Meanwhile, when we register the persistent volume for a local disk, we put the physical location info, the rack and the host, as labels on the local persistent volume. That allows us, at the application layer, to know the location of the disk: which rack and which host it is on. Meanwhile, the Kubernetes team also enhanced the pod anti-affinity feature from the pod level to the volume level, because when we schedule, we may first need the persistent volume and only then schedule the pod. So we have a special implementation that supports volume-level anti-affinity: for the same shard, I don't want its two persistent volumes on the same host or the same rack. This gives us the high IOPS that makes our database high performance, and it supports anti-affinity for this placement scenario.
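To illustrate what those location labels buy us, here is a minimal sketch in Go (the label names "rack" and "host" follow the description above, but the exact schema is my assumption) of the volume-level anti-affinity check: a candidate local-disk PV is rejected if it shares a host or a rack with a copy the shard already has.

```go
// Sketch of volume-level anti-affinity over location-labeled local PVs.
package main

import "fmt"

// LocalPV stands in for a local-disk PersistentVolume; in the real
// system the rack and host would be labels on the PV object.
type LocalPV struct {
	Name string
	Rack string
	Host string
}

// violatesAntiAffinity reports whether placing a new copy of a shard
// on candidate would put two copies on the same host or the same rack.
func violatesAntiAffinity(existing []LocalPV, candidate LocalPV) bool {
	for _, pv := range existing {
		if pv.Host == candidate.Host || pv.Rack == candidate.Rack {
			return true
		}
	}
	return false
}

func main() {
	shard2 := []LocalPV{
		{Name: "pv-ssd-01", Rack: "rack-1", Host: "host-a"},
		{Name: "pv-ssd-17", Rack: "rack-3", Host: "host-f"},
	}
	for _, cand := range []LocalPV{
		{Name: "pv-ssd-22", Rack: "rack-1", Host: "host-b"}, // same rack: rejected
		{Name: "pv-ssd-31", Rack: "rack-5", Host: "host-k"}, // acceptable placement
	} {
		fmt.Println(cand.Name, "violates anti-affinity:", violatesAntiAffinity(shard2, cand))
	}
}
```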
The next thing we found is that a database always needs backup. Even with multiple copies, if something gets corrupted we have no time-machine capability to see the history of the data, so data backup is quite important for us. Our solution utilizes the local disk at the device level: we take snapshots and keep sending the deltas to sync them to a network volume, and we use this mechanism for our backup. The local disk keeps working at high IOPS, and we just keep doing regular delta sends to the remote volume. If we just used remote shared storage, we could maybe take the snapshots on the remote storage side; since we use local disks, we added this extra capability to enable backups.

Now, that is just the infrastructure level, where we resolved the network issue and the disk issues. But we found that once we utilize these features, we can no longer use a Deployment or StatefulSet, because those mean Kubernetes controls your pod lifecycle, and we have very special requirements, like keeping a local disk and its pod together, that those controllers cannot handle. So the solution we chose is to use bare pods: we use Kubernetes just to manage the containers, and on the upper layer we created our own orchestration layer to maintain the desired state. That is the orchestration layer I mentioned in the earlier slides; it maintains our desired state.

But how can we maintain the desired state? First, we have our own config store to hold the desired state. We also actively monitor all of the physical resources in Kubernetes. Then we compare the difference and take action, as a loop: we watch Kubernetes changes and act to make them match our desired state. It's like what Kubernetes itself does, except it spans multiple data centers, because there are multiple Kubernetes clusters, and it maintains the desired state from our own data store. We also utilize workflows to maintain the desired state, because as I said before, we have data moves, data operations, and Kubernetes operations together: sometimes I create a new pod and then need to move some data. It's not a single simple operation; it has several complex steps. So we use a workflow mechanism to implement these complex changes and maintain the desired state.

Now here is an example of how we do self-remediation. Say I have a replica and its host goes down; Kubernetes deletes this replica, shown in red. We are watching this event, and it gets sent to our coordinator. We actually have two levels of coordinators. Each data center has its own local one, which prevents one data center from impacting the others, and there is a logically global one, itself distributed across multiple data centers. The local coordinator forwards the event to the global one; the global one makes the decision and triggers the remediation, sending a call to the local coordinator to create a new replica and add it to this shard. So this is quite similar to how Kubernetes works; we just expanded it across multiple Kubernetes clusters.
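Here is a minimal sketch in Go (assumed interfaces, not eBay's actual code) of that control loop: desired state from our own config store is compared against the replicas observed across the Kubernetes clusters, and a shortfall triggers a multi-step workflow (create pod, restore data, join the shard).

```go
// Sketch of the orchestration layer's reconcile loop over multiple
// clusters, driving desired state from an external config store.
package main

import "fmt"

// ReplicaState is one database-engine replica of a shard.
type ReplicaState struct {
	Shard   int
	Cluster string // which Kubernetes cluster it runs in
	Healthy bool
}

type DesiredStore interface{ DesiredReplicas(shard int) int }
type Observer interface{ Observe(shard int) []ReplicaState }

// Workflow models a multi-step repair: create pod, restore data, join shard.
type Workflow interface{ Run(shard int) error }

func reconcile(shard int, d DesiredStore, o Observer, w Workflow) {
	healthy := 0
	for _, r := range o.Observe(shard) {
		if r.Healthy {
			healthy++
		}
	}
	if missing := d.DesiredReplicas(shard) - healthy; missing > 0 {
		fmt.Printf("shard %d: %d replica(s) missing, starting workflow\n", shard, missing)
		_ = w.Run(shard) // create pod + move/restore data + register with shard
	}
}

func main() {
	// In the real system this runs continuously, driven by watch events
	// from every cluster; here we run a single pass for illustration.
	reconcile(2, stubStore{}, stubObserver{}, stubWorkflow{})
}

type stubStore struct{}

func (stubStore) DesiredReplicas(int) int { return 3 }

type stubObserver struct{}

func (stubObserver) Observe(int) []ReplicaState {
	return []ReplicaState{{2, "cluster-a", true}, {2, "cluster-b", true}, {2, "cluster-c", false}}
}

type stubWorkflow struct{}

func (stubWorkflow) Run(int) error { return nil }
```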
So that is all of my talk. I shared our eBay geo-distributed database, the challenges we faced when we moved it onto Kubernetes, and how we solved them. The core of our solution is actually based on eBay's Kubernetes storage management. I also think this pattern is quite typical; maybe it can be extended to other high-IOPS, geo-distributed workloads. That's all for my part. I will hand over to my colleague Chenyuan for the next part.

I'm from the eBay Kubernetes team. My working domain is OS management and volume management on the eBay Kubernetes clusters. Next I'll give an introduction to storage management on the eBay Kubernetes clusters. It supports the geo-distributed database, and it can support other eBay applications as well. Here is the agenda of my topics: first I will introduce our storage classes, next our local volume solution, our network volume solution, and our backup solution, and lastly I will give a brief summary of our future work.

Okay, storage classes. In our Kubernetes clusters we define two storage classes. The first is "performance": it provides 50K IOPS and a maximum of 500 MB per second of throughput. The typical use cases are the geo-distributed database and some other eBay in-house applications. The storage backend is local SSD; in Kubernetes it is our local SSD volume. The second storage class is "standard": it provides 300 IOPS and a maximum of 300 MB per second of throughput. The use cases are things like the backup service, or the eBay CI/CD that was introduced in yesterday's session. The storage backend is a Ceph HDD-based cluster; in Kubernetes it is our standard volume.

Okay, then on to the local volume. We started our local volume work last year. At that time there were quite a few network volume solutions in Kubernetes but no local volume. We compared against network volumes and found we still had quite a few strong reasons to implement a local volume. The first is performance. I think it's common sense that a local SSD's performance beats a network volume, especially on latency. We have test data showing that writing a 4KB block on a local SSD takes about 200 nanoseconds, while writing it to an iSCSI volume takes about 1,800 nanoseconds. So the local SSD's performance is much better. The second is cost: we all know commercial high-performance network storage is very costly. The third reason is availability. I think most of us would say a single local SSD's reliability is not good, and compared with a network volume its availability is not good either. But nowadays most distributed databases actually manage data availability on their own: they usually keep several copies on different nodes and different disks, which means data availability is guaranteed at the application level. So a single SSD's reliability is not a big issue now. Also, at that time Kubernetes only provided emptyDir and hostPath as the ways for a pod to access a local disk, and these two volume types have quite obvious limitations. First, they don't support PV and PVC. Second, they cannot guarantee size or IOPS; for instance, if a hostPath shares the same disk with the root FS, the pod can potentially use all the disk space of the root FS. For these reasons we started to implement our own local volume.

We made three main changes to the Kubernetes code to implement it. The first part is the local volume plugin. We implemented this plugin to support whole disks and disk partitions. The local volume plugin is similar to all the other Kubernetes volume plugins: we implemented the major interfaces, like mounter, unmounter, provisioner, and deleter, and we registered this volume plugin in the kubelet and the kube-controller-manager, so it works the same as every other volume plugin in Kubernetes. Second, we implemented PV and PVC support, so a user can just create a PVC for the local volume and use that PVC in their pod spec the same as any other PVC. For PV creation, we currently only support static PV creation: during node provisioning we generate a config file, and in this config file we indicate which partitions or which disks can be used as local volumes.
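As an illustration, here is a minimal sketch in Go of what that static provisioning step could look like; the config format and all field names are my assumptions, since the talk only says the file lists which disks or partitions may be used as local volumes.

```go
// Sketch: parse a node-provisioning config that whitelists local disks,
// and derive the PV registrations the kubelet would perform on startup.
package main

import (
	"encoding/json"
	"fmt"
)

type LocalDisk struct {
	Device   string `json:"device"`   // e.g. /dev/sdb1
	Capacity string `json:"capacity"` // e.g. 800Gi
}

type LocalVolumeConfig struct {
	Node  string      `json:"node"`
	Rack  string      `json:"rack"`
	Disks []LocalDisk `json:"disks"`
}

func main() {
	raw := []byte(`{
	  "node": "host-a", "rack": "rack-1",
	  "disks": [
	    {"device": "/dev/sdb1", "capacity": "800Gi"},
	    {"device": "/dev/sdc1", "capacity": "800Gi"}
	  ]}`)
	var cfg LocalVolumeConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	for i, d := range cfg.Disks {
		// The real code would create a PV API object (with rack/host
		// labels, as described earlier) and mount the device on the host.
		fmt.Printf("register PV local-%s-%d: %s (%s, rack=%s)\n",
			cfg.Node, i, d.Device, d.Capacity, cfg.Rack)
	}
}
```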
So when the kubelet starts, it reads this configuration file, creates the PV instances in etcd, and also mounts the disks on the host node, like these two disks. We mount them first on the host, and when the pod starts, these two disks are mounted into the pod's namespace so the pod can access the disks directly.

The last part is the changes in the scheduler. The reason is that local volume PV and PVC binding is special; it is not the same as network volume PV/PVC binding, because whenever a local volume PV and PVC are bound, the node is also effectively selected, and that would break the pod scheduling policies that run later. For this reason we postponed the PV/PVC binding into the scheduler instead of doing it in the PV controller. We made two changes. First, we added a local volume predicate function, which checks, for every node, whether it can fulfill the pod's local volume requests. Second, in the scheduler's scheduleOne function we do the final PV and PVC binding, right after the pod is bound to the node. So those are the three changes in Kubernetes we made to implement the local volume.
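Here is a minimal sketch in Go of those two scheduler changes (types and names are simplified assumptions, not the real kube-scheduler code): a predicate filters out nodes that cannot satisfy the pod's local-volume claims, and the final PV/PVC binding happens in scheduleOne only after a node has been chosen.

```go
// Sketch: local-volume predicate plus delayed PV/PVC binding.
package main

import "fmt"

type Node struct {
	Name     string
	FreeSSDs int // unbound local-SSD PVs available on this node
}

type Pod struct {
	Name      string
	LocalPVCs int // how many local-volume claims the pod requests
}

// localVolumePredicate mirrors the added predicate: a node passes only
// if it can satisfy every local-volume claim of the pod.
func localVolumePredicate(p Pod, n Node) bool {
	return n.FreeSSDs >= p.LocalPVCs
}

// scheduleOne picks a node first and only then binds the claims, so
// that PVC binding cannot pre-empt the other scheduling policies.
func scheduleOne(p Pod, nodes []Node) (string, error) {
	for _, n := range nodes {
		if localVolumePredicate(p, n) {
			// ... other predicates and priorities would run here ...
			bindPVCs(p, n) // final PV/PVC binding, after node selection
			return n.Name, nil
		}
	}
	return "", fmt.Errorf("no node can satisfy %d local PVC(s)", p.LocalPVCs)
}

func bindPVCs(p Pod, n Node) {
	fmt.Printf("bound %d local PVC(s) of %s to PVs on %s\n", p.LocalPVCs, p.Name, n.Name)
}

func main() {
	nodes := []Node{{"node-a", 0}, {"node-b", 2}}
	if n, err := scheduleOne(Pod{"db-engine-0", 2}, nodes); err == nil {
		fmt.Println("scheduled on", n)
	}
}
```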
Here is an example of how we use the local volume PVC in production. We just define our local volume PVC with the storage class local SSD; it is similar to any other PVC, and in the pod spec we just put the claim name, and that's it.

Here are several points we considered during the design and implementation of the local volume. The first is where to do the PV binding: as I mentioned before, do we do the PV and PVC binding in the pod scheduler or in the PV controller? As mentioned, if we do it in the PV controller, it breaks a lot of pod scheduling policies. Our customer, the geo-distributed database, requires the anti-affinity feature and also defines pod memory and CPU resources, these kinds of scheduling requests, and we must fulfill them. That is what made us postpone the PV/PVC binding to the scheduler. Another thing is that some PV features would be more complex to implement if we did the binding in the PV controller. For example, if one pod requires two local volume PVCs and we bind in the PV controller, it's possible one PVC binds to node A and the other to node B; then how can the pod be scheduled at all? For these kinds of reasons we decided to put the PV binding in the scheduler.

The second point is how we recycle the local volume. When the PVC is deleted, we must also delete the data on the PV. From the very beginning we used the recycler pod approach: whenever a PVC is deleted, we start a pod, schedule it to that node, and use it to remove all the data on the specified PV. At the very beginning we were worried about its scalability, but we tested deleting 3,000 PVCs at one time and found that the solution still works well, so we keep it in production.

The third point is disk failure monitoring and remediation. We know the disk is actually the most unreliable device in a machine compared with the memory, CPU, and network interface, so its failure rate is high. Our current solution is to run a DaemonSet pod that calls RAID controller and disk tools, such as MegaCli or smartctl, to detect disk failures, and whenever we detect a failure we send out alerts. For the remediation part, currently we can still only remediate manually; this may be improved later.

Okay, then on to the network volume. I think network volumes still play a very important role in our Kubernetes clusters. One of the most important features of a network volume is that it attaches to a node, but whenever the pod is deleted or crashes and the user wants to start a new pod on another node, the network volume can be reattached to that other node and the pod can keep using the same volume. This is one of the most important features of network volumes in Kubernetes. The second thing is that when users use a network volume, they assume data availability is guaranteed by the network volume's backend, like Ceph, so they don't need to care much about how their application guarantees data availability, which makes things much easier for users. That is why we still keep network volumes in our production environment.

We use Cinder volumes as our network volume solution. At the very beginning we used the standard OpenStack Cinder volume, but after one year of user experience we found it has some problems. First, it depends on too many components. Second, it cannot support bare metal. For these reasons we switched to the standalone Cinder provisioner from the Kubernetes community. This figure shows the two different solutions. On the left side is the standard Cinder-plus-Nova solution: the Kubernetes node is a VM, and you can see how many components a volume attach/detach depends on: QEMU, nova-compute, RabbitMQ, Nova, and Cinder. In production, if any two of these components have some data drift between them, our volume attach/detach fails, and that occurred quite often in our production environment. That's the reason we switched to the standalone Cinder provisioner. With the standalone Cinder provisioner we can still let users use the Cinder PVC interface, but when a user creates such a PVC, the standalone provisioner creates an RBD or iSCSI PV on the Kubernetes side. It has much less third-party dependency, and it improves the reliability of network volume attach/detach.

Okay, then on to the backup solution. Currently our backup solution is based on Cinder volumes. Our storage team, the Ceph team, has already implemented a snapshot controller, and they run one snapshot controller for each Cinder cluster. When a user creates a pod with a PVC, and this PVC is a network volume backed by Cinder, then if they want to take a snapshot, they send a request to the snapshot controller, and the snapshot controller triggers Cinder to do the snapshot. Later, when the client wants to recover the data from a snapshot, they send a request to the snapshot controller too. The snapshot controller turns the snapshot into a new Cinder volume and returns the volume UUID to the snapshot client. With this volume UUID, the user can create another pod, specifying the volume UUID in the PVC so another Cinder volume PV is created, and this new pod can use the recovered data. This is our backup solution in Kubernetes.
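Here is a minimal sketch in Go of that snapshot/restore round trip (the controller interface and all names are my assumptions; the real snapshot controller is a separate service talking to Cinder): take a snapshot of a Cinder volume, later turn the snapshot into a new volume, and hand the volume UUID back so a new PV/PVC can reference it.

```go
// Sketch of the backup flow through a per-cluster snapshot controller.
package main

import "fmt"

// SnapshotController stands in for the controller run per Cinder cluster.
type SnapshotController interface {
	Snapshot(volumeID string) (snapshotID string, err error)
	Restore(snapshotID string) (newVolumeID string, err error)
}

type fakeController struct{ n int }

func (c *fakeController) Snapshot(vol string) (string, error) {
	c.n++
	return fmt.Sprintf("snap-%s-%d", vol, c.n), nil // Cinder takes the snapshot
}

func (c *fakeController) Restore(snap string) (string, error) {
	return "vol-restored-from-" + snap, nil // new Cinder volume from the snapshot
}

func main() {
	var ctl SnapshotController = &fakeController{}

	snap, _ := ctl.Snapshot("vol-1234")
	fmt.Println("took snapshot:", snap)

	// Later: recover. The returned UUID goes into a new PV/PVC so a
	// new pod can mount the recovered data.
	vol, _ := ctl.Restore(snap)
	fmt.Println("create PV/PVC referencing Cinder volume:", vol)
}
```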
Okay, lastly I will give a quick introduction to our future work. The first item is that we need to align our code with the upstream local volume work. Notably, upstream has recently merged the topology-aware volume scheduling PR, so I think most of the local volume features we required are now upstream; later we need to do some work to align our implementation with the upstream one. The second part is that we need to do some disk and partition modeling during provisioning. The purpose is to support multiple hardware SKUs in our production: during node provisioning we can use this model to specify which disks and partitions can be used for local volumes, which makes it a more generic solution. The third point is that we will consider the file system for the local disk. We want to take advantage of a powerful file system like ZFS; some ZFS features, such as snapshots, may help us enhance our local volume. Okay, that is all for my introduction. Thank you.