Thank you, everybody, for joining us at the etcd maintainer track talk. My name is Wenjia from Google, and I am an etcd maintainer. I'm also a co-chair of SIG etcd with James here. Joining me today in this talk are Siyuan and Benjamin. Do you guys want to give a quick introduction? Hi, I'm Siyuan. I work for Google, and currently I'm a contributor to etcd and API Machinery. Hello, everyone. I'm Benjamin Wang. I'm from Broadcom. I'm an etcd maintainer and also an etcd tech lead. Yeah, thank you. All right, today we have a packed schedule. We will talk about what we have done in the etcd project over the previous couple of months, and then we'll look forward and see what's coming up. So, SIG etcd is a newly started SIG. Some of you were probably in Chicago for KubeCon; the SIG was started almost right before the Chicago KubeCon. It's the newest SIG in Kubernetes, and James and I are honored to serve as its chairs, while Benjamin and Marek are the tech leads of SIG etcd. This SIG owns the etcd project and how Kubernetes uses etcd. All right, now Benjamin and Siyuan will give you an introduction to the updates of the project. Hello, I'm Benjamin Wang. Firstly, I will provide some updates on the project: what we have done for 3.6. Firstly, bbolt is one of the core dependencies of etcd, and we have made some improvements to it. The first improvement is that we added logging to bbolt. Previously, bbolt only returned an error for each API call. Now bbolt not only returns errors but also prints logs, which can help developers debug issues. The biggest progress is that we are resolving the data corruption issues. In the past three years, we received 18 corruption issue reports in the community, across both the bbolt and etcd communities. The symptom is that bbolt may panic when loading the DB file. It took us a long time to analyze all the data corruption issues. Eventually, we identified a couple of reasons.
The first reason is about how the application inserts data into bbolt. bbolt uses a B+ tree to index all the data, and an insert takes two steps: the first step is to locate the right position for the key, and the second step is to insert the key-value pair at that position. If the application updates the data in between, then bbolt may insert the key-value pair into the wrong position. This is the first reason for data corruption. The second reason comes from a Linux kernel feature, fast commit. Fast commit is an ext4 feature introduced in Linux kernel 5.10, and it may cause data loss in some corner cases. From bbolt's perspective, the root cause is that the fdatasync system call may return success even though the kernel hasn't synced the data successfully. Please see the link for more detailed information. We also developed some surgery commands to fix corrupted DB files. But we need to understand that the key point to prevent data corruption is redundancy. We can't avoid hardware issues; storage may break in production environments. So we should have multiple replicas of the data in production, for example multiple members in a cluster, or real-time backups, so that we can tolerate any single point of failure or corruption. There are some other minor improvements. For example, we support moving buckets inside the same DB file. bbolt supports a hierarchical bucket structure: each bucket can have child buckets, and now we can move a bucket from one parent bucket to another without moving the data. It's a minor improvement. We also support inspecting the database structure: for example, what buckets are in the DB file, and how many key-value pairs are in each bucket. There are also some minor performance improvements. We are planning to release bbolt 1.4.0; please refer to the issue for the detailed release plan. We have already released 1.4.0-alpha.0. Please refer to the changelog for the complete changes in bbolt.
Yeah, raft is the second core dependency of etcd. You know, raft has already been moved into a separate repo under the etcd organization. We also changed the module name: we removed "etcd" from the path. Please refer to the issue for the detailed information. The biggest change in raft is that we support asynchronous writes. You know, raft follows a minimalist design philosophy: it delegates storage and networking to the application, and in our case, the application is etcd. etcd and raft communicate with each other via a channel, so etcd receives all the messages from the raft channel. Previously, etcd had to sync the raft write-ahead log (WAL) synchronously in each iteration. But now, if the feature is enabled, etcd can sync the raft WAL asynchronously, which can reduce latency by 20% to 25%. Please refer to the PR for more detailed information. There are some other minor features. For example, we added ForgetLeader. If we're pretty sure the leader is dead, we can call ForgetLeader on the followers and then instruct one follower to campaign, so it can be elected as the leader immediately. If we don't call ForgetLeader, a new leader will also be elected automatically, but we have to wait for the election timeout. With ForgetLeader, there's no wait at all; a new leader gets elected immediately. We also added a config item, StepDownOnRemoval: when a leader is removed, we can instruct the leader to step down. Please refer to the changelog for all the minor improvements. We are planning to release raft 3.6.0; please refer to the issue for the detailed release plan. We have already released 3.6.0-alpha.0. So, bbolt and raft are the two core dependencies of etcd. Both etcd 3.4 and 3.5 depend on bbolt 1.3, and etcd 3.6 depends on bbolt 1.4.
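The asynchronous WAL sync described above can be pictured with a small channel pipeline. This is a toy model in plain Go, not the real raft API (the actual feature is an option in the go.etcd.io/raft library): the processing loop hands entries to a dedicated writer goroutine instead of blocking on the disk sync in every iteration.

```go
package main

import (
	"fmt"
	"sync"
)

// entry stands in for a raft log entry that must be persisted.
type entry struct{ index uint64 }

// syncWAL simulates an fsync'd append; in real etcd this is the
// expensive disk write that used to sit inside the raft loop.
func syncWAL(log *[]entry, mu *sync.Mutex, e entry) {
	mu.Lock()
	defer mu.Unlock()
	*log = append(*log, e)
}

// runAsync models the asynchronous-writes shape: the raft loop hands
// entries to a writer goroutine over a channel and immediately moves
// on to the next message, instead of blocking on the sync each time.
func runAsync(entries []entry) []entry {
	var (
		log []entry
		mu  sync.Mutex
		wg  sync.WaitGroup
	)
	toDisk := make(chan entry, len(entries))
	wg.Add(1)
	go func() { // writer goroutine: owns all WAL syncs
		defer wg.Done()
		for e := range toDisk {
			syncWAL(&log, &mu, e)
		}
	}()
	for _, e := range entries { // raft loop: never blocks on disk
		toDisk <- e
	}
	close(toDisk)
	wg.Wait() // all entries are durable before we report back
	return log
}

func main() {
	out := runAsync([]entry{{1}, {2}, {3}})
	fmt.Println("persisted:", len(out)) // persisted: 3
}
```

Because a single writer goroutine drains a FIFO channel, the persisted order still matches the submission order, which is what lets correctness survive the change.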
You know, raft was included in the etcd repo in the 3.4 and 3.5 releases, so there was no external dependency. But we moved raft into a separate repo starting from 3.6, so etcd 3.6 depends on raft 3.6. Please refer to the linked dependency mapping for the detailed information. You know, etcd supports not only the gRPC API but also a RESTful API, and there are some examples of the RESTful API. We actually support the RESTful API using grpc-gateway. grpc-gateway v1 works with protobuf v1, and grpc-gateway v2 works with protobuf v2. In 3.6, we bumped grpc-gateway to v2. But the problem is, we still depend on protobuf v1, so we ran into some compatibility issues. Eventually, we applied a patch on the source code generated by grpc-gateway v2 to make sure it can work with protobuf v1. Please refer to the PR for more detailed information. Over to you, Siyuan. Since last KubeCon, we have added some new health check endpoints: /livez and /readyz. /livez is an endpoint that reflects whether the etcd process is alive or not, so /livez will return failure if the process needs a restart. The /readyz endpoint reflects whether the etcd process is ready to serve traffic. With these two HTTP endpoints, the etcd health check is fully consistent with the Kubernetes API conventions. So if you're configuring your Kubernetes probes, please use the new health endpoints. This work was done in collaboration with Charles from Amazon. And then another one is about a sub-project. Auger is a tool used to check etcd data for Kubernetes. This is a project authored by Joe Betz, and he has kindly donated the project to the etcd organization. With the auger tool, you can directly access data objects stored in etcd by Kubernetes, and you can directly encode and decode Kubernetes objects using the Kubernetes encoding scheme. And probably the most useful feature of this tool is to analyze the data stats by object kind.
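The stats-by-kind idea can be sketched in a few lines. This is an illustration of the concept, not auger's actual implementation: Kubernetes stores objects under keys shaped like /registry/&lt;resource&gt;/[&lt;namespace&gt;/]&lt;name&gt;, so grouping key-plus-value sizes by the resource segment gives a rough per-kind storage breakdown.

```go
package main

import (
	"fmt"
	"strings"
)

// statsByKind groups the size of each key-value pair by the resource
// segment of Kubernetes storage keys, which follow the pattern
// /registry/<resource>/[<namespace>/]<name>.
func statsByKind(kvs map[string]string) map[string]int {
	stats := make(map[string]int)
	for k, v := range kvs {
		// First path segment after /registry/ is the resource kind.
		parts := strings.SplitN(strings.TrimPrefix(k, "/registry/"), "/", 2)
		stats[parts[0]] += len(k) + len(v)
	}
	return stats
}

func main() {
	kvs := map[string]string{
		"/registry/pods/default/web-1":   "podspec...",
		"/registry/pods/default/web-2":   "podspec...",
		"/registry/leases/kube-system/x": "lease...",
	}
	for kind, bytes := range statsByKind(kvs) {
		fmt.Println(kind, bytes)
	}
}
```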
So you can easily find what kinds of objects are consuming the most storage in your etcd. And last but not least, yesterday we just released etcd 3.4.31, thanks to the work of James and Benjamin. Yeah, I will quickly go through the roadmap and priorities for 3.6. We have already covered some items in previous slides, so I want to highlight the two most important features in 3.6. The first one is downgrade support; the second one is v2 storage deprecation. 3.6 will be the first minor version to officially support downgrades, so we can support downgrading from 3.6 to 3.5. The high-level idea is that when the user runs a downgrade, the first step is to migrate the data schema to the previous version, and the second step is to replace the binary or image with the previous version. We repeat these steps for all the members in the cluster, one by one. We had already deprecated the v2 store in 3.4 and 3.5, but users could still enable it. In etcd 3.6, the v2 store will be decommissioned: all the data will be persisted only in the v3 store, i.e., bbolt. But we still maintain the v2 snapshot to support downgrades to 3.5. As mentioned previously, we are planning to release raft 3.6.0 and bbolt 1.4.0. And as mentioned by Siyuan, we have added two separate endpoints for the liveness and readiness health checks. Previously, we only had one health check endpoint, and we used a parameter to differentiate the use cases. But it was not easy to use and hard to understand, so we added two separate endpoints, which is much easier to understand. As mentioned previously, we also bumped gRPC and grpc-gateway to the latest versions. Yeah, that's all from my side. So now I will talk about some project opportunities in etcd. A lot of contributors are asking, how can we contribute? So here are some areas we are working on and need some help with. The first one is downgrade.
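The rolling, member-by-member downgrade flow described above can be sketched as follows. This is a minimal illustration with made-up types, not etcd's real orchestration code:

```go
package main

import "fmt"

// member models one etcd member in a cluster during a rolling
// downgrade. The fields here are illustrative only.
type member struct {
	name          string
	schemaVersion string
	binaryVersion string
}

// downgrade applies the two-step flow to every member, one at a
// time: first migrate the data schema down, then swap the binary,
// so the cluster stays available throughout.
func downgrade(cluster []member, target string) []member {
	for i := range cluster {
		cluster[i].schemaVersion = target // step 1: migrate the data schema
		cluster[i].binaryVersion = target // step 2: replace the binary/image
		// In a real cluster you would wait for this member to rejoin
		// and become healthy before touching the next one.
	}
	return cluster
}

func main() {
	cluster := []member{
		{"etcd-0", "3.6", "3.6"},
		{"etcd-1", "3.6", "3.6"},
		{"etcd-2", "3.6", "3.6"},
	}
	for _, m := range downgrade(cluster, "3.5") {
		fmt.Println(m.name, m.schemaVersion, m.binaryVersion)
	}
}
```

The important property is the ordering: the schema must be understandable by the older binary before that binary ever starts, which is why the migration runs first.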
So as Benjamin mentioned, downgrade is one of the most important features we will release in 3.6, and we're still in the middle of it. We just added downgrade support from 3.5 to 3.4, and currently we're working on the downgrade support from 3.6 to 3.5. These are the tasks in this effort. The first one is that we introduced a new storage version, stored and persisted in the DB file. The second one is that we introduced a structured DB schema, and then added annotations to the structs and protos to distinguish fields added in different versions. Then there's a new downgrade command to initiate a downgrade, which coordinates all the members to lower the etcd cluster version and allow the older version to join. And there is a new `etcdutl migrate` tool to downgrade the DB files and the WAL files offline. We still need to implement the online migration of the snapshot files, and we would like to add more downgrade tests. So if you're interested in working on this, please contact me on Slack. And then another effort we're working on is to migrate all the test workflows onto the Kubernetes infrastructure. We're a SIG in Kubernetes now, so we want to utilize the Kubernetes Prow infrastructure for our tests. We're in the middle of moving all the etcd GitHub workflows onto the Prow infrastructure. Hopefully, with this migration, we can automate flaky test detection, and then we can easily create and triage those issues. We also need a lot of help to deflake some of the etcd tests, so if you're interested in flaky tests, please contact me as well. And then another important aspect is etcd performance qualification. For any new feature we add to etcd, it's important to know that we're not degrading the performance, so performance qualification is important. With the help of Evan, we recently rewrote one of our performance tools in Go.
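The per-field version annotations mentioned in the downgrade tasks above can be illustrated with a small sketch. The field names and the version map here are hypothetical, just to show the idea of deciding which schema fields an older version can understand:

```go
package main

import "fmt"

// fieldVersions is a stand-in for the schema annotations: each
// persisted field records the minimum etcd version that understands
// it. These names are made up for illustration.
var fieldVersions = map[string]string{
	"term":           "3.5",
	"confState":      "3.5",
	"storageVersion": "3.6", // introduced in 3.6, unknown to 3.5
}

// migratableTo reports which fields can stay and which must be
// dropped before a DB file can be opened by an older binary -- the
// core idea behind the schema-migration step of a downgrade.
func migratableTo(target string) (keep, drop []string) {
	for f, v := range fieldVersions {
		if v <= target { // simple string compare is fine for "3.x"
			keep = append(keep, f)
		} else {
			drop = append(drop, f)
		}
	}
	return keep, drop
}

func main() {
	keep, drop := migratableTo("3.5")
	fmt.Println("keep:", len(keep), "drop:", drop) // keep: 2 drop: [storageVersion]
}
```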
And then right now we're working on creating new on-demand and periodic Prow jobs to make the performance testing more accessible. Once the performance tools are running regularly on Prow, we also plan to create a Kubernetes-style scalability dashboard to visualize the results over time. This is just to make sure we have stable performance across different new features in etcd. So if you're interested in this project, you can contact James. All right, thank you very much, Benjamin and Siyuan. So with that, I want to give some community shout-outs. The etcd mentorship program has been running for almost half a year, since the last KubeCon, and we have made a lot of progress. Wei from Microsoft has been working closely with Benjamin in the bbolt area, and he has been making consistent and high-quality contributions to the code base. He is now our newest reviewer in the bbolt area. Thank you very much, Wei, if you're here, or maybe you're watching the YouTube video. And also, Mustafa from Red Hat is working with Benjamin and Wei in the bbolt area, and he has done a lot of features in bbolt. Remember that slide of bbolt improvements? The reason we have so many good improvements in bbolt is the dedication of these engineers. Yeah, thank you, Mustafa. All right, the next two are mentees of Marek. Siyuan, as you can see, has done a lot of great work in the high-priority areas, including downgrade, health checks, auger, and the test framework. So thank you very much for your great contributions. All right, the next one is very special. James is already a chair of SIG etcd, and he has already done a lot of great work, continuously, in etcd, but that's just not enough for him. He asked to be a mentee and, with Marek's help, he is now taking over the performance area and has been making a lot of great effort to make sure the performance does not regress. Thank you very much. Okay, now last but not least, for sure.
Benjamin and Marek, thank you very much for your dedication, not only contributing to the project but also growing the community and growing its members. Thank you very much. None of this would have happened without you guys. Thank you. While the etcd mentorship program is still running, we want to introduce the sub-project governance framework in etcd. This is not anything very new; it's very similar to the setup of the rest of the Kubernetes SIGs. It aims to streamline development and enhance collaboration with our growing community. With a clearly defined scope for each sub-project, we want to empower the sub-project maintainers to take ownership of their specific areas. And a transparent decision-making process ensures consensus as well as welcoming community input. This structure should foster innovation while maintaining the stability of the overall project. To determine if something qualifies as a sub-project, there are several considerations we have in mind. First, it should be a substantial area of functionality in etcd. Second, we want to make sure there are dedicated individuals with expertise in each sub-project. And it should be something that serves the needs of the wider etcd community. With that, we have some potential sub-project candidates in mind. The first one is robustness testing. Marek has been talking about this in several of his previous talks, which you can find on YouTube. This is very important to make sure that etcd is resilient in production environments. The next one, the CI/CD workflow, currently led by James, is focusing on the development and delivery process, where we reduce manual overhead and improve reliability. The next one is etcdctl and etcdutl. This one has a relatively distinct development path and needs compared to the rest of the project, so we want to make sure it gets the attention it deserves.
It should continue to serve as a powerful and user-friendly tool for etcd administrators. The last one is kind of a question mark for us at this moment: the gRPC proxy. Kubernetes is not using it, and there aren't many use cases we are aware of that use it, so it's really hard for us to make a decision on whether we should continue supporting it or not. So if any of you are actually using it in production, or your colleagues, friends, or families are using it, let us know and help out. And I just want to make a note: before we can find dedicated people other than the current maintainers for all of these areas, the current etcd project maintainers will continue supporting all of them. But yeah, think about it. If you're interested, or if you already have expertise in one of these areas, please reach out to us. All right, now with the lined-up opportunities in this project and the framework that will support you in helping us out, let's see how to stay connected with etcd. How am I doing on time? All right, so right here, right now at KubeCon, we still have a couple of sessions left. I think right after this one, there is a session on CRDs and dedicated etcd as a storage backend; I think this is from our Sillian and SIG Scalability friends. I'm planning to go to that one; join me if you want to. And then on Friday, tomorrow, 11:00 to 1:30, there is the Kubernetes Meet and Greet. It used to be called the SIG Meet and Greet, so all the special interest group leads and contributors will be there. So please, if you have any questions, if you want to contribute to certain areas, if you want to be one of those dedicated individuals on a sub-project, please come and talk to us. And then tomorrow, in almost the last slot, 4:55 to 5:30, Sillian and Bogdan will give a talk about unleashing the power of etcd. And then throughout the week, we will have an etcd kiosk at the Project Pavilion.
So we'll always have people there, except right now, but you know where to find us. And it doesn't stop at KubeCon, as usual. You can always join us at our weekly etcd meeting. We alternate between a regular community meeting and a triage meeting, so you can find something to work on in the triage meeting if you're interested. It's Thursday, 11 o'clock Pacific time. And I promise we are seriously considering an EMEA- and Asia-friendly meeting time. Stay tuned; it will happen. And we have different channels for offline discussions: the etcd-dev Google Group, the etcd discussions page on GitHub, and #sig-etcd, our Slack channel. You can find all this information on the community web page. And with that, thank you very much. I want to invite Marek and James onto the stage to answer any questions you have. I think we have a bunch of microphones around. Questions? If you have questions, you can come up here. Thank you for all the improvements. I have a question about downgrade. One of the big impediments to Kubernetes downgrade was that etcd wasn't supporting downgrade. Do you know if there are any plans for adding support for Kubernetes downgrade now that etcd is going to support downgrade? That's one question. And the second question is: is the etcd downgrade non-disruptive? And for the snapshots, originally with the new format, do you also have on the roadmap plans to convert them to the old format as part of the downgrade? So, regarding the Kubernetes downgrade, I cannot speak for that. But if you just want to downgrade etcd in Kubernetes, that's now possible to do. And for the downgrade process, yes, you can downgrade your nodes one by one without taking down the whole cluster, so it's non-disruptive. And also, we will take care of the conversion of the snapshot files sent from the leader to the followers so that they conform to the old format.
Just want to add one thing about Kubernetes downgrade. For example, in GKE, the best practice is that you don't upgrade or downgrade Kubernetes and etcd at the same time. So they are two separate topics, yeah. Last year at KubeCon Europe, there was an etcd session where one of the maintainers was kind of ringing a bell that etcd is not as good as it should be. Now we are one year later. Can you give us your honest opinion on the state of etcd today, after one year? Thank you. I mean, do you have any particular questions, or which part? Overall, I think we are going in the right direction because we are focusing on quality. We are covering things between Kubernetes and etcd, semantic behaviors that were never really tested before, and by having these defined, tested, and stable, we will guarantee that the project itself will never diverge or cause any breakages. So with our current goals of sustainability, I think we are getting out of the main problems we were hitting before, which was having one person know everything, depending on them, and being really scared if that person leaves or, I don't know, wins the lottery. So now we are democratizing the knowledge. We are testing everything, automating it, and making it visible to the whole community, like James's work on benchmarking and the work on robustness, so that everyone can come in and validate everything by themselves, instead of depending on a single person or some forgotten Google documents or processes that were never written down. So yeah, I think we are getting there. Maybe just to add one very short piece of context to that: the session referenced was at KubeCon Amsterdam, on the hunt for etcd data inconsistencies. So if anyone wants to go back and have a look at that, that's the session you want to look for.
That talk covered the introduction of the robustness testing framework, which has had a lot of improvements since then. There was also the talk in Detroit about the previous topic of losing the knowledge from previous maintainers. So yeah, those two talks. But are we completely staffed yet? No, so we still need help. Yeah, there was a mention of async writes support for storage being added. Is that a trade-off between reducing write latency versus availability and data durability, or are there no risks once you have async writes? Sorry, can you maybe say that again? As part of the raft improvements, there was a reference to etcd now supporting async writes. Oh, yeah, yeah. The async write is the biggest feature in raft. Previously, you know, etcd and raft communicated with each other via a channel, and etcd processed all the raft messages in a loop. Previously, etcd needed to sync the raft write-ahead log in each iteration before it could receive the next message in the next iteration. But now, when the feature is enabled, etcd can sync the raft WAL asynchronously, so it can reduce the latency by 20% to 25%. Yeah, there is no impact on correctness or availability. The difference is that currently the write throughput is limited by the raft protocol and network communication, because you need to wait for the quorum. So by making writes asynchronous, we are moving the bottleneck from coordination to the disk. So etcd, with the change, will be able to fully utilize the write throughput instead of being bottlenecked on network and coordination through raft. Just one thing to add: this is a feature in raft. Raft has the ability to let the application process the WAL writes asynchronously, but this feature hasn't been integrated into etcd yet. We are planning to integrate this feature into etcd in 3.7. That is the plan.
OK, so the writes to storage and bbolt maybe are still synchronous, but it's the raft protocol itself that is now becoming asynchronous. Yes. Thank you. Thank you.