Hi, everyone, and welcome to the Zephyr Developer Summit. Today we're going to talk a little bit about flexible system design with RPCs, and about really embracing distributed computing concepts in Zephyr. The origin of this talk came from a real-life experience we just had on Chrome OS. We have two separate processors, the EC and the application processor, the AP, and we wanted to make better use of the resources that already existed. Many application processors, from Intel and AMD for example (there are others), come with dedicated sensor cores, and what we wanted to do was move the sensor logic over. We found that to be more difficult than we had hoped: there were built-in dependencies, linker problems, tests, and prior design assumptions that were broken during the process. The question we wanted to ask ourselves was: could we have done better? Were there things we could have designed differently that would have made this an easier, more extensible transition? The answer was yes, and that's what we're going to talk about today. We'll cover what a portable design is; the feasible solution we came up with, namely Pigweed RPC and protobufs; how we can transition from headers to services, and how our mindset can shift into this new paradigm in the embedded world; and then we'll actually build a very small, simple example. Starting off with portable design, the spoiler is that these are really microservices, which is a fairly unique thing to bring into the embedded world. In our example, we're going to focus on the motion sense task that the EC had, but other tasks were identified as possible candidates for something like this.
You can see here the power delivery task, for instance. The real question is: does it matter where these tasks actually live? This logic can be anywhere, as long as the I/O exists and the bus is laid out so the information can flow. If we break the design down, we can think of a proper design as having the right abstraction layer at the right place. Once that's done and the data channel is abstracted correctly, our services and our clients can remain unchanged. When we move things or change things, they'll stay compatible, possibly even across different versions of each other. That's where RPCs come in: they were originally designed to be extensible and to not care what the other end does, and Pigweed lets us do this in the embedded space. You might be asking at this point: what is Pigweed? It's really a collection of tools, kind of a Swiss Army knife for embedded applications. We're going to focus on just two of its modules in this talk, but there are a lot more; please look at their web page at pigweed.dev and check out the other modules. We'll very briefly cover RPC concepts here; there's a lot more information about these, and again, I highly recommend reading up on them. The three primary modes you might see are laid out here (there's a fourth, which combines client streaming and server streaming). Unary you can think of as just out and back. Client streaming is when the request might not have been fully computed yet, so we might want to send multiple requests before we get a response. Server streaming is the opposite: we send one request and then open a stream of data coming back to be processed. A simple service might look something like this. It's completely asynchronous. In this case, the client sets a value of five on the service and gets back an OK.
Then we get the value and get back an OK with the data being five. This would be non-blocking on the client side, and the transport layer here is completely irrelevant. One question you might be asking is when to use streams. We won't dive into this too much, but I did want to touch on it since we brought it up. Pigweed, for example, uses server streams for logging. In that example, the client connects to the EC, which is the server, and basically sends a request saying, "I'm interested in logs." At that point the service starts sending logs as they come in and never closes the connection; it's very similar to an open socket connection. The high-level theme is: use streams whenever data is generated with some latency and you don't know how much data there will be, rather than a simple repeated field in a message. Setting up Pigweed in your west.yml is as simple as adding the Pigweed repo; it's already fully compatible with Zephyr. We are making very active changes, so it's recommended to sync frequently and bear with us as we tweak Kconfigs and the like. When you want to set it up, the first thing you'll need to do is enable C++. Most of the modules in Pigweed require C++14; in this sample we enabled C++17 because it encompasses all the modules and we don't have to tweak it further. Next, find your Pigweed module and enable whichever Pigweed libraries you're interested in; in this example we have a few of them listed alphabetically (assert, base64, and so on). So let's visualize the data flow of an RPC. First, the application sends a request to a client. The client injects the IDs of the method and the service and passes the request over to the actual RPC client, which is a router in a sense. It adds this routing header and hands the packet over to the channel.
Sorry, rather, it serializes the RPC into a wire format, and then finally the channel output puts it into the right frame. Some examples: there's an HDLC channel output for writing HDLC frames, and there's another one we're going to talk about here, which I wrote specifically for this talk, to optimize the wire format when there's no actual wire and we're just writing to shared memory. So, moving away from headers. It might seem like a bit of a stretch, but the real reason is that protobufs give us a lot of flexibility. They make it easy to test our code because we can mock our clients or services. We can even use different languages: protobufs can be compiled to TypeScript, JavaScript, Python, and Java. This makes our test environment very flexible and might allow us to run scenarios that would not otherwise be possible. They're also much easier to extend if we want to support different versions of clients and services and have them still be able to talk to each other without updating both at the same time. Having the proto file itself really forces us to think about API boundaries: you can't just expose a global variable through an extern in a proto file. This keeps our interactions confined, which later contributes to the ability to migrate the code from one code base to another, or from one core to another. The protos themselves are also designed to be flexible and extensible. This is much better than a plain header, where you might later realize that you need to add a version field, or deprecate another field, and so on. If we were to write a simple set-and-get-value API in a header, it might look something like this.
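A hypothetical reconstruction of that header might look like the sketch below. The names and the stub bodies are my own, for illustration only; they are not the actual Chrome OS EC code.

```cpp
// Sketch of a plain-header set/get API with callbacks, as described in the
// talk. Empty request/response structs mirror the proto layout that follows.
#include <cstdint>

struct SetValueRequest { int32_t value; };
struct SetValueResponse {};  // empty, kept aligned with the protos
struct GetValueRequest {};   // empty
struct GetValueResponse { int32_t value; };

// Callbacks make the calls asynchronous in spirit, but the header says
// nothing about the wire format, the transport, which thread invokes the
// callback, or how to cancel an in-flight request.
using SetValueCallback = void (*)(const SetValueResponse*);
using GetValueCallback = void (*)(const GetValueResponse*);

namespace {
int32_t g_cached_value = 0;  // stand-in backing store so the sketch runs
}

// Stub definitions so the sketch is self-contained; a real implementation
// would marshal the request across some transport.
inline void SetValue(const SetValueRequest* req, SetValueCallback done) {
  g_cached_value = req->value;
  SetValueResponse rsp{};
  if (done != nullptr) done(&rsp);
}

inline void GetValue(const GetValueRequest*, GetValueCallback done) {
  GetValueResponse rsp{g_cached_value};
  if (done != nullptr) done(&rsp);
}
```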
We go ahead and include these empty structs for the set-value response and the get-value request just to keep things a little more aligned with the protos, but they're not technically necessary. The function calls pass in a request and take a callback in order to make this an asynchronous operation. The problem is that headers like this are hard to maintain. If we add new arguments, we have to make sure both ends stay compatible with each other. The API is also still missing a lot of features: it doesn't take into account the wire format, the transport, or how threading is done. We pass in a callback, but we have no control over which thread that callback will be called on, and there is zero mechanism for canceling requests. And finally, mocking it is entirely up to you when you want to test it. So for the example, we're going to replicate that header in a proto. The first thing we need is the request for setting the value, which uses an int32 value field with proto3 syntax. We don't need a return value, so we'll just have an empty message; that keeps things extensible if we ever want to return, say, the old value. GetValue doesn't pass anything in the request and gets back the value in the response. Finally, the service is two RPC calls, SetValue and GetValue. When we implement this service, we'll call it a cache, which seems fairly accurate for setting and getting a value. Our header simply has a private field, value, which holds the cached value. The .cpp implementation accepts the request and the response, caches the value in that private member on SetValue, and returns it on GetValue.
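Put together, the proto just described might be sketched like this; the package name and field numbering are my assumptions:

```proto
// Hypothetical key_value.proto mirroring the service described in the talk.
syntax = "proto3";

package example;

message SetValueRequest {
  int32 value = 1;
}
// Empty response keeps the API extensible, e.g. to later return the old value.
message SetValueResponse {}

message GetValueRequest {}

message GetValueResponse {
  int32 value = 1;
}

service KeyValue {
  rpc SetValue(SetValueRequest) returns (SetValueResponse);
  rpc GetValue(GetValueRequest) returns (GetValueResponse);
}
```

From a file like this, pw_rpc's code generator emits a service base class that the Cache implementation derives from, overriding SetValue and GetValue.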
So the final step is abstracting away the channel output, which handles the serialized wire format. It usually uses a writer from the pw_stream module, which is another abstraction layer for how to actually put the bytes onto the wire, and it allows us to switch how the service communicates with the client. The service implementation doesn't actually change; we just attach a different channel output to it, if you will. One example I mentioned before is HDLC, which has its own RPC channel output, and then there's the custom one we're going to talk about, the simple channel output. The reason for it is really just to show performance, to demonstrate that this is a feasible thing to do even for communication between threads on the same core. Both of them use the same writer in the example, writing to a ring buffer when the threads are talking within the same core. HDLC we've already covered. For local writes between threads, the simple channel output is extremely simplistic: it uses four bytes, a uint32 that tells us the frame size, followed by that number of payload bytes, with a wrapping struct that contains everything. All of that gets written to, and read from, a ring buffer we wrote. It's transactional, meaning we can write multiple small packets before releasing, which increases performance a little. It just uses a plain Zephyr ring buffer plus a mutex and a condition variable to signal between the threads. The main question that might be coming up now is: what is the actual performance, and how were the tests set up? We created two different threads: one is responsible for communication from the client to the service, and the other for the opposite direction, the service responding back to the client.
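To make that frame format concrete, here is a plain C++ sketch of the simple channel output's framing: a four-byte little-endian length prefix followed by the payload, written into an in-memory ring buffer. This is a single-threaded illustration of my own; the version described in the talk wraps a Zephyr ring buffer and synchronizes the two threads with a mutex and condition variable.

```cpp
// Minimal frame-over-ring-buffer sketch: [uint32 length][payload bytes].
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

class FrameRing {
 public:
  explicit FrameRing(std::size_t capacity) : buf_(capacity) {}

  // Write one frame: uint32 little-endian length header, then `len` bytes.
  bool WriteFrame(const uint8_t* data, uint32_t len) {
    if (buf_.size() - used_ < sizeof(uint32_t) + len) return false;  // full
    const uint8_t hdr[4] = {uint8_t(len), uint8_t(len >> 8),
                            uint8_t(len >> 16), uint8_t(len >> 24)};
    Put(hdr, sizeof(hdr));
    Put(data, len);
    return true;
  }

  // Read back one whole frame, or nullopt if no frame is pending.
  std::optional<std::vector<uint8_t>> ReadFrame() {
    if (used_ < sizeof(uint32_t)) return std::nullopt;
    uint8_t hdr[4];
    Get(hdr, sizeof(hdr));
    const uint32_t len = uint32_t(hdr[0]) | (uint32_t(hdr[1]) << 8) |
                         (uint32_t(hdr[2]) << 16) | (uint32_t(hdr[3]) << 24);
    std::vector<uint8_t> payload(len);
    Get(payload.data(), len);
    return payload;
  }

 private:
  void Put(const uint8_t* d, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      buf_[head_] = d[i];
      head_ = (head_ + 1) % buf_.size();
    }
    used_ += n;
  }
  void Get(uint8_t* d, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      d[i] = buf_[tail_];
      tail_ = (tail_ + 1) % buf_.size();
    }
    used_ -= n;
  }

  std::vector<uint8_t> buf_;
  std::size_t head_ = 0, tail_ = 0, used_ = 0;
};
```

A writer can batch several small frames before the reader drains them, which is the transactional property mentioned above.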
On the main thread, we ran 1,000 iterations of setting the data, waiting for the response, getting the data, waiting for the response, and then asserting that it was correct. As a baseline we used a control, which just writes to a ring buffer from one thread, has a fake service read from that ring buffer using a plain struct for the data that was passed, write back to the client, and then basically calls the getter and does the same thing. The experiment was to do the same thing with pw_rpc and our implementation of a service. We still write to a ring buffer, we still have the transactional properties, but it's all done through Pigweed's RPC pipeline that we showed earlier. One thing to keep in mind when we look at the numbers: the control is very much an oversimplification with no priority control and a lot of features missing that you might want in a more mature product, and some of the Pigweed RPC code paths were identified as bottlenecks that are still being optimized; it's not finished yet. The upside is that if you do choose to use this, you'll get those optimizations for free as development continues on Pigweed. The control took roughly 65.5 million nanoseconds in total, roughly 33 microseconds per call, and the experiment took 116 microseconds per call. That's quite a penalty, I know, but you get free upgrades as they come, and these should really be fairly infrequent calls. The next thing is looking at optimization options. We profiled this and looked at a call graph, which showed several very high-value points of improvement that are currently being explored. One is the RPC header that wraps around the data being sent: just serializing that header costs us 21% of the penalty.
The other is that for a unary RPC call we have a destructor that basically closes the call's connection; technically we shouldn't need it, or at least it can be minimized, and it uses 38% of the duration. If we were to optimize those two, that would bring the overall RPC cost much closer to the control, roughly 45 microseconds, or about 14 microseconds of overhead per call, which I think at that point is very much doable and agreeable. If you have any questions, please reach out to me; I'm available through Discord or email, or just tag me on GitHub if you have any issues with this, and I will dive in. There are a couple of appendix slides showing the call graphs we generated. This is the call graph for the client-to-service handler; zoomed in, you can see the benchmark starting at the top (the green one), then the call to process, where we create the span, and it keeps diving down into these separate calls. Among the interesting points to optimize that we talked about earlier: this is the call graph for processing the header, which includes the Pigweed serializer right there, and this is the destructor call graph for the other optimization we discussed. I hope you enjoyed the talk. Thank you very much.