Hi everyone, and welcome to the Zephyr Developer Summit. My name is Yuval Peress, and I'm going to talk to you a little bit about high bandwidth sensors within Zephyr. My contact information is down there; if you need anything, please feel free to reach out.

So first of all, we're going to talk a little bit about what high bandwidth sensors are, and why anyone should even care about this. We'll then discuss the evolution of sensors in Zephyr, moving from a blocking API to an asynchronous one and what that enables us to do, and finally the new features, the streaming sensor APIs, that are coming soon.

High bandwidth sensors can be defined in a lot of different ways by a lot of different people. We could look at how many samples per second they produce, or how many bytes of data per second they produce, but really what we care about is any sensor whose data processing pipeline is a bottleneck for our application.

The existing APIs are effectively blocking calls. We first make a fetch call from the application to the driver. The driver then performs the bus I/O to read the data that was requested, and during this time the data is written into a buffer owned by the driver while the application thread is blocked on the fetch. Once the fetch is complete, we can read the data from the driver's cached buffer, process it, and do whatever the application actually needs with it.

There are a couple of problems with this. The first, obviously, is that the application is blocked the whole time. The second is that the data processing assumes the driver is locked. Once we've started processing data in that loop at the bottom of the application thread, we might read the first sample, or the first channel X, and then an interrupt fires and we switch thread context, at which point another thread may perform another fetch on the same driver. When we come back, the value of Y could be from a completely different sample. We have no guarantee that when we read multiple channels they all come from the exact same sample. This is really a side effect of the memory being owned by the driver.

Looking at what changes when we move to an asynchronous flow: we get non-blocking calls during the bus I/O thanks to the RTIO subsystem, and data processing no longer requires anything from the driver's cache. This works because the application now owns the memory, and that application-owned memory gives us a lot of flexibility. In this talk it will be backed by a mempool, and we'll dive into that. One other interesting bit about this chart: notice that the decoder is never blocked, and the decoder is 100% stateless. It is associated with the sensor type, meaning the compatible string, and it can technically be instantiated without a sensor even being connected. We'll dive into that a little later.

So how do we enable this? First we enable the sensor subsystem, and then we enable the experimental sensor async API. There are two modes of reading, and that's how we'll split this talk: one-shot data, which is effectively an asynchronous version of the existing APIs, and the alternative, streaming data, which we'll touch on in the second half of this talk.
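Concretely, that's two Kconfig options in the application's prj.conf. A minimal sketch, using the option names from the Zephyr tree (the async API is the experimental one just mentioned, and it builds on RTIO):

```
# Enable the sensor subsystem and the experimental async sensor API
CONFIG_SENSOR=y
CONFIG_SENSOR_ASYNC_API=y
```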
So the first thing we need to do for a one-shot sample is set up a reader, and we're using an RTIO iodev to do that. If you're not familiar with the RTIO subsystem, please read about it on the Zephyr documentation page; it is definitely worth your while, and it has improved our performance quite a bit when processing these sensor samples. This macro lets us statically create the reader from devicetree. You can see we're giving it a name, my_reader; we use the DT_CHOSEN node for the lid accelerometer in this example; and then we're only reading the accelerometer XYZ channel, but you can provide any number of channels to the iodev here: just add a comma and the next channel, and so on.

For the processing context, we're using RTIO with a mempool. We'll dive into that in a little bit, but overall the mempool gives us a lot of flexibility. The name is the first parameter, then there are two parameters for the submission queue and completion queue depths, followed by the mempool configuration: the number of blocks, the block size, and the alignment, which matters for DMA access if we need it.

So what is the mempool buying us? What are we actually getting from this? One of the biggest things is that now that the application owns the memory, we can delay the processing. The mempool is not strictly necessary: you could use a plain RTIO context and provide your own buffer for every read. The mempool just makes it a little easier and lets the memory allocation be delayed until the sensor is actually ready to do the fetch. We can also control how the memory is managed. Small blocks make sense for one-shot reads, where each reading is one frame, a single snapshot in time. If we're using a hardware FIFO or doing batch processing, larger blocks make a lot of sense. And if we want a mixed set of one-shot and streaming data, maybe we want a lot of small blocks that both can share.

So how do we queue the read? There's a very simple sensor API function that was added, sensor_read. You pass it a pointer to your reader, a pointer to your RTIO context, and finally optional user data, which may be NULL. In this case we provided the device pointer for the sensor behind the reader.

Processing the data happens shortly after. We provide a sensor processing helper, sensor_processing_with_callback. You can look at the implementation details in the source if you want to do custom processing, but this should provide most of what you need. It is a blocking call; it will trigger my callback when data is available, and it will automatically free the memory once processing is done. In this case we put it in the same thread as sensor_read in order to mimic the original blocking APIs, but in reality it doesn't have to be: you could have a processing thread that is nothing but a while loop calling sensor_processing_with_callback over and over again. The implementation is really: block and wait for the RTIO CQE, the completion queue event; once we get it, copy out the information we need from the CQE so we can release it right away, and releasing it lets the RTIO subsystem continue processing new events.
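Putting that together, here's a minimal sketch of the one-shot flow, assuming the 3.5-era async API as presented in this talk; the lid_accel chosen node and the queue/mempool sizes are illustrative:

```c
#include <zephyr/device.h>
#include <zephyr/drivers/sensor.h>
#include <zephyr/rtio/rtio.h>

/* One-shot reader built from devicetree; lid_accel stands in for
 * whatever chosen node your board defines. Add more channels after
 * SENSOR_CHAN_ACCEL_XYZ as needed.
 */
SENSOR_DT_READ_IODEV(my_reader, DT_CHOSEN(lid_accel), SENSOR_CHAN_ACCEL_XYZ);

/* 4-deep submission and completion queues, plus a mempool of 16 blocks
 * of 64 bytes, 4-byte aligned (alignment matters if the bus uses DMA).
 */
RTIO_DEFINE_WITH_MEMPOOL(my_ctx, 4, 4, 16, 64, 4);

static void my_callback(int result, uint8_t *buf, uint32_t buf_len,
			void *userdata)
{
	/* Decode 'buf' here; it is released back to the mempool when
	 * this callback returns.
	 */
}

int main(void)
{
	const struct device *accel = DEVICE_DT_GET(DT_CHOSEN(lid_accel));

	while (1) {
		/* Queue the read; the mempool block is not allocated
		 * until the driver is ready to fetch. User data is
		 * optional (may be NULL); here it's the device pointer.
		 */
		int rc = sensor_read(&my_reader, &my_ctx, (void *)accel);

		if (rc != 0) {
			continue;
		}

		/* Blocks on the completion queue event, invokes the
		 * callback, then frees the buffer. This could just as
		 * well live in a dedicated processing thread that loops
		 * on it forever.
		 */
		sensor_processing_with_callback(&my_ctx, my_callback);
	}

	return 0;
}
```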
Once we've released it, we can call the callback, because we have a copy of all the data we need, and when the callback returns, we release the read buffer back to the mempool.

A couple of improvements we're planning to add: we want more helpers to complement this processing-with-callback. For example, one could imagine a sensor_processing_blocking, which would effectively allow you to do the processing inline. The only thing you would lose there is that you would have to release the memory yourself once you're done processing. Next, we're looking to provide a set of common API tests to verify that all the decoders being added follow the decoding guidelines; for example, that the shift value (we'll get into shifts shortly) is the same for all axes of a similar measurement type. There are a couple of other guidelines for writing your own decoder as well.

This might sound a little complicated, and you might ask why we even need the decoder. Well, right now the data is stored in this RTIO mempool, generally in a raw byte format, whatever the sensor provided. This allows us to batch process all these samples on our own terms: maybe in a deferred thread, or maybe we even let the user trigger when the processing happens. And finally, it means we can get the decoder statically, because it has no state: it's associated with the compatible string, not with an actual instance of the device. Overall this means the processing doesn't even need to happen on the same core.

So here's a simplified view of what could happen. We get a buffer of new data and we have three options. We can save it, maybe to persistent storage, as a raw buffer; we can send the raw data over the wire; or we can process it in a separate thread, or the same thread, on the same core. In the first two options, when the other end, maybe a different core, maybe a different board entirely, is ready to process, it reads the data and gets an instance of the same decoder. As long as the two cores are running the same version of Zephyr, these are guaranteed to be compatible.

In addition to being able to get the decoder statically, we can also get it at runtime if we need to. In this case we have one handler for the callback that uses the user data, which is a device pointer, and you can see that sensor_get_decoder returns a pointer to the decoder. The cost of this is basically a system call if you have user threads enabled. Once you've got the decoder, though, you can start processing the data from the buffer. Here we have an example of getting the timestamp. Timestamps are in nanosecond units, although if your core doesn't provide nanosecond accuracy, you'll obviously see big steps between values.

The rest of the data that's actually in there, the channels, is maybe a little hard to visualize at first, so we'll take it in a couple of steps. In this example we have an accelerometer running at 200 Hz, with a gyro and the die temperature sampled at 50 Hz. What you can see is that the accelerometer effectively gets four samples for every one gyro and die-temperature sample. One frame here is a single snapshot in time, so going back, you can think of each row as a single frame.
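Here's a minimal sketch of that decoder lookup inside the processing callback, against the 3.5-era decoder API; the my_accel node label in the static variant is hypothetical:

```c
#include <zephyr/drivers/sensor.h>

/* Static variant: the decoder is tied to the compatible, not to a
 * device instance, so it can be fetched at build time.
 */
static const struct sensor_decoder_api *static_decoder =
	SENSOR_DECODER_DT_GET(DT_NODELABEL(my_accel));

/* Runtime variant, using the device pointer we passed as user data. */
static void my_callback(int result, uint8_t *buf, uint32_t buf_len,
			void *userdata)
{
	const struct device *dev = userdata;
	const struct sensor_decoder_api *api;
	uint64_t timestamp_ns;
	uint16_t frame_count;

	if (result < 0) {
		return; /* the read failed; nothing to decode */
	}

	/* Costs a system call when user threads are enabled. */
	sensor_get_decoder(dev, &api);

	/* Timestamps are reported in nanoseconds. */
	api->get_timestamp(buf, &timestamp_ns);
	api->get_frame_count(buf, &frame_count);
}
```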
The samples in a frame were collected together and are assumed to have happened at exactly the same time; obviously there's some wiggle room in that. As for frames: if we have a sensor sampling at 100 Hz and we collect over 20 milliseconds, we'll have two frames. It is important to note that for one-shot reads, only one frame will ever be present when we're decoding.

Two of the key arguments to the decoder are the frame iterator and the channel iterator, and both should be initialized to zero before decoding starts. I highly recommend using the curly-brace zero initializer for this, because while right now you could just assign zero, since they map to an int, there's no guarantee that will stay the case. This is an experimental API, and if we ever change them to structs, the braced initializer will remain forward compatible.

The return values of the decoder are relatively simple: negative values are errors, zero means we're done decoding the entire buffer, and anything greater than zero is the number of channels that were decoded. You can also see here how we can tell when a frame ended. We have the frame iterator and a previous frame iterator, both initialized to zero. We pass the frame iterator to decode and compare it afterward; notice that we're not doing a less-than or greater-than comparison, just checking whether it's different. If the frame iterator changed, a new frame started, and we update the previous value.

All the data being decoded comes out as q31 values. If you're not familiar with those, there's documentation on the zDSP subsystem; DSP stands for digital signal processing, and the goal of that subsystem is to provide a common front end while allowing each architecture to provide optimized implementations of the DSP library. A q31 is a fixed-point fractional value on a range of minus one to one: INT32_MAX represents one and INT32_MIN represents minus one. Along with the q31 data you have a shift value. A shift of zero does nothing; a shift of one doubles the range (it is a left shift); and a shift of negative one, meaning a right shift, reduces the range and effectively decreases the step size of the representation.

The shift value is provided by the decoder's get_shift, and the guarantee in the decoder contract is that all the samples in the same buffer will always have the same shift. That means if you have multiple frames of accelerometer X, the next frame will also have the same shift value; you can't have different shift values for different frames. Additionally, accelerometer X and accelerometer Y will always have the same shift value. This guarantee makes using the DSP library a lot more convenient, since you don't have to re-align all your data before you can do arithmetic with it.

Here's a calculation example, which we won't really dive into. It assumes we're getting accelerometer X, Y, and Z, and it calculates the magnitude of the vector: we take the square of each component using the zDSP q31 multiply, do a saturating sum of X, Y, and Z, and then take the square root of that value. Keep in mind the q31 square root has not been ported yet, but it is scheduled for next quarter, basically Q3 of 2023, so we'll have it in the DSP library soon.
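Here's a sketch of that decode loop and magnitude calculation, against the 3.5-era decoder contract. zdsp_mult_q31 and zdsp_add_q31 are the Zephyr DSP q31 multiply and saturating add; since the q31 square root isn't ported yet, the sketch falls back to floating point. It also assumes each decode call yields the X, Y, Z of one frame:

```c
#include <math.h>
#include <zephyr/drivers/sensor.h>
#include <zephyr/dsp/dsp.h>

static void process(const struct sensor_decoder_api *api, const uint8_t *buf)
{
	/* Brace-initialize the iterators: they map to ints today but may
	 * become structs, and {0} stays forward compatible.
	 */
	sensor_frame_iterator_t fit = {0}, prev_fit = {0};
	sensor_channel_iterator_t cit = {0};
	enum sensor_channel channels[3];
	q31_t values[3];
	int8_t shift;
	int rc;

	/* One shift applies to every accel sample in this buffer. */
	api->get_shift(buf, SENSOR_CHAN_ACCEL_XYZ, &shift);

	/* > 0: number of channels decoded; 0: buffer done; < 0: error. */
	while ((rc = api->decode(buf, &fit, &cit, channels, values, 3)) > 0) {
		if (fit != prev_fit) {
			/* Iterator changed (any difference, not ordering):
			 * a new frame started.
			 */
			prev_fit = fit;
		}

		/* Magnitude of the accel vector for this frame. */
		q31_t sq[3], sum;

		zdsp_mult_q31(values, values, sq, 3);  /* x^2, y^2, z^2 */
		zdsp_add_q31(&sq[0], &sq[1], &sum, 1); /* saturating adds */
		zdsp_add_q31(&sum, &sq[2], &sum, 1);

		/* q31 with shift s represents q / 2^31 * 2^s; the squares
		 * and their sum carry twice the shift, hence 2*shift - 31.
		 */
		double mag = sqrt(ldexp((double)sum, 2 * shift - 31));

		ARG_UNUSED(mag);
	}
}
```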
Diving into the streaming data next: this is a brand new feature that we're also adding in Zephyr 3.5. There are a bunch of different things that can happen with streaming. One is events: a step event, significant motion, tap events, things that may or may not be correlated with specific data. A tap event just happened; there's no data necessarily associated with it. Others might include data, such as the FIFO watermark: the hardware FIFO does contain data, and we might want to know both that the watermark fired and what the associated data is.

Setting up the stream reader is very similar to setting up the one-shot reader. We give it a name and the devicetree node we want to use, and then instead of the channels we use SENSOR_STREAM_TRIGGER_PREP. This takes the trigger type, which is basically the sensor trigger type, and a second argument that says what to do with the data once you've got the trigger. The options are: include the data, meaning copy it into the mempool buffer associated with the request; drop the data, so in the case of a FIFO watermark we flush the entire FIFO and get rid of it, meaning only the event is reported to the client; and no-op, meaning just tell me the trigger happened, don't do anything with the data, and leave it on the device, maybe because we want to do something with it on our own.

Starting the stream is as simple as calling sensor_stream, just like sensor_read. The only difference is an optional handle pointer that we can pass in, which is used for canceling or stopping the stream.

So we've already talked about decoders, and now we've added these triggers. The trigger is part of the buffer header, meaning there are no frames of triggers; they're always associated with the first frame that comes to you. In this example, we batch process up to five triggers at a time: we find out how many triggers fired and print them out. If the number of triggers is negative or zero, we step out, and if fewer than five were read, we try reading again.
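Here's a minimal sketch of that stream setup, assuming the same 3.5-era API; the my_imu node label and the queue/mempool sizes are illustrative:

```c
#include <zephyr/drivers/sensor.h>
#include <zephyr/rtio/rtio.h>

/* Stream reader: wake on the FIFO-watermark trigger and copy the FIFO
 * contents into the request's mempool buffer.
 */
SENSOR_DT_STREAM_IODEV(my_stream, DT_NODELABEL(my_imu),
		       SENSOR_STREAM_TRIGGER_PREP(SENSOR_TRIG_FIFO_WATERMARK,
						  SENSOR_STREAM_DATA_INCLUDE));

/* Larger blocks than the one-shot case, since a watermark delivers a
 * whole batch of frames at once.
 */
RTIO_DEFINE_WITH_MEMPOOL(stream_ctx, 4, 4, 8, 256, 4);

static struct rtio_sqe *stream_handle;

static int start_stream(const struct device *imu)
{
	/* Same shape as sensor_read(), plus the optional handle that can
	 * later be used to cancel or stop the stream.
	 */
	return sensor_stream(&my_stream, &stream_ctx, (void *)imu,
			     &stream_handle);
}
```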
Stepping back, the overall summary of where we're going with sensors: we're using RTIO, and specifically mempools, to get some control over memory granularity. We're moving the interrupt processing out of the sensor drivers; previously, when a sensor got an interrupt, the driver either owned its own thread or passed everything to the system work queue. Owning a thread wastes a lot of RAM, and the system work queue gives us a very uncontrolled environment for timing. And the one-shot and streaming data paths are effectively the same now, so you don't have to worry about understanding how the data is going to get to you. You have the processing thread, you give it the callback, and regardless of whether it was a one-shot read or a streaming event, everything bubbles up into the same processing function, the same completion queue events. This also means we have finer control over what to do with a trigger when it's detected; the current APIs don't allow us to pass that information, so when a trigger happens, it's up to the driver implementation to decide what to do.

If you have any questions, please reach out. My contact info was on the first slide, and I'll be happy to answer and address anything you might have. Thanks. Take care.