Hi there, I'm Alex Semenos. I'm a software engineer at Google, and I work on the embedded controller (EC) for our Chromebooks. I'm here to discuss how we're using Pigweed's tokenizer in our embedded controller. First, I'll explain what Pigweed's tokenizer is. Then, since you may be wondering how it differs from Zephyr's dictionary logging, I'll give you a brief comparison between the two. Next, I'll discuss how it helps us. Then I'll show you how you can integrate it into your Zephyr project. Finally, I'll talk about future improvements we're planning for tokenizing within the EC.

So what is Pigweed's tokenizer? It's not a string parser like strtok. It's a compile-time tool that replaces string literals with 32-bit hash tokens. You may be asking yourself: why would one do this? There are several benefits. You reduce your binary size, since string literals are removed from the binary and replaced with 32-bit tokens. This also reduces I/O traffic, and RAM and flash usage. You can reduce CPU usage as well by replacing printf calls with simple tokenization code. Mainly, this offloads the string formatting from your board to another device that decodes the tokens.

How does this compare to Zephyr's dictionary logging? First, the string mapping is different. With Zephyr, the string's address is paired with the string. Pigweed's tokenizer instead uses a 32-bit hash generated from the string literal. With a hash algorithm, you may be concerned about collisions. With a 32-bit hash, you'd need to tokenize about 9,300 strings to have a 1% probability of a collision, and about 77,000 strings for a 50% chance. As Zephyr uses the address of the string, there's no chance of a collision at all. Zephyr uses a JSON dictionary as its database format; Pigweed supports CSV, binary, and directory-based file formats. Which of these formats you use depends on how you want to manage your database, which I'll discuss in later slides. As for database portability, the limitation with dictionary logging is that string addresses are not guaranteed to be stable between builds, so the database is only compatible with the build it was generated from. This is where we prefer Pigweed's hash implementation: as long as the string content stays the same, the hash algorithm generates the same token ID, so you can use the same database with any build. You can even merge databases from multiple boards and versions into a global database that works with any of your products.

As I mentioned on the previous slide, Pigweed's token database consists of a 32-bit hash-based token ID, the text string, and optionally a removal date, which is recorded when the string has been removed from the build. All of this is generated from the ELF file. There are three database formats supported by Pigweed. First is CSV, a comma-separated values file. This format is helpful for debugging purposes, as it's human-readable. Next is the binary format, which is more compact than the CSV database and saves you space. And last is the directory-based format: Pigweed can consume a whole directory of CSV databases, searching it recursively for files with Pigweed's tokenizer CSV suffix. This format is optimized for storage in a Git repository alongside the source code; the token database command randomly generates unique file names for the CSVs to prevent merge conflicts.
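To make that concrete, here's a minimal sketch of the source side, assuming Pigweed's pw_tokenizer module is available; the string literal and the constant name are just illustrative:

```cpp
#include <cstdint>

#include "pw_tokenizer/tokenize.h"

// At compile time, the literal is hashed into a 32-bit token. The literal
// itself is kept in an ELF section that is never flashed to the device;
// Pigweed's database tooling reads it from the ELF to create the database
// entry that maps the token back to the text.
constexpr uint32_t kBatteryLowToken = PW_TOKENIZE_STRING("Battery is low!");
```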
Pigweed also gives you the capability to update existing token databases. You can merge multiple databases into one, and you can append to an existing database to build up a historical database. When a string is removed from a build, the database keeps track of the removal date for later purging if needed. Pigweed offers APIs for CMake and GN builds to help you integrate token database creation into your build process.

All right, now that we understand what's in Pigweed's token database and how to create one, let's use that data to detokenize our strings. Pigweed's detokenizer API is currently available in three languages: Python, C/C++, and TypeScript. They have some examples available showing how to connect your board and view tokenized logging output. One is their web console, and it's pretty cool to be able to flash your device, open a web browser, and see your logging displayed in it. Pigweed also has a terminal-based system console tool, written in Python, for viewing your tokenized logs. And they have some examples in C/C++, which you can find in their documentation online.

So let's go over an example of tokenized logs to see the space-savings impact of tokenization. Here we have a logging statement with the battery state and the current voltage. The state argument is a string, and the voltage is an integer value. When you compile this with regular plain-text logging, the binary contains the format string, which takes up 41 bytes. The logging module expands the arguments to create the final string that it transmits to your terminal; as you can see here, the battery is charging and has a voltage of 3989 millivolts. 49 bytes are transmitted in this scenario. Now let's see how this changes when you enable tokenized logging. The source code contains the same logging statement, but with tokenized logging enabled, the format string is substituted with a hashed 32-bit token ID taking up four bytes. When the device transmits to your terminal, it sends the four-byte token, a nine-byte "charging" string argument, and a two-byte voltage integer value, for a total of 15 bytes over the wire. Both display the same logging message to the end user, but tokenized logging saves approximately 90% in binary size, reducing 41 bytes to four, and about 70% in encoded size for transmission, reducing 49 bytes to 15. Additionally, the expansion and processing of the log statement is offloaded from the board to the device running the detokenizing code, which can be helpful for time-critical applications.
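Here's a sketch of that example in code, assuming Pigweed's tokenization macros; the exact format string is my own stand-in, and the byte counts in the comments are the ones from the example above:

```cpp
#include <cstddef>
#include <cstdint>

#include "pw_tokenizer/tokenize.h"

void LogBatteryState(const char* state, int voltage_mv) {
  // Plain-text logging: the ~41-byte format string lives in the binary,
  // and the expanded message ("Battery state: charging, voltage: 3989 mV")
  // is ~49 bytes on the wire.
  //
  // Tokenized logging: the format string becomes a 4-byte token, and the
  // encoded payload (4-byte token + 9-byte "charging" string argument +
  // 2-byte varint-encoded voltage) is ~15 bytes on the wire.
  uint8_t buffer[32];
  size_t size = sizeof(buffer);
  PW_TOKENIZE_TO_BUFFER(buffer, &size, "Battery state: %s, voltage: %d mV",
                        state, voltage_mv);
  // `size` now holds the encoded length; transmit `buffer` to the host,
  // which detokenizes it back into the full message.
}
```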
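And on the decoding side, here's a minimal sketch of the C/C++ detokenizer API I mentioned earlier; the embedded database array is hypothetical, and the signatures follow Pigweed's documented interface as I understand it, so double-check the docs:

```cpp
#include <string>
#include <string_view>

#include "pw_tokenizer/detokenize.h"
#include "pw_tokenizer/token_database.h"

// Hypothetical: a binary-format token database embedded as a byte array,
// generated ahead of time by Pigweed's database tooling.
constexpr char kTokenDatabase[] = {/* database bytes elided */ 0};

pw::tokenizer::Detokenizer detokenizer(
    pw::tokenizer::TokenDatabase::Create<kTokenDatabase>());

std::string Decode(std::string_view encoded_message) {
  // Looks up the leading 32-bit token and expands the encoded arguments
  // back into the original, human-readable log text.
  return detokenizer.Detokenize(encoded_message).BestString();
}
```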
So how is tokenizing helping us? We're applying tokenizing to the logging module in the embedded controller. This reduced the EC image size by 14 kilobytes, a 6% reduction in image size, which will allow us to add more features in the future. With tokenization, we can also be more robust in our logging: I'll no longer need a magic decoder ring to parse the meaning of numbers or some shorthand debug statement.

Tokenizing your code is pretty simple to do with Pigweed. One area where you should take some time to decide is how to manage your token database. Do you want to keep it simple and pair a database with each device, or do you want the database to support all variants of your device? The trade-off here is the size of the token database and how much it will grow over time. For the Chromebook EC, we decided to use a global historical database to support all variants of our boards. Chromebooks generally have an eight-year support cycle, so we'll be able to use one database to support all our Chromebooks. With a global historical database, a developer can view logs from any device with ease. I worked at another company where I needed to find the correct database to load the logs, which added some friction to the debugging workflow; I think the approach we're taking with the EC will make the debugging flow smoother. To settle on this approach, we gathered some metrics on how large the database was going to be. For a single board on its own, it was around 40 kilobytes. Merging in the token databases of another 20 boards, it grew to around 66 kilobytes, which is approximately one kilobyte of growth per board.

You may be wondering now: how do I integrate tokenizing into my project? It's actually quite simple. You'll need to sync Pigweed's SDK, include Pigweed's Zephyr Kconfig, and then enable a few Kconfig options: C++ support for Zephyr, the Zephyr Pigweed module, and Pigweed's tokenized logging. You'll then need to update your CMake dependencies to create the token database, and then you're ready to kick off your compile and should be good to go.

A few other areas where we're planning to leverage tokenization in the EC are tokenizing some of our static strings, such as state names, to help capture state transitions and enumerations, so I can ditch my magic decoder ring. We're also looking to tokenize RPC logging to further offload log processing to another thread or device.

Thank you for attending my talk. I hope you found it informative and helpful. If you have any questions, please don't hesitate to reach out to me on Discord. Thanks again, and bye.