All right, welcome everybody. This is Debugging Android Devices in the Field. I wanted to start off by saying that there are lots of helpful links throughout the slideshow, so I've included a short URL and QR code that take you to the actual presentation, where you can access all the links. I'll give a few seconds to anyone who wants to take a quick picture of that or scan the QR code.

Cool. So, a little bit about myself. My name is Chris Hayes. Most recently I've been working at Memfault, and for the past nine years I worked at Square on basically everything Android. I started out in generic app development, working on their point-of-sale apps, doing UI tests, feature development, things of that nature, and then quickly moved into working on their build systems, where I optimized their Gradle builds and also worked on transitioning from Buck to Bazel. And then for the last three and a half to four years of my time there, I worked on their hardened Android operating system, which they called Squid, for Square Android. Now I'm working for Memfault as one of their Android solutions engineers, working on their SDK and helping customers integrate it and add new features to their device observability. Memfault is a global company, with offices in San Francisco, Boston, and Berlin. I actually live in Colorado and work full-time remote from there.

So today we're going to do a diagnostic overview of the latest tools in AOSP, and we're going to go through three different areas. We'll look at logging infrastructure within Android, then the diagnostic tools available to augment it and make it easier to understand what's going on with those devices. And finally, we'll talk about how you can remotely observe your devices in the field. To get started, we'll begin with the basics: logcat and the kernel message buffer (kmsg). Logcat is a very straightforward logging infrastructure.
It's mainly aimed at the Android runtime and has five different buffers that you can access, for the main, system, crash, radio, and event logs. If you've done Android app development, you're probably very familiar with logcat already; it's one of the first things you learn. If you're just getting into AOSP development, then you also have access to the kernel logs, which are found at the /proc/kmsg device, and you can use dmesg to read those logs at the command line of the Android device. dmesg provides colored output and filtering options to make it easier to understand what's happening with the device.

A lot of people are familiar with Jake Wharton's pidcat tool for colorizing Android logcat output, but working on AOSP, I found lnav to be a much better tool. lnav is a generic log navigator binary that you can use on the CLI. It supports log interleaving, so you can combine both the logcat logs and the kernel logs into a single view, with the timestamps interleaved, and see what's happening with the system as a whole without jumping back and forth between kernel and logcat logs. When I was working at Square, we had a product that was two discrete Android devices, which meant there were actually four sets of logs. So what I ended up doing was writing a script to prepend the different device IDs to each log line, and then I was able to see not only the logcat and kernel logs from a specific device, but from both devices interleaved with each other, to see how they were communicating back and forth. This made it super easy to understand when there were network communication errors between the two devices, or to understand state that was being relayed from one device to the other. lnav also supports syntax highlighting and custom regex highlighting.
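That device-ID trick can be sketched in a few lines of shell. This is an illustrative reconstruction, not the actual Square script; the function name and labels are made up.

```shell
# Sketch of the device-ID trick described above: prefix every log line
# with a device label so logs from two devices can be interleaved in a
# single lnav session. Function name and labels are illustrative.
prefix_logs() {
  label="$1"
  # Read log lines on stdin, emit "label | line" on stdout.
  sed "s/^/${label} | /"
}

# With real hardware you might pipe `adb -s SERIAL logcat` through this;
# here is a static demonstration:
printf 'ServiceManager: service connected\n' | prefix_logs hub
# → hub | ServiceManager: service connected
```

Both prefixed log files can then be opened together in lnav, as described above.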
So if there's some specific message you're looking for, maybe while debugging an issue in the runtime, you can easily create a regex for it, and when you're scrolling through the logs it'll be nicely highlighted. And finally, it also pretty-prints structured data for you. One thing I want to mention is that out of the box, lnav does not support Android logcat logs. But at the base of the slide you'll see a link to my GitHub repo, where I've provided a custom Android logcat schema that you can install into lnav to make it work correctly.

Next, we're going to talk about the different types of crash data you can get from a device. Tombstones are very detailed crash reports. They're generally logged to /data/tombstones, and they give you the stack traces, a memory map, open files, et cetera, of the process that crashed. Next, we have kernel panics: a fatal exception happening in the Linux kernel. If you're already familiar with Linux, this is pretty straightforward. A panic often leaves a tombstone file, so you can use that information to understand what caused it. There are also kernel oopses. They're generally serious exceptions, but they're not always fatal; if they do end up being fatal, they're usually followed by a kernel panic, so you get both sets of information in that case.

Further up, in the Android runtime, you have ANRs, or Application Not Responding errors. These happen any time your application blocks the main thread. The reason this is important is that if you're blocking the main thread, the Android runtime is unable to accept user input, and therefore your user believes the device is frozen and not responding. Generally this happens because you're doing network or file I/O on the main thread.
All of that work should be moved to a background thread.

Next, we have WTFs, or "What a Terrible Failure" according to the AOSP documentation, though I think we all really know what it stands for. These are used for situations where something really shouldn't be possible, and they show up in logcat at the WTF verbosity level. They can be fatal, but they aren't always. There are several places where AOSP has put WTFs into its code base, and sometimes they can be ignored, while other times they're indicative of a real problem. Java exceptions, again, are pretty straightforward: you get a stack trace of the exception, and you can step through each frame of that stack to understand what was happening.

Finally, there are SELinux policy violations. SELinux defines the permissions for processes, applications, files, and other resources, and it's one of the mechanisms Android uses to create a secure runtime environment. In userdebug and engineering builds of AOSP, SELinux is generally set to permissive, so it will only log the denial but still allow usage of the resource. In a user build, it defaults to enforcing, in which case a denial can trigger a crash if your application isn't expecting it. So it's generally a good idea to keep track of those denials during your development phase and address them as you see them.

From there, we have different ways of getting that information off the device. First is the Android Debug Bridge (adb). This allows you to connect to your device over USB or the network and run commands, pull logs, install apps, reboot the device, and push and pull files. There's a lot more it can do, but those are generally the most common uses, and it's a very powerful tool for interacting with the device. Next, we have bug reports. These are generally huge dumps of information.
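As a rough sketch, the adb workflows mentioned above (including the SELinux denial check) might look like the following. The function names are my own, everything is wrapped in functions so nothing runs without a device attached, and some of these commands may need root or a userdebug build.

```shell
# Illustrative wrappers around standard adb commands discussed above.
# Function names are made up; nothing runs until a function is called.

pull_crash_artifacts() {
  # Tombstones live under /data/tombstones (reading them usually
  # requires root or a userdebug build).
  adb pull /data/tombstones ./tombstones/
}

dump_all_logs() {
  # Dump (rather than follow) the main, system, and crash buffers,
  # plus the kernel log, into local files.
  adb logcat -d -b main,system,crash > logcat.txt
  adb shell dmesg > kmsg.txt
}

check_selinux_denials() {
  # SELinux denials appear as "avc: denied" lines in the kernel log.
  adb shell dmesg | grep 'avc:.*denied'
}
```

With a device attached, `dump_all_logs` produces exactly the two files you would feed into lnav for the interleaved view described earlier.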
Bug reports trigger dumpsys, which iterates through every single running service on the device, calls its dump handler, and gives you a wealth of information for every service running. It can be anything from network usage to the state of the Wi-Fi and cellular chips to any custom state within your own services. It really has everything in it. The caveat is that since it's grabbing so much information, it's very taxing on the system and can cause the device to appear frozen or just very unresponsive in general. So it's not something you want to trigger on a regular cadence; it should really be saved for when you have a device in front of you and you're trying to understand everything that's going on. There are also ways to have a user trigger it from the settings application of your device if you're using more stock Android, or there are callbacks you can use to trigger a bug report if you're writing your own settings application.

Next is DropBoxManager. This is basically the API you can use to grab a lot of the different types of crash reports we just talked about. It allows you to retrieve those files and either store them to, say, external disk, or send them to a remote backend.

As we transition into diagnostics, the next thing we have is Batterystats and Battery Historian. Batterystats collects very detailed information about anything that is drawing power on the device, and logs things like wake locks, file I/O, network usage, Bluetooth state, Wi-Fi state, and cellular, all of the different things that can be very taxing on the battery and pulling a lot of power. Battery Historian is a web app that Google provides, which you can run locally on your laptop.
It consumes a bug report and gives you extremely detailed graphs and timelines from it, to help you understand what was actually happening at the time of, say, a high power spike.

Next we have performance tracing and monitoring. Perfetto is a tool that Google provides for recording traces of the device under test. It captures high-frequency ftrace data, scheduling, task switching, CPU frequency, and honestly so much more. There is a wealth of information Perfetto can capture, and it can be pretty overwhelming the first time you use it. So generally what I tell people is: make sure you have a well-defined issue you're trying to understand, find it in the timeline first, and then start exploring the data from there, because you can very easily get lost in the amount of information it provides. And in terms of monitoring, a colleague of mine at Square, Pierre-Yves Ricau, wrote a library called LeakCanary that you can include in your Android apps, which detects memory leaks in a Java application or service. It has a built-in heap analyzer, and it allows uploading those heap analyses to third-party services.

So what is the common problem with all the existing tools and information? They're all great if you have the device right in front of you. But if the device is in the field, you still need a way to get that information back to yourself for analysis. And this is an area where Memfault can step in and help. As a quick overview, this is your general release cycle, if you will: you start at the design phase, write your code, test, and then do your release, your actual deploy to a device. You then observe it, analyze the data, and provide feedback to your design and engineering teams.
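For reference, the bug report, Battery Historian, and Perfetto workflows just discussed can be sketched like this. Again these are wrapped in functions so nothing executes without a device or Docker available; the image tag, trace duration, and event list are illustrative.

```shell
# Illustrative sketches of the workflows described above. Wrapped in
# functions so nothing runs without a device or Docker available.

capture_bugreport() {
  # Expensive on-device; trigger sparingly, as noted above.
  adb bugreport bugreport.zip
}

run_battery_historian() {
  # Google publishes Battery Historian as a Docker image serving a web
  # UI on the given port; you upload the bug report zip there.
  docker run -p 9999:9999 gcr.io/android-battery-historian/stable:3.0 --port 9999
}

record_perfetto_trace() {
  # Record ~10 s of scheduling, CPU-frequency, and idle ftrace data.
  # This is a tiny subset of what Perfetto can capture.
  adb shell perfetto -o /data/misc/perfetto-traces/trace.pftrace \
    -t 10s sched freq idle
  adb pull /data/misc/perfetto-traces/trace.pftrace
}
```

The pulled `trace.pftrace` file can then be opened in the Perfetto UI to find the timeline region of interest first, as suggested above.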
Our goal is that after you integrate the Bort SDK one time, then from the next release cycle onward, you get a lot of this information back to your engineers to improve the stability of your devices. We focus on the right-hand side of this cycle: the release process, actual OTA updates, device and fleet monitoring, and analysis of that data.

I want to give a quick overview of the Memfault SDK for Android. The heart of it is the Memfault Bort app. It connects and communicates with the Memfault OTA update agent, which facilitates and schedules OTA updates, either full or differential. It also connects to the Memfault UsageReporter app, which collects all the different types of log files and crashes we talked about earlier. It also has the capability of triggering bug reports if needed, but again, as I said before, that's not something you want to do on a regular basis.

In a moment, we're going to go through a live demo where we take a look at the fleet dashboard, individual devices, and some of the issue tracking. Let me make that a little bit bigger for you all. From here, you can see the number of active devices over time. This is useful for understanding whether devices are actually communicating with the internet. I find it really useful that this data comes to our backend rather than your own, so that if you do have an issue with your own services, you have a second reference point for how many devices are actually online. So you can see, okay, Memfault says there are 28 devices currently active, but I only see 15 checking in on a daily basis, and you can start asking your server team why that's happening. You also get a chart of the rate at which software is being picked up over time across different versions, so you can make sure your actual release is being adopted by your users.
If you have automatic updates set up on your device, this is helpful for verifying that that process is working correctly. If updates need to be manually triggered by a user, then this can help your team find ways to incentivize your users to update more quickly. Then we have an overall breakdown of the different versions in the field. That's super helpful if, say, there's an endpoint on the server side you want to deprecate and you want to know how many users would actually be affected if you took it down.

We also track the number of traces coming in each day and what the top issues are. Traces can be really any sort of, I call them, negative events. They're generally some sort of crash, whether it's a Java exception, a native exception, a kernel panic, a WTF, anything along those lines. And then you can see the actual top five issues affecting your devices in the field right now. Next, we track all the different types of reboots, so you can differentiate between a reboot caused by an OTA versus a user-initiated reboot, or, say, a kernel panic where the device was forced to reboot. I really like having the number of user-initiated reboots here, because sometimes your devices can get into a state where they're not necessarily crashing, but they may still be misbehaving without creating any type of trace events that would be easy to track. If you see a big spike in the number of user-initiated reboots, that can be a signal that something is going wrong with your device. And finally, we have the newest issues on a specific device, so if you're doing a rollout, you can see if you just recently introduced something new to your devices in the field.

Next, I wanted to show some of the metrics collection. Metrics can be any sort of data that is important to your organization.
We have a lot of things already instrumented out of the box, and one of the powerful abilities of this system is comparing different versions over time. So say we set up something like this, where we can see the battery discharge rate was in a pretty nominal, low state, but then we released version 1.1 and there was a huge spike here. Our engineers went hard at work to figure out what regressed and what caused it, and we were able to validate that their fix actually resolved the issue in the next release.

And then finally, if we go into an actual device view, we'll go into this one right here, we can get very detailed information about the device. You can see when it was first seen by Memfault, as well as when it was last seen, so you can tell how often it's checking in, and what cohort it's in. You can group devices into alpha, beta, and production cohorts, or whatever is meaningful to your organization. You can also add custom notes about it; if this is, say, a device in your development lab, you can annotate that.

Then, when we come down here, we get into the timeline view. This is an example of a lot of the battery metrics that we've instrumented at this point. You can see trends in the data, where a device was plugged in and the battery started to charge again. You can see what was running on the device, whether it was dozing, whether GPS was on, how strong that signal was, how many jobs were actually running, the screen brightness, everything you want to know. And if you find an area of interest, you can zoom into the data to understand more clearly what was going on. When we, nope, we don't have traces here; I'll go to another device in a second and show that. Within the attributes, this is data that generally doesn't change all that often. These are usually things about your device that you want to know at a high level.
It's not something like the current battery level, for instance; that's not important to understand as an attribute of the device. But you may want to log what color the device is, or what locale it's locked to. We also support linking multiple devices. As I was saying with Square, when we had a product that was actually two different devices, you can jump between them via this links attribute here, and when you do that, you can look at the timelines in parallel and see what's happening on both devices at the same time. We collect all of the different log files for you, and you can do either continuous logging, if you have a strong, reliable network connection, or have it only grab logs periodically, for instance all the logs leading up to a crash of the system. And again, we log each of the reboots here, so you can see when they actually happened and jump to them in the actual timeline view.

And that's going to jump into, this is actually a project for MCU devices. We do support Linux, MCU, and Android, but I wanted to do a quick overview of the traces. If we look at a specific trace event, you get all of the threads associated with it and the log messages that were logged on the device.

If you want to learn anything more about AOSP tools and Memfault, we have a couple of different webinars available. I also wanted to mention the AOSP and AAOS Meetup group. It's led by Chris Simmonds; he's actually right there in the middle of the crowd. It's very remote-friendly, and it covers all things AOSP and AAOS. And thank you. I'll leave this slide up again if anyone still needs to grab that. Do we have any questions?

So, you can only use Memfault if you have the ability to update it on the device itself, right? But what about just tracking the app crashes themselves, if you don't have that ability on the device?
So you're talking about tracking just application-level crashes? Yeah. So honestly, if you're looking at only application-level crashes, I would look at one of the other services, such as Bugsnag or Crashlytics. They're geared more towards one specific app's crashes and understanding what's happening there, whereas our goal is really to instrument the entire system and see it as a whole. So honestly, their services are very good at tracking a specific app. Thanks. Yeah.

Yeah, thanks. One technical question: is Memfault living as a system privileged app, or how is it implemented in the AOSP tree or on the device? Yeah, so if I go back here. Yeah. So it is a privileged app on the device. You would add the Memfault SDK to your repo manifest, pulled into probably the vendor/memfault part of the tree. From there, it's basically a simple wiring of Android makefiles to pull in the compiled apps, and we set up all of the SELinux policies in our SDK. So there's really very little that you need to do. Honestly, the first time I implemented the SDK, it took me about five minutes to wire up the actual repo and configure some API keys and things like that, and then the build took 15 or 20 minutes. So the build took significantly longer than it took me to actually wire everything up. Correct. Yes. And you can see here that the Memfault UsageReporter, for instance, is a system app on the device, but the update agent and the Bort application are both only privileged services. And we're actually in the middle of working on something that's potentially installable on a generic Android device without wiring into the actual Android build system, but we're still working out some of the kinks with that.

Hi, thanks for the talk. Just to clarify then, this is not open source, but is there a free version we can try out? Yeah.
So the SDKs themselves are all open source, but the backend and the web services you saw me demoing are, you're correct, not open source. There is a free trial version that you can use for up to 10 devices. I should have included a picture of that in my slide deck, but I can find that information for you and add it to the slide deck online after the fact.

Thanks for the talk. Can you tell us a little bit about the update service and how it works? It seems to be a parallel thing to what you demonstrated, right? Yeah, it's parallel, but we consider it part of our core offering, because we really want to enable the whole right-hand side of this graphic here, and we want to be able to understand the differences between two versions. So here, the Memfault Bort application communicates with the update agent to basically identify what cohort a device is in and send that information up to the server, and then the server determines which specific payload to provide. This can be useful if you have alpha, beta, and production groups: a device may be part of the alpha group and therefore get bleeding-edge software. Another option is cohorts set up by the specific feature sets a device has, so you can distribute different software depending on that device. We use the native Android update_engine in AOSP and provide it basically the tarball of images to be flashed to each of the A/B partitions. So the update agent is really focused on the grouping of devices, understanding what actual software should be installed on them, and giving feedback on the state of the update.

Any last questions? Oh, thank you.