Hello, it's great to be here at FluentCon Europe 2022. My name is Mickey Pashov, and I'll be presenting the challenges of logging on the edge.

Let's start with a little bit about me. I'm based in London. First and foremost, I'm a back-end Python engineer, but my curiosity has led me into the world of DevOps and cloud-native technologies. I've been working at Medtronic Digital Surgery for the last 18 months.

To begin, I'll cover Medtronic Digital Surgery and what it does. I'll get into our current observability stack, then talk about the challenges we faced while migrating to a centralized logging solution. I'll mention some tips and tricks along the way, and I'll finish with future considerations.

At Medtronic Digital Surgery, we're on a mission to improve surgical care by reducing variability in surgical performance. We approach this by building an ecosystem of products that addresses education and training, surgical data collection, and data analytics. We have a web platform that provides seamless access to surgical video pre- and post-operatively, along with data-driven insights for qualitative and quantitative measurement of surgical performance, and we use AI-driven technology to improve surgical practice.

One of these products is Touch Surgery Enterprise, the video management platform that connects the cloud to the OR. That's where the DS1 fits in. It's an edge device that records surgical video from keyhole surgeries. It uses real-time computer vision to safeguard videos, and we're actively working on algorithms to segment anatomical structures and detect surgical tools inside the body. All of this runs on NVIDIA's platform, along with custom-built hardware and a custom operating system built with the Yocto Project for embedded Linux.

I want to point out that the edge isn't quite the same as on-prem or the cloud. It's a distributed computation model where computation happens closer to the source of the data. For us, this means lower latency for insights; reduced cost, since we can run AI algorithms on the device; and offline operation, which means surgeons can benefit without needing internet access. Most importantly, it means privacy and security: because we process data locally, we can ensure patient anonymity.

Let's cover the observability stack. Despite our device being on the cutting edge, our observability stack felt a bit lacking, and as the device entered more hospitals and our team grew, we started experiencing friction. We had invented our own log-based events and metrics over MQTT, which we routed to tools like Honeycomb and our data warehouse. But for troubleshooting and identifying root causes, we needed logs.

Because devices had to be able to function offline, we had designed a fairly simple yet effective system. We collected arbitrary logs from our running systemd services, as well as OS logs. In terms of forwarding, we did the following. We had a timer that, every 24 hours, would upload archived journald logs to S3, with a timestamp as the file name, assuming, of course, that a connection was available. We also gave our users the ability to initiate a log upload via the mobile user interface. In that case, the device would check whether internet connectivity was available, and users would be informed if the upload failed. In hindsight, I think this was a brilliant way of solving the problem of getting logs off the device, and in fact, that's what we're still doing today.
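To make that concrete, here's a hypothetical sketch of the kind of systemd timer and oneshot service that could drive such a daily upload. The unit names and the script path are illustrative, not our actual implementation:

```
# daily-log-upload.timer (hypothetical name)
[Unit]
Description=Upload archived journald logs to S3 once a day

[Timer]
OnBootSec=15min
OnUnitActiveSec=24h

[Install]
WantedBy=timers.target

# daily-log-upload.service (hypothetical name)
# The script (not shown) would archive journalctl output and copy it to S3
# with a timestamp as the file name, skipping the upload when offline.
[Unit]
Description=Archive journald logs and upload them to S3

[Service]
Type=oneshot
ExecStart=/usr/local/bin/upload-journal-logs.sh
```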
However, it failed to satisfy a few requirements. The support team and developers had to manually look through S3 for the right date and time, often guessing whether a file contained anything relevant based on its size. Searching across device IDs within a given hospital site was basically impossible, and everyone used their own tools, whether that was grep, Notepad++, or lnav, which I actually recommend. Furthermore, for local development you still had to SSH onto the device to tail the logs, and QAs had to share logs for their tests on Slack so they could be reviewed.

We needed a proper approach to log management. The basic use case was the ability to stream logs to a backend such as CloudWatch, Elasticsearch, or Datadog, and for that, we needed an agent.

My first thought was to use Fluent Bit with our existing logging backend. After all, that's what I had done previously for our Kubernetes cluster. However, getting Fluent Bit deployed on an edge device isn't quite as simple as helm install. I was faced with a completely new toolchain: things like the Yocto Project, OpenEmbedded, and BitBake. I tried my hand at it, but introducing a new dependency meant setting up a new BitBake recipe (whatever that meant), going through an OS verification checklist and various other compliance-related tasks, and figuring out cross-compilation, since we compile everything from source during our OS build for security reasons. And I should mention that our release cadence is three months.

Another problem was that we already had an agent installed: syslog-ng. Before introducing a new agent, I had to try the existing one. But I found syslog-ng's documentation a bit opaque, the configuration was not intuitive to me, and it didn't have CloudWatch support or an HTTP plugin out of the box.

So I changed tack and decided to use a vendor product, since we had already accepted some upfront cost. The Datadog solution attracted me. We were in the market for a better user experience on the platform side, and Datadog promised an agent for metrics and logs, with archival storage to S3, all with enterprise support. However, we struggled to cross-compile the agent for our architecture. I just wanted to get the agent working, so, purely for development purposes, I downloaded the appropriate Debian package from their APT repo and got it running on my device. That's when the fun began. It worked well out of the box, tailing files and journald logs. The documentation felt a bit sparse, however, and it seemed focused on metrics, which are very useful, of course. Routing and filtering happened cloud-side, and I felt that would lock us in. Still, it was a good learning experience, as it helped us identify what we actually wanted from a centralized logging solution.

We came up with these takeaways. First, offline mode. Our devices were deployed in suboptimal network environments, and often there were no Ethernet cables in the operating room, which meant logs would be far older than 24 hours. Second, cross-compilation of the agent needed to be as simple as possible; we're a small team, and we didn't have time to rework our OS build pipeline. Third, we realized that installing a vendor's agent would limit us in the future, especially with our release cadence of three months and manual OS updates. Fourth, based on our security model, we wanted API keys per device, but Datadog, along with most other vendors, has a limit of about 50.
In hindsight, I think a single key could be used for a whole fleet, as long as the traffic is encrypted. Finally, data export. We wanted the data for long-term auditing purposes, and we wanted our data science team to be able to access it. Datadog provided S3 archival, but you had to conform to their folder and file structure.

And so, it felt like it was time for Fluent Bit again. It was built for logs first and has a thriving community. It's vendor agnostic, so it can ship logs to CloudWatch, Datadog, S3, and many other destinations we might consider. It's highly performant, as it has its roots in embedded devices. And now that I had learned a thing or two about cross-compilation, I could appreciate its zero-dependency approach. It came with a BitBake recipe out of the box and a good description of the supported platforms. When it came to OpenEmbedded, though, only a very old version was available, so you couldn't just use it out of the box; you had to use the BitBake recipe I mentioned earlier. A tip I have is to look at the packaging folder inside the Fluent Bit repository. That's where you'll find the Dockerfiles for building the various Linux images, and you can use them for inspiration or to learn how to do things better. Another resource, of course, is the docs; just look for the build options.

Let's talk about offline mode. It's worth distinguishing between a lossy or noisy network and no network at all. In hospitals you certainly get both, but it was the latter we were more interested in: we would have systems offline for weeks at a time. To solve this, we had several options available. We could continue asking users to upload logs in an ad hoc manner. We could set up a large enough file buffer to capture some of the logs, although log rotation would eventually lead to data loss. And we could set up a memory buffer, and only a memory buffer, so that the input plugins would be paused once it was full.

I'd also like to mention a few config options you have at your disposal for when you have a lossy network. There is the retry_limit parameter, available for output plugins. You specify it as part of your output section: give it an integer, and it tells the scheduler to retry that many times. You can also set it to no_retries to disable retries entirely, or to false, which means unlimited retries. If you want to control the time between successive retries, you can play with the scheduler.base and scheduler.cap parameters in the service section. Base is the number used for the exponential backoff, in seconds, and cap is the upper limit for the backoff. By default, base is set to 5 seconds and cap to 2,000 seconds, which is about 33 minutes. If I'm correct, that gives you about ten retries before you start approaching the upper limit, but bear in mind there's jitter, so you'll never actually hit it. There was a slight bug in the math of the algorithm in versions older than 1.8.7, so just be aware of that.

Our devices are not static. They're turned on and off by users at will, so we needed a way to ensure the last logs were flushed, and that's where the grace period comes in. You set grace in the service section, and it's the wait time, in seconds, before the process exits. However, the current behavior is that task retries are not handled once the engine is shutting down; as of this talk, there's an open PR to address this issue.
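To pull these options together, here's a minimal classic-mode sketch of the kind of agent config this implies: read the systemd journal and ship it to a backend like CloudWatch, with the retry, backoff, and grace settings just mentioned. The region, log group, stream prefix, and buffer limit are illustrative values, not our production configuration:

```
[SERVICE]
    flush          5
    # Wait up to 30 seconds on shutdown so pending chunks can be flushed
    grace          30
    # Exponential backoff between retries: base of 5s, capped at 2000s
    # (these happen to be the defaults)
    scheduler.base 5
    scheduler.cap  2000

# Collect logs from the systemd journal
[INPUT]
    name          systemd
    tag           host.journal
    # Cap this input's memory buffer; the input pauses when it fills up
    mem_buf_limit 8M

# Ship to CloudWatch Logs; group and prefix below are made-up values
[OUTPUT]
    name              cloudwatch_logs
    match             host.*
    region            eu-west-1
    log_group_name    /edge/device-logs
    log_stream_prefix device-
    auto_create_group On
    # Retry each failed chunk up to 5 times before discarding it
    retry_limit       5
```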
Now let's quickly mention buffering. When you have devices deployed behind a customer's firewall, you will run into issues with back pressure; in other words, you won't be able to deliver messages to the destination. The Fluent Bit process will start consuming more and more memory, and eventually it will be killed by the kernel. In embedded scenarios, it's always best to set the memory buffer limit for each of your input plugins. By default, this option is disabled. Once an input plugin reaches the limit, it will be paused. Memory and filesystem buffering mechanisms are not mutually exclusive; indeed, when you enable filesystem buffering for your input plugins, you essentially get the best of both worlds.

I want to finish this part with a warning. If you intend to ingest logs older than a certain time period, most backends might not accept them: many of them only support 18 to 36 hours, and some of them don't support out-of-order writes. That's why we're currently investigating Elasticsearch and OpenSearch, as I think we can customize the indices there. One hack, potentially, is to write a Lua script that rebases timestamps. I have an example, but it's really only used for testing in CI/CD.

Now, let's briefly cover architectures. First, I will say: start monitoring your agents as soon as possible using the HTTP server configuration; you get both Prometheus and JSON metrics, and this will be useful when you make capacity planning decisions. One piece of odd behavior we discovered this way, for example, was a debug log that was being generated every time an LED blinked.

We wanted to keep things simple, so initially we started with an agent deployment as our architecture: each node has an agent running that fires into the backend. This has to be the most common way people deploy Fluent Bit in the cloud today. However, there are some disadvantages, and we ran into them. It's hard to change the configuration across your fleet, for example to add a backend or processing, and it's hard to add additional destinations.

So we looked at the forwarder-aggregator approach, where logs are shipped to an intermediate server for processing and load balancing. Some of the advantages: less resource utilization, which means you can maximize throughput on your edge devices; processing at scale, because you can independently scale your aggregator tier; and easy addition of backends and configuration changes, because you're changing things in one place. The disadvantage, of course, is that you have to dedicate resources to running the aggregator instances.

We chose Fluentd for the aggregator, as there was more documentation around scaling and deployment. We had a deployment running in Kubernetes using a Helm chart, separate from our Fluent Bit deployment that collected Kubernetes logs. Two less common reasons I chose it: it can use a pre-shared key between forwarder and aggregator, and it supports user authentication. Another reason was that TLS termination is bundled into the agent itself, though I think you should really be doing TLS at your load balancer instead.

Another architecture I considered was the so-called observability pipeline. This simply refers to having some streaming technology in the middle so you can do your own routing. However, I wouldn't recommend it as your first approach unless you have the expertise, or the scale of, say, 10,000 messages per second. Although we had Kafka, we certainly weren't experts; we were still in the midst of a migration, and we didn't have role-based access control set up. There would also have been the headache of certificate management for these devices.

So we settled on the common aggregator approach, but we needed to test and harden it. One small tip I have here is to set up synthetic transactions, or heartbeats. Initially your pipeline won't have much traffic, so you need some way of passing events through it. This can be done, for example, by setting up a small Fluent Bit agent that sends dummy messages to the aggregator, as in the sketch below.
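A minimal sketch of such a heartbeat agent, assuming Fluent Bit's dummy input and forward output, might look like this; the hostname and shared key are placeholders, and the HTTP server is enabled per the earlier advice about monitoring your agents:

```
[SERVICE]
    flush       5
    # Expose JSON and Prometheus metrics so this agent can be monitored
    http_server On
    http_listen 0.0.0.0
    http_port   2020

# Emit one synthetic record per second so the pipeline always carries traffic
[INPUT]
    name  dummy
    tag   heartbeat
    dummy {"message": "heartbeat", "source": "synthetic"}
    rate  1

# Forward to the aggregator tier; hostname and key below are made up,
# and shared_key must match the aggregator's security configuration
[OUTPUT]
    name          forward
    match         heartbeat
    host          fluentd-aggregator.example.internal
    port          24224
    shared_key    replace-with-a-real-key
    self_hostname heartbeat-probe
```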
It also goes without saying that you need monitoring for all your components, and pretty much all of them support Prometheus these days; have a look at the Prometheus exporter for OpenSearch, for example.

One resource I found very useful was the Fluent Bit DevTools repository by the Calyptia team. It deploys Fluent Bit locally on a Kubernetes-in-Docker (kind) cluster, along with several popular backends such as Elasticsearch, Kafka, Grafana Loki, and Splunk. It's the best, and perhaps the easiest, way I found to compare the different backends and make sense of the integrations. I also want to call out some behavior that caught me out: the community Fluentd Helm chart would go into a crash loop unless it had established a connection to Elasticsearch. So while testing, maybe spin up a single-node Elasticsearch cluster, or temporarily remove the liveness probes.

This was a whistle-stop tour of some of the challenges we faced, and hopefully it gave you some ideas on how to solve them. We're actually still in the midst of the migration. Once we finish, however, there are some future considerations. First is the S3 archive. Setting up S3 archival would be ideal, because we could use it for auditing purposes, and it would be very useful for our data science team as well. We're already switching to OpenTelemetry tracing for our platform, so the natural progression would be to implement it for our devices too; Fluent Bit provides that unified experience, and we could also set up things like Prometheus metrics with the same single agent. We're still building out our Kafka integration; once that's up and running, and once we have role-based access control, I think we can re-evaluate the observability pipeline. But again, we probably need the scale first. Fluent Bit now has support for the Kafka protocol as well as the Kafka REST Proxy as output plugins, and since version 1.9 there's a Kafka input plugin; it's still a proof of concept, but it's awesome nonetheless. At the very start, I mentioned that we use MQTT, a lightweight pub/sub protocol, for events and metrics. One experiment that would have been nice is to ship our logs over it as well; at this time, however, it's only a feature request. Maybe it's about time I learned C.

Thank you for listening. I'd like to connect and learn from your experience, so please feel free to ask questions on Slack today. I'd also like to extend a special thanks to Anurag Gupta and Patrick Stevens for their continued support and encouragement. Thank you.