Okay. Hi everyone. I'd like to thank everybody for joining us today. Welcome to today's CNCF webinar, Take Your Monitoring to the Next Level. I'm Kristi Tan, Marketing Communications Manager at CNCF, and I'll be moderating today's webinar. We would like to welcome our presenters today: Liran Haimovitch, Co-Founder and CTO at Rookout, and Michael Aleil, DevOps at Rookout.

A few housekeeping items before we get started. During the webinar, you are not able to talk as an attendee. There is a Q&A box at the bottom of your screen. Please feel free to drop your questions in there, and we'll get to as many as we can throughout the presentation, and we'll save some time at the end as well. This is an official webinar of the CNCF and as such is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of that Code of Conduct. Basically, please be respectful of all your fellow participants and presenters. Please also note that the recording and the slides will be available later today on the CNCF webinars page at cncf.io. With that, I'll hand it over to Liran and Michael to kick off today's presentation. Take it away, guys.

Hey, guys. It's great being here. Thanks to the CNCF for hosting us. My name is Liran Haimovitch. I'm Rookout's co-founder and CTO. Here with me is Michael Aleil, our head of DevOps, and it's great being here, speaking with you. Throughout this presentation, Michael and I will be speaking interchangeably, and we'd love to answer any questions you might have, so feel free to send them throughout the presentation and we'll try to work them in.

So let me start by introducing ourselves. As I mentioned, my name is Liran Haimovitch. I'm Rookout's co-founder and CTO. I've spent most of my career doing cybersecurity, starting off in researcher and developer roles, followed by team leader and product management roles.
Along the way, I kind of fell in love with modern software management methodologies such as Agile, DevOps, and Lean, and that's kind of how I fell into monitoring, as well as other disciplines, along with a deep passion to understand how software actually works, which kind of makes up the story of Rookout, but that's for another time.

Hello, everybody. My name is Michael Aleil. I'm a DevOps at Rookout, and I'm a software developer. Before I jumped head first into the ocean of development and DevOps, I was learning the ropes the hard way. When you start out, it's not easy. And one of the ways that I stay sane is by trying to work smarter, not harder. That means automation and doing things the right way.

So today, we wanted to talk to you about monitoring, and I think the first thing we want to discuss is the importance of monitoring. Why should you care about monitoring? I'm wondering, what's your favorite IDE? How do you like to write code? Well, Michael, what's your favorite IDE?

Well, for example, my favorite IDE is WebStorm. I think it has all the necessary tools and all the features I'd like to have when I develop an application.

So for me as well, I am a fan of IntelliJ and other IDEs, but at the end of the day, nobody truly cares about how I, or Michael, or any of us write, build, and test our code, or how it runs on our laptops. What our CEO cares about, for instance, what our board members care about, what our customers care about is: how does it operate in the real world? How do customers experience the product? How do they benefit from it? And that's the essence of the importance of monitoring, because that's the only way you can truly know what's happening when the code is running away from you, in real environments. So why should you personally care?
I'm guessing that most of you aren't founders and don't have to deal with customers most of the day, but as developers, front-end or back-end, or even if you're doing operations, I think that most of us care about the experience that our customers or users are having day to day in our application. And that's one of the reasons you should care. That's also one of the reasons that Rookout was founded: to give you the possibility to understand the application more intricately.

Speaking of monitoring, there are two types of monitoring, two basic approaches. There's white box and black box monitoring. Let me start by telling you a bit about white box monitoring. Basically, it boils down to monitoring the software as if you wrote it, which we in fact did. You can instrument your code by placing logs, metrics, or traces and see what is going on inside your application in real time, when it's running in production and not on your own laptop or dev machine. The second type is black box monitoring. That means you're going to interact with your application as an outsider, as the user would. For example, that means going into a website, clicking on a button, purchasing something, and that's basically the way that a user would experience your application.

And so let's dive a little bit deeper into black box monitoring. It's actually the simpler of the two in many ways. It's about monitoring the system from the outside. You access it as a customer would. And so there are very few requirements. At the end of the day, you can very easily do it, and there is a whole bunch of tools that allow you to do it. Tools such as Pingdom and StatusCake allow you to send out DNS requests and simple HTTP requests.
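
[Editor's sketch, not part of the talk.] A simple outside-in HTTP check of the kind described here might look roughly like the following. It is a minimal illustration using only the Python standard library, not how Pingdom or StatusCake work internally. The names `probe` and `ProbeResult`, the 5-second timeout, and the injectable `opener` parameter are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    ok: bool                 # True when the endpoint answered with a 2xx/3xx status
    status: Optional[int]    # HTTP status code, or None if no response arrived
    latency_s: float         # wall-clock time spent on the attempt
    error: Optional[str] = None


def probe(url: str, timeout_s: float = 5.0,
          opener: Callable = urllib.request.urlopen) -> ProbeResult:
    """Request `url` from the outside, the way a user's browser would."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
        return ProbeResult(200 <= status < 400, status, time.monotonic() - start)
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return ProbeResult(False, exc.code, time.monotonic() - start, str(exc))
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, TLS error, or timeout.
        return ProbeResult(False, None, time.monotonic() - start, str(exc))
```

Run on a schedule from a machine outside your own infrastructure, a check like this reports both reachability and latency. Passing a different `opener` is only there to make the function easy to exercise without network access.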
There are fancier tools, such as Datadog Synthetics, that allow you to simulate more advanced requests, and even full-on testing tools, such as Selenium, Cypress, and TestCafe, which allow you to run entire user sessions and see the results. While those are often used for testing, integration tests, and front-end tests, they can also simulate a customer or user at very high fidelity and exercise entire flows going through the system.

The nice thing about black box monitoring is that you can always use it, and so it's a great way to solve the problem when white box monitoring isn't enough. Now, why wouldn't white box monitoring be enough? Maybe you've just created a new product and people aren't using it all that much just yet, but you want it to be ready for prime time, so that when people do use it, it works. If you don't have any real-world traffic, then white box monitoring is going to show you zero activity either way. Using black box monitoring, you can simulate that activity and see that it's working properly. Maybe you already have activity in the system, but you have some very important edge cases that aren't being executed often, for instance making a purchase or changing a configuration. Those actions might be rare but vital, and so you might want to exercise them using black box monitoring as well. And last but not least, occasionally you do have traffic, but the traffic is highly unpredictable, highly variable. There are a lot of natural errors that occur within the application's traffic, and so it's hard to monitor for errors or latency and so on because it's so unpredictable. Using black box monitoring and running predictable, well-known scenarios allows you to execute them in a way that you can measure and anticipate.

On the other end you have white box monitoring, which is basically the opposite.
It's monitoring the software from the inside, and we kind of divided it into three different layers. Let's call them causes, symptoms, and business. Causes are what is breaking your system, basically what is making your system not work the way it's intended to. Then you have symptoms, which is what is broken and the way it shows up. And the least known of the three is business monitoring. That's a bit less standard and you probably haven't heard of it before, but that's the way that you can actually gauge how customers or users are using your application and how that impacts your software development cycle.

So let's start with causes. Causes were defined very well in Google's SRE book, and a cause is essentially a defect causing the system to misbehave. This is whatever the problem is in the system, whatever infection, so to speak, is causing it to misbehave, and that can be a network outage, a hard drive failing, a data center crashing, whatever. And how do you monitor for it? It's about setting metrics and thresholds, first and foremost at the infrastructure level: the load balancers, the hard drives, the operating system, and so on. There are a whole bunch of tools for that. Some of them are commercial, such as AppDynamics, New Relic, and Datadog, and some are open source, such as Prometheus or Elasticsearch.

Now, it's important to note that systems are very, very big, and there are a lot of moving parts, and each one of those can fail, but not every failure in a system, especially in a large cloud native system, is going to cause an outage or a problem. So you want to limit your cause monitoring to the causes that truly matter. One way to catch those, which we're going to discuss later, is focusing on causes that have caused problems in the past, because if it happened once, it's likely to happen again, and you want to monitor for it.
At the same time, don't try to monitor everything, because that's going to be a lot of work, and there's a good chance that those local failures are not going to hurt your system.

Yes, moving on to symptoms. As I said before, it's whatever is broken in your system as a whole, what shows up as an outage or some kind of issue or degradation of service. The way you monitor that is by setting basic metrics and thresholds and creating monitors or alerts at the application level of your server. You can use the exact same tools, AppDynamics, New Relic, Datadog, Prometheus, Elasticsearch, and many more, and it's something that you should almost always use. Why? Because most of the time, when you integrate your software with any of these tools, you get a default dashboard with all kinds of metrics, for example HTTP endpoint metrics like error codes, latency, or requests per second. That's always something that you want to keep an eye on when working on your application, so you know how it fails and how much load it can take. But as Liran said, it's not something you have to overdo. You have to think about what you want to monitor and see, because if you have too much data in front of you, then you get overwhelmed and it's hard to understand what's actually important in all of that. We're going to show some examples of that later, but getting started with symptoms monitoring is very, very easy, very straightforward, and can help with monitoring almost any system.
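
[Editor's sketch, not part of the talk.] The symptom metrics mentioned above, error codes and latency per endpoint, reduce to a couple of simple calculations plus thresholds. The following is a rough illustration in plain Python, not the API of any of the tools named. `RequestSample`, `symptom_alerts`, and the default thresholds are hypothetical placeholders that would need tuning for a real system.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RequestSample:
    status: int        # HTTP status code returned to the client
    latency_s: float   # time taken to serve the request, in seconds


def error_rate(samples: Sequence[RequestSample]) -> float:
    """Share of requests that ended in a 5xx server error."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s.status >= 500) / len(samples)


def mean_latency_s(samples: Sequence[RequestSample]) -> float:
    """Average time to serve a request, in seconds."""
    if not samples:
        return 0.0
    return sum(s.latency_s for s in samples) / len(samples)


def symptom_alerts(samples: Sequence[RequestSample],
                   max_error_rate: float = 0.05,        # placeholder threshold
                   max_mean_latency_s: float = 0.5      # placeholder threshold
                   ) -> List[str]:
    """Return one message per threshold that the window of samples exceeds."""
    alerts = []
    rate = error_rate(samples)
    if rate > max_error_rate:
        alerts.append(f"error rate {rate:.1%} is above {max_error_rate:.1%}")
    latency = mean_latency_s(samples)
    if latency > max_mean_latency_s:
        alerts.append(f"mean latency {latency:.3f}s is above {max_mean_latency_s:.3f}s")
    return alerts
```

In practice an APM or metrics system computes these over a sliding time window. The point of the sketch is only that two numbers and two thresholds already give a useful first alert.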
Yeah, so first of all, we're going to talk a bit about an example of this kind of monitoring. This is taken from the Reddit status page, where they show some of the system metrics of the website. As you can see, you have all kinds of graphs and dashboards with lines that represent a lot of things, like the request rate of the website itself, how many errors Reddit gets over a certain period of time, or how much backlog there is, meaning how many posts or votes are in the queue waiting to be processed by the application. As an example, I will take the request rate graph and talk a bit about the trend. It goes up and down, and it represents how much traffic goes through the website over the course of a day. You'll see that over the whole week, as shown here, it's quite stable and follows much the same trend. Once you know that, you have some data to check against, so the next time that trend changes and you get a bigger peak, you're going to know about it and be able to predict that it's most likely an issue with the application.

And so the final type of monitoring, the most interesting and the most underutilized one, is business monitoring. Business monitoring is about monitoring the value your system is creating for your users and your business. A whole bunch of famous companies have done a lot of it. For instance, Netflix is famous for monitoring the number of plays on their videos. If people aren't playing enough videos, then Netflix isn't generating enough value, people aren't using it enough, and that's a problem. On the other hand, by the way, if people are clicking the play button too often, then there might be a problem with the streaming service, where shows aren't starting and so people are repeatedly trying to launch them. Facebook started with likes: they monitor the number of likes in the system.
For instance, if you are having a beer with people from Facebook or Instagram and a bunch of servers are crashing, they're probably not going to care. They have way too many servers to care about a server or two crashing. On the other hand, if the number of likes per minute is going down and they're on call, then they are going to drop everything and go check it out, because this means something is wrong with the system and they are not making as much money. It doesn't matter if it's because a computer is down, the network is down, or anything else. The business is not getting what it needs, and so we have a problem.

And so that's exactly how you monitor it. You set custom metrics and thresholds that measure what you care about, and we're going to show a couple of examples. You can use the same set of tools, the same set of APMs, but you can also use various BI tools if you're looking at more long-term analytics. You should definitely use business monitoring as much as you can, because it gives you a much more accurate perception of what value you are generating. The challenging part here is not only defining the metrics but also getting enough transactions, enough flow, so that you can see the ups and downs. One way to do that is to extend the time frame. Instead of monitoring every minute, which is obviously awesome, you can also monitor every hour or over 24 hours. You definitely don't want production down for 24 hours, but if you just released a version and conversion rates are going down, then it's better to find out after 24 hours than not to know about it at all.

And here are some examples of business monitoring. The first thing you'll be able to monitor as part of your business is visits. How many people actually come and use your website or application? How many people log in? How many people sign up? And that can be a metric that you check per day, per week, per month.
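
[Editor's sketch, not part of the talk.] A business metric such as sign-ups per day or likes per minute can be watched by comparing the latest window against a baseline built from recent history. This is a minimal illustration under stated assumptions: the function name `metric_dropped` and the 30% drop threshold are hypothetical, and a real setup would also have to account for weekly and seasonal patterns.

```python
from statistics import mean
from typing import Sequence


def metric_dropped(history: Sequence[float], current: float,
                   max_drop: float = 0.3) -> bool:
    """Return True when `current` is more than `max_drop` below the baseline.

    `history` holds the metric for previous windows of the same length
    (for example, sign-ups on each of the last few days), and the baseline
    is simply their mean.
    """
    if not history:
        return False          # nothing to compare against yet
    baseline = mean(history)
    if baseline <= 0:
        return False          # no meaningful activity in the baseline
    return current < baseline * (1 - max_drop)
```

The window length is the knob discussed above: with little traffic, per-minute counts are too noisy to compare, so the same check can be run on hourly or daily totals instead.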
That's entirely up to you and what you need. Or, as we mentioned with Facebook, likes, shares, and accessing content. If a lot of people click the like button, if a lot of people share a certain post, it might mean something is working, but if fewer people are doing it, it might mean something is broken or not working as intended. Or take other kinds of platforms, for example Zoom: how many people schedule a meeting per day, and does it change when something special happens in the world, or during a holiday, or something like that? And the most important one, the one most companies care about, is revenue. For example, on an e-commerce website, you can monitor how many people are purchasing a certain item or upgrading a certain service, and that might tell you that some things are working better than others. As one example from what we are doing, we're producing a periodic funnel report monitoring our self-service signups. Every week we're able to see how many people actually sign up to the application and how far people get in the process of signing up and getting started in the application. And that's something that helps us a lot in creating a better experience for new users.

So I wanted to share a story with you. This is a story from very, very early in Rookout, about two and a half years ago. Rookout was still a two-person team, just me and my co-founder, and at the time we weren't even VC funded. We had just deployed Rookout to our very first customer, which, by the way, has since been acquired by Google. Our customer installed Rookout, the data collection platform, on his app, then logged into our platform online, went into his server, and set breakpoints so that he could collect live data from that remote app on the fly without having to write any code, rebuild, or redeploy the application. And so he was happy, saw data coming in, data flowing.
And so he closed his browser and went home. At the same time, our SDK, deployed in his Python application, kept sending data to our load balancer, residing on Google at the time. And that data kept flowing into our Python backends, which were deployed in active-active mode behind the load balancer. Each of those backends got a piece of data and put it into the Redis database we used at the time.

When I got in the next morning, the system had crashed, and I then discovered the crash was caused by Redis going down. What I saw was that the load balancer returned 502 Bad Gateway errors for all incoming requests. Checking Kubernetes, where the backends were deployed, I saw that all backends were in a crash loop. They were in a crash loop because their health checks included reading from and writing to Redis, which had turned itself into read-only mode, which is what Redis does when it is unable to back itself up. By default, when Redis is unable to fork, which requires a lot of memory, and back itself up, it moves into read-only mode so that you notice something is wrong. And so overnight, data kept flowing in from the customer's application, and our Redis bloated. At some point, it needed more memory than the machine could hold, and it couldn't fork itself anymore. And so everything kind of crashed.

It wasn't too hard to fix, obviously. All we had to do was increase the RAM of the Redis instance. Since those early days, our architecture has become much bigger and much more resilient. But still, at the end of the day, it's a good story for seeing, in a simple system, how things break and how we can monitor for them.

Here are some examples of what could have been done better to monitor for, and even prevent, that downtime. And it kind of wraps together all the layers of monitoring we've talked about until now.
We could have had something as simple as metrics on 502 errors, or any kind of HTTP errors, and we would have known that something was not working. Because if the error rate is 100%, obviously the system is not working as intended. Or if we'd had an end-to-end flow, actually simulating a user doing what the system is supposed to do, end to end, then we would have known that something was failing in the middle of that process. Or, as the last part of what we talked about, business monitoring: there was actually no activity in the system. No new breakpoint could be placed, no data was flowing, nothing was happening. So that is a good indicator that something was wrong. And it goes on and on. The most important one here, I think, is the database memory. Databases are very prone to having issues, especially with resources like IO, memory, and disk usage. And that's also one of the easiest things to monitor. It's very simple to put up a new dashboard showing you how much memory or data your database is using and to get an alert when it's getting too high. And then you could even, as I said, prevent that downtime.

I think the key takeaway from this slide is that you can use multi-layered monitoring. You can have many different forms of monitoring: monitor for causes, symptoms, business, and even black box monitoring. You can monitor different areas of the system. And especially when it comes to big, critical errors, chances are you're going to see them across the system, in multiple places. So it's very easy to catch them. So get started with monitoring. The more you have, the better, but even just a handful of monitors can take you a long way, especially when it comes to those big errors everybody's afraid of.

And so we've covered some of the basics, some of the concepts of how you can monitor. We've shown you our very first downtime.
And we are fortunate enough not to have had many, but it wasn't the only one. But how should you go about starting your own monitoring program or improving your system monitoring? What should you focus on?

So I think at the end of the day, monitoring is very basic to start with. But it's an iterative process, a multi-layered kind of monitoring, and it's going to get better and better over time as you learn from your past experiences and learn what you actually need as a business or as part of your application, and what you need to know about what's happening inside of it. The first things you can start monitoring are HTTP, DNS, and TLS. There are easy tools that can tell you when your certificates are going to expire, when your DNS is not up to date or not working, or that simply try to reach your system and tell you when it is down. Then there are end-to-end sanity tests that simulate how a user might experience going into your application and using it right now. You can have some kind of value indicator, meaning that you can monitor how your system is actually giving value to a user. If something is not working as intended, for example on Netflix, if you click the play button and don't get a video showing, then it's not giving any value to the customer, and in turn that will bring down the revenue of the business. And the must-haves you need to set up are obviously endpoint latency and error rate, and queue item counts. Those come by default most of the time with a lot of tools. But I'm going to let Liran talk a bit about databases and why this point is especially important.

So what I like the most about endpoint latency and error rates is that they're among the easier things to set up. As Michael mentioned, many APMs have them out of the box, and they're fairly easy to configure. If HTTP requests take more than half a second on average, then that's a problem.
If more than 5% of my requests are 500s or server errors, that's a problem. Now, those numbers I just quoted might or might not work for your setup. Maybe your setup needs requests to be under 50 milliseconds, or maybe they can be under 5 seconds. But that's very easy to experiment with. Take it a bit up, take it a bit down. And in a couple of iterations of very little work, you've got yourself some basic monitoring to see that everything is going well, or that something's wrong. As things move along, you can obviously make it more complex, divide it by endpoint or by server. But even this very basic configuration helps you get started.

Another thing that's definitely worth checking out is queue item counts. If queue items increase, if backlogs increase, especially beyond a certain rate, then your system is not ingesting requests, data, or whatever at a fast enough rate. And that's a problem waiting to happen. Obviously, queues are there to protect you, to buffer you, so that you have some time to respond. But if those numbers are going up, you probably want to check out why, and maybe scale up the system or figure out what's going on before things break in a nasty way.

As we mentioned before, databases tend to be very resource intensive, and they tend to be very critical to the system. So you should definitely monitor databases, especially memory and disk space, and if one of those seems to be filling up, then your system might be on its way to downtime. So definitely check out those causes. That's the only cause I would recommend starting out with, unless you know what you're looking for.

So the iterative way that we choose to approach monitoring is based on three steps that you're going to do again and again to get better at it. First of all, you're going to have to decide what you want to know about the system and how it's behaving.
The second thing is going to be to install and configure a few tools, meaning you want to collect data using these tools. It can be metrics, it can be logs, it can be traces, anything that you want to observe. The next step, once you have some data, is to analyze it and see how it works for you, how it shows the actual state of the system, the actual state of the business, and what you can actually understand from it. Once you've done all three, you start to know what is working for you and what is not. And one of the most important things, I think, is that if you overdo it, you're going to have a hard time understanding what is actually important. If you have 15 monitors and alerts going off at the same time, it's going to be even harder to know what is wrong with the system as a whole. But if you have, let's say, five monitors that are very specific and tell you what is wrong at a glance, it's going to be a lot easier for you going forward.

And so you start out by planning your monitoring. What are you trying to achieve? What do you need to know? What data would you like? If you have no monitoring, then probably any piece of monitoring, any piece of data, will be useful. If you already have some monitoring in place, then think about what's missing. What's keeping you up at night because you have too many alerts and you need more data to fine-tune them? What was missing and didn't wake you up when the system went down? And in general, which parts of the system do you care about the most? What's the biggest danger to your business? What's the biggest risk? What do you want to make sure you know about if something goes wrong?

So the next part is going to be implementing it. You can install any tool.
It can be anything from Datadog and New Relic to OpenTracing to any kind of log pipelining utility. You instrument your application by setting and sending metrics that are important to you, about your users, about the flows that your application is going through, and pipeline the data to wherever you're actually gathering everything.

So this is actually one of the things people love about Rookout. Rookout allows you to instrument the application without having to write more code, redeploy, restart, and so on. You can just get the data on the fly from the existing deployment without waiting for the next sprint.

And then once you get the data, either the easy way or the hard way, you have to analyze it. You have to observe the data and see what it is showing. What would be a useful dashboard for you to know the overall state of the system? Are things going up or down? Is performance improving or degrading, and so on? Think it through. What do you want to be alerted about? And when that happens, did you truly want to be alerted? Was that a problem that truly required you to wake up and deal with it? Or was that something minor that could have waited until morning, or just resolved itself, and you don't care about? And last but not least, ask yourself what questions you can't answer with the data you already have, which pushes you back into the planning cycle to go ahead and get more data.

So we do have a question from the audience. I'm going to read the question aloud and we can answer it. Are we talking about a combination of app telemetry collection, usability, infrastructure, and app monitoring in the same platform? There is a multitude of OSS and commercial solutions out there, including Prometheus, Wavefront, New Relic, AppDynamics, etc. What is your main differentiator?

So Rookout isn't monitoring software. Rookout is a data collection and pipeline platform.
We allow our customers, software engineers, to instrument software on the fly. We allow software engineers to connect to code running in staging, production, and other remote environments and extract application snapshots, metrics, and logs on the fly and pipeline them into any of those other analytics tools, without having to go through the rebuild, redeploy, restart cycles. Michael, anything you want to add?

Yeah, just that, you know, that whole process of rewriting code, rebuilding, going through tests and CI/CD and all that, that is what takes the most time for, I think, most developers. And that's also what we see here with monitoring. It's that iterative improvement that takes the most time, getting it right, because we don't get it right the first time. As much as we want to think we are the best in our field, it's hard to get it right the first time. Overall, Rookout is used for better understanding code, learning code, debugging code, and monitoring code, and not just monitoring itself. If you guys have any more questions, feel free to...

So that's kind of all we had for today. We'd love to hear any questions. Michael?

Yeah, just a friendly reminder that there is the Q&A box at the bottom of your screen in Zoom. Feel free to submit questions. We definitely have some extra time. In the meantime, there was a question that I answered in text: can we get the slide deck? So yes, it will be available after we finish this recording. Kristi, are you able to tell people where they can find the slides?

Yeah, most definitely. It'll be on the CNCF webinars page, so it's cncf.io/webinars. And as we were saying, the recording and the slides will be available later today. And it looks like we have a question that just came through.

Yeah, what is the difference between Rookout and Jaeger or other OpenTelemetry products? Go for it.

Yeah, so the main difference is that with Rookout you're able to, let's say, debug or understand things live in production.
Whereas with Jaeger or OpenTelemetry, it's going to be, again, that process where you're looking at your metrics, monitors, and dashboards, and you're always missing something. You don't have the full picture and you want to add something new. So what you're going to do in the traditional way is add those metrics in your code, redeploy, go through the entire pipeline, and maybe a few hours or a day later you might see the new results. What we allow you to do is do that with one click, in a few seconds.

I think the best way to learn more about Rookout is to go online to the Rookout website and check out the videos. I think they're going to help you understand what Rookout does a lot better than anything we can say. And feel free to reach out to us to learn more about it.

Yeah, definitely. The second question: how do you collect the metrics? Are there sidecar containers, agents, etc.? I'm thinking that you're asking about how we collect the metrics at Rookout. So Liran, if you want to answer this one.

So Rookout is an SDK you install into your app that allows you to do metrics collection and log collection on the fly, as simple as that. If you're running in Kubernetes, you're going to add the SDK to your app, build it into the container image, and that's it.

Oh, a lot of other questions. Hi there. In your opinion, what would be the cloud native monitoring stack, the de facto standard in the market?

I think it all depends on what your tech stack looks like. Are you purely cloud native, or do you also have some more legacy software, whether that's Java web applications or mainframes? Do you need those stacks to be combined together? If so, you probably want to look at something more traditional, such as AppDynamics. If, on the other hand, you're purely cloud native and you're managing your own servers, then Datadog is a good combination of infrastructure and APM.
While if you're letting somebody else manage servers for you, you might want to look at something that's more tracing oriented, such as Epsagon or Honeycomb. There's a big difference, and obviously your security and performance needs are going to influence that as well. I guess if there was one size that fits everybody, then that vendor would own the market.

It looks like there's another question in the chat that didn't make it over to the Q&A. I'll read it to you guys. It's from Josh. It says, can you show how Rookout provides more insights than typical APMs like Jaeger?

I'm pretty sure the answer, you tell me, but I'm pretty sure the answer is no. We're not allowed to show the product on this webinar. Correct me if I'm wrong.

Yes, that is correct.

The answer to that is no, but feel free to reach out on LinkedIn, Twitter, or just check the website. We'll be happy to show you more about it in private.

Great. Great. Okay. Well, I think that covers all of the questions. Thank you again, Liran and Michael, for a great presentation today, and thank you all for joining us. Just another reminder that the recording and the slides will be available later today on the CNCF webinars page. Oh, and we got one last question here. Could Rookout do the same thing as Scout APM, which traces N+1 problems?

I personally don't know Scout APM. Liran?

I don't know Scout APM either, so I'm going to need some more details on that. Sorry. We'll check it out afterwards. Drop us an email and we'll get back to you.

No problem. All right. Anyway, thanks again, everyone, for joining us. We hope to see you at a future CNCF webinar. Have a great day. Stay safe. Bye.

Thank you for coming, everybody.