My name is Molly Struve. Welcome to Cache Is King. I am a senior site reliability engineer at Kenna Security, and today I'm going to show you how you can use Ruby to kick your performance up to the next level. Now, when most people hear "site reliability engineer," they don't necessarily think of Ruby. Instead, they think of someone who's working with MySQL, Redis, AWS, Elasticsearch, Postgres, or some other third-party service to ensure that it's up and running smoothly. For the most part, this is a lot of what we do. Kenna's site reliability team is still relatively young. We're about a year old, but when we first formed, you can bet we did exactly what you'd expect. We went and found those long-running, horrible MySQL queries, and we optimized them by adding indexes where they were missing, using select statements to avoid N+1 queries, and processing things in small batches to ensure we were never pulling too many records out of the database at once. We also had Elasticsearch searches that were constantly timing out, so we rewrote them to ensure they could finish successfully. We even overhauled how our background processing framework, Resque, was talking to Redis. Now, all of these changes led to some big improvements in performance and stability. But even with all of these things cleaned up, we were still seeing high loads across all of our data stores. And that's when we realized something else was going on.

Now, rather than tell you what was happening, I want to demonstrate it. So for this, I need a volunteer. Okay, I know you're shy. I promise it's going to be really easy. Come on, I'm going to pick somebody. Anyone? Get on up here. Come on. Thank you, sir. Trust me, it'll be good. What's your name? Excellent. Okay, stand right there. Everyone, this is... oh, what's your name? Okay, I think I got it. Don't got it. What's your name? Maybe one more time. What's your name? Let's just stick with Bill on this one. Okay. One last time: what's your name? Okay. About how annoyed are you right now? So you're a little annoyed. Okay, how annoyed would you be if I asked you "what's your name" a million times? There you go. Yeah, it'd be pretty annoying. You'd shut down, and not to mention we'd be here all night long. So what is one easy thing I could have this person do so that I don't have to keep asking them their name? I'll give you a hint: it involves a pen and a piece of paper. Shout it out. Bingo. Okay, so if you could write your name down right there. That's totally fine. Beautiful. Okay.

Now that I have Bill's name written on this piece of paper, whenever I need it, all I have to do is read the piece of paper instead of having to explicitly ask Bill and wait for a response. This is exactly what it's like for your data store. Imagine I'm your Ruby application and Bill's your data store. If your Ruby application has to ask your data store millions and millions and millions of times for information, eventually your data store is going to get pissed off. Not to mention, it's going to take you a long time to do it. If instead you make use of a local cache, which is essentially what this piece of paper is, your application can get the information it needs a whole lot faster, and it's going to save your data store a whole lot of headache. Thank you, Bill. Everyone give him a round of applause. Okay.
So the moment we realized at Kenna that it was the quantity of our data store hits that was wreaking havoc on our data stores, that was a big aha moment for us. And we immediately started trying to figure out how we could decrease the number of data store hits our application was making. Now, before I get into all the awesome ways we use Ruby to do this, I first want to give you a little bit of background on Kenna so you have some context around the stories I'm going to share. Kenna helps Fortune 500 companies manage their cybersecurity risk. The average company has 60,000 assets. You can think of an asset as basically anything with an IP address. In addition, the average company has 24 million vulnerabilities. A vulnerability is basically any way you can hack an asset. Now, with all of this data, it can be extremely difficult for companies to know what they need to focus on and fix first. And that's where Kenna comes in. At Kenna, we take all this data and we run it through our proprietary algorithms, and those tell our clients what vulnerabilities pose the biggest risk to their infrastructure, so they know what they need to focus on and fix first.

Now, when we initially get all this asset and vulnerability data, the first thing we do is put it into MySQL. MySQL is our source of truth. From there, we then index it into Elasticsearch. Elasticsearch is what allows our clients to really slice and dice their data any way they need to. In order to index assets and vulnerabilities into Elasticsearch, we have to serialize them. And that is exactly what I want to cover in my first story: serialization. In particular, I want to focus on the serialization of vulnerabilities, since that is what we do the most of at Kenna. When we first started serializing vulnerabilities for Elasticsearch, we were using Active Model Serializers to do it. For those of you who are unfamiliar, Active Model Serializers hook right into your ActiveRecord objects, so all you have to do is literally define the attributes you want to serialize, and it takes care of the rest. It's super simple, which is why it was our first solution. However, it became a less great solution when we started serializing over 200 million vulnerabilities a day. As the number of vulnerabilities we were serializing increased, the rate at which we could serialize them dropped dramatically, and our database began to max out on CPU. The caption for this screenshot in Slack was "11 hours and counting." Our database was literally on fire all the time.

Now, a lot of people might look at that graph, and their first inclination would be to say, why not just beef up your hardware? Unfortunately, at this point, we were already running on the largest RDS instance AWS had to offer, so beefing up our hardware was not an option. My team and I, when we looked at this graph, we thought, oh man, there must be a horrible query in there that we missed. So off we went hunting for that elusive horrible MySQL query. Much like Liam Neeson in Taken, we were bound and determined to find the root cause of our MySQL woes. But we never found those long-running, horrible MySQL queries, because they didn't exist. Instead, what we found were a lot of fast, millisecond queries that were happening over and over and over again. All these queries were lightning fast, but we were making so many of them at a time that our database was being overwhelmed.
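To make that concrete, here is a minimal sketch of what an Active Model Serializers setup like this might look like; the attribute names are hypothetical, not Kenna's actual schema:

```ruby
# Minimal Active Model Serializers sketch: declare the attributes and the
# gem reads each one off the ActiveRecord object for you.
# (Attribute names here are hypothetical.)
class VulnerabilitySerializer < ActiveModel::Serializer
  attributes :id, :priority, :custom_fields

  # Any attribute backed by a related model triggers its own MySQL
  # lookup per record, which is what piles up at 200 million a day.
  def custom_fields
    object.custom_fields.map { |field| [field.name, field.value] }.to_h
  end
end
```

Each serialized record runs its lookups independently, which is exactly how fast millisecond queries multiply into an overwhelmed database.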
We immediately started trying to figure out how we could serialize all of this data and make fewer database calls while we were doing it. What if, instead of making individual calls to MySQL for every vulnerability, we grouped all those vulnerabilities together and made a single MySQL call to get all their data? From this idea came the concept of bulk serialization. In order to implement this, we started with a cache class. This cache class was responsible for taking a set of vulnerabilities and a client and then running all the MySQL lookups for them at once. We then took this cache class and passed it to our vulnerability serializer, which still held all the logic needed to serialize each individual field, except now, when it would serialize a field, it would simply talk to our cache class instead of MySQL. So let's look at an example. In our application, vulnerabilities have a related model called custom fields. They basically just allow us to add any attribute we want to a vulnerability. Before, when we would serialize custom fields, we would have to talk to the database to do it. Now, we could simply read from that cache.

The payoff of this change was big. For starters, the time it took to serialize vulnerabilities dropped dramatically. Here is a console shot of how long it takes to serialize 300 vulnerabilities individually: just over six seconds. And that's probably a pretty generous estimate, considering it would take even longer when the database was under load. If, instead, you serialize those exact same 300 vulnerabilities in bulk: boom, less than a second to do it. These speedups are a direct result of the decrease in the number of database hits we have to make in order to serialize these vulnerabilities. To serialize those 300 vulnerabilities individually means we have to make 2,100 calls to the database. 2,100. To serialize those 300 vulnerabilities in bulk now means we only have to make seven. Boom, again. As you can probably glean from the math here, it's no longer seven calls per vulnerability; it's seven calls for however many vulnerabilities you can group together at once. In our case, we are usually grouping vulnerabilities in batches of 1,000 when we're indexing them, so we roughly took the number of database requests we were making per batch from 7,000 down to seven. This large drop in MySQL requests is extremely apparent in this MySQL queries graph, where you can see the number of requests we were making before and after we deployed the bulk serialization change. With this large drop in database queries also came a large drop in database load, which you can see on this CPU utilization graph. Prior to the change, we were maxing out our database. Afterwards, we're sitting pretty, hovering around 20 to 25%. And it's been like this ever since.

So the moral of the story here is: when you find yourself processing large amounts of data, try to find ways to use Ruby to process that data in bulk. We did this for serialization, but it can be applied any time you find yourself processing data in a one-by-one manner. Take a step back and ask yourself, is there a way I could process this data together in bulk? Because one MySQL call for 1,000 IDs is always going to be faster than 1,000 individual MySQL calls.
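Roughly, the shape of that change looked something like this; the class and method names are illustrative, not our actual code:

```ruby
# Bulk serialization sketch: one query loads custom fields for an entire
# batch of vulnerabilities, and the serializer reads from that in-memory
# cache instead of asking MySQL per record. (Names are illustrative.)
class BulkVulnerabilityCache
  attr_reader :custom_fields

  def initialize(vulnerabilities)
    ids = vulnerabilities.map(&:id)
    # A single MySQL call for the whole batch, keyed by vulnerability id.
    @custom_fields = CustomField.where(vulnerability_id: ids)
                                .group_by(&:vulnerability_id)
  end
end

class VulnerabilitySerializer
  def initialize(vulnerability, cache)
    @vulnerability = vulnerability
    @cache = cache
  end

  # Before: @vulnerability.custom_fields hit the database for every record.
  # After: a plain Ruby hash lookup against the batch cache.
  def custom_fields
    @cache.custom_fields[@vulnerability.id] || []
  end
end
```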
Now, unfortunately, the serialization saga doesn't end here. Once we got MySQL all happy and sorted out, suddenly Redis became very sad. And this, folks, is the life of a site reliability engineer. A lot of days we feel like this: you put one fire out, and you start one somewhere else. You speed one thing up, and the load transfers to another. In this case, we had transferred the load from MySQL to Redis. Let me explain why.

When we index vulnerabilities into Elasticsearch, we not only have to talk to MySQL in order to serialize them, we also have to talk to Redis to know where to put them. In Elasticsearch, our vulnerabilities are organized by client. So to figure out where a vulnerability belongs, we have to make a GET request to Redis to fetch the index name for that vulnerability. When preparing vulnerabilities for indexing, we gather all the serialized vulnerability hashes, and one of the last things we do before sending them to Elasticsearch is make that Redis GET request to retrieve the index name for each vulnerability based on its client. These vulnerability hashes are grouped by client, so this Redis GET request is returning the same information over and over and over again. All these simple GET requests are also blindingly fast. As you can see, they take a millisecond to execute. But as I stated before, it doesn't matter how fast your requests are: if you're making a ton of them, it's going to take you a long time. We were making so many of these simple GET requests that they were accounting for roughly 65% of the time it took for us to index vulnerabilities, which you can see represented by the brown in this graph.

The solution to eliminating a lot of these requests, once again, was Ruby. In this case, we ended up using a Ruby hash to cache the index name for each client. Then, when looping through those serialized vulnerability hashes to send to Elasticsearch, rather than hitting Redis for every individual one, we would simply reference that client indexes hash. This means we now only have to hit Redis once per client instead of once per vulnerability. So let's look at how this paid off. Given we have these three batches of vulnerabilities, no matter how many vulnerabilities are in each batch, we're only ever going to have to make three requests to Redis in order to get all the data we need to know where to put them in Elasticsearch. As I mentioned earlier, these batches usually contain a thousand vulnerabilities apiece, so we roughly decreased the number of hits we were making to Redis a thousandfold, which in turn led to a 65% increase in job speed.

Even though Redis is fast, a local cache is faster. To put it into perspective for you: getting a piece of data from a local cache is like driving from downtown LA to LAX to get it. Not in rush hour traffic, though; that's a whole different story. Getting that same piece of information from Redis is like taking a plane and flying from LAX all the way to Denver to get it. Redis is so fast that it's easy to forget you're actually making an external request when you're talking to it. And those external requests can add up and have performance impacts on your application. So with these maps in mind, remember: Redis is fast, but a local Ruby cache, such as a hash cache, is always going to be faster.
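In code, the idea is as simple as it sounds. Here is a sketch, where the Redis key format and the send_to_elasticsearch helper are assumptions on my part:

```ruby
# Local hash cache sketch: fetch each client's Elasticsearch index name
# from Redis at most once, then reuse it for every vulnerability hash.
# The Redis key format and send_to_elasticsearch helper are assumptions.
client_indexes = Hash.new do |hash, client_id|
  hash[client_id] = redis.get("index_name_client_#{client_id}")
end

# serialized_vulnerabilities: vulnerability hashes grouped by client id.
serialized_vulnerabilities.each do |client_id, vuln_hashes|
  index_name = client_indexes[client_id] # hits Redis once per client
  send_to_elasticsearch(index: index_name, docs: vuln_hashes)
end
```

The Hash default block is doing the memoization here: the first lookup for a client pays the Redis round trip, and every lookup after that is an in-process read.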
So now, those are two great ways you can use simple Ruby to replace your data store hits. Next, I want to talk about how you can use whatever framework you're using to replace your data store hits. This past year at Kenna, we sharded our main MySQL database, and when we did, we chose to do it by client. So each client's data lives in its own sharded database, which means when we get data for a client, like, say, an asset, we have to know what sharded database to talk to for that client. This means our sharding configuration, which tells us what client belongs on what sharded database, needs to be easily accessible, because we have to access it every single time we make a MySQL request. We first turned to Redis to do this, because Redis is fast and the configuration hash we wanted to store was small. But that didn't last for long. Eventually, that configuration hash grew and grew. Now, 13 kilobytes might not seem like a lot of data, but if you're asking for 13 kilobytes over and over again, it adds up. In addition to this, we were also increasing the number of workers we had working on every single one of these databases, until we had 285 workers chugging along at once. Remember, every single time one of these workers makes a MySQL request, it first has to go to Redis to get that 13-kilobyte configuration hash. This quickly added up, and soon we were reading 7.8 megabytes per second from Redis, which we knew was not going to be sustainable as we continued to grow and add clients.

One of the first things we did when trying to figure out how to solve this issue was take a look at ActiveRecord's connection object. ActiveRecord's connection object is where ActiveRecord stores all the data it needs to know how to talk to your database. So naturally, we thought it might be a good place to find a solution for storing our sharding configuration. So we jumped into a console to check it out. And what we found was not an ActiveRecord object at all. It was this Octopus proxy object that our Octopus sharding gem had created. This was a complete surprise to us, so we immediately started digging into the gem's source code, trying to figure out where this proxy object had come from. And when we finally found that proxy object, much to our delighted surprise, it contained all these great helper methods that were already storing our sharding configuration. Boom. Problem solved. Rather than hitting Redis every time we made a MySQL call, all we had to do was talk to our ActiveRecord connection object.

One of the big things we learned from this whole experience was how important it is to know your gems. It's crazy easy to include a gem in a Gemfile, but when you do, make sure you have a general understanding of how it works. I am not saying you need to go and read the source code for every one of your gems, because that would take forever. But consider this: the next time you add a gem, maybe the first time you configure it, do it manually in a console, so you can see what it's doing and how it's interacting with the rest of your databases and code. If we had had a better understanding of how our Octopus gem was configured, we could have avoided this entire Redis headache. Regardless of where the solution came from, though, once again, caching locally, in this case using our framework as a cache, is always going to be faster and easier than having to make an external request.
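From the console, the discovery looked something like this; the exact helpers depend on your Octopus version, so treat the method names below as illustrative:

```ruby
# Console sketch of the discovery (helper names are illustrative; check
# your Octopus version). The gem swaps ActiveRecord's connection for a
# proxy object that already holds the sharding configuration in memory.
ActiveRecord::Base.connection.class
# => Octopus::Proxy, not the plain MySQL adapter we expected

proxy = ActiveRecord::Base.connection
proxy.current_shard # which sharded database we're currently pointed at
proxy.shard_names   # every configured shard, no Redis round trip needed
```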
Okay, so those are three great ways you can use Ruby to replace your data store hits. Now I want to shift gears and talk about how you can use Ruby to avoid making data store hits you don't need. Now, I'm sure some of you are thinking, duh, I already know how to do this. But let's hold up for a minute, because this might not be as obvious as you think. For example, how many of you have written code like this? Come on, I know you have, because I have written code like this. Okay, this code looks pretty good, right? If there are no user IDs, then we're going to skip all of this user processing. It's great, it's fine, right? Unfortunately, it's not fine. Let me explain why. It turns out, if you execute this where clause with an empty array, you're actually going to hit MySQL when you do. Notice this WHERE 1=0 statement: this is what ActiveRecord uses to ensure no records are returned. Sure, it's a fast, one-millisecond query, but if you're executing this query millions of times, it can easily overwhelm your database and slow you down.

So, how do you update this chunk of code to make your site reliability engineers love you? You have two options. The first is to not execute that MySQL lookup unless you absolutely have to, and you can do this with an easy-peasy array check in Ruby. By adding this line, you can avoid making a worthless data store hit and avoid overwhelming your database with useless calls. In addition to not overwhelming your database, this is going to speed up your code. Say you're running this chunk of code 10,000 times. If you run that useless MySQL lookup 10,000 times, it's going to take you over half a second. If instead you add that simple line of Ruby to prevent the MySQL lookup and run a similar block of code 10,000 times: boom, less than a hundredth of a second. As you can see, there is a significant difference between hitting MySQL unnecessarily 10,000 times and running a plain old line of Ruby 10,000 times. And this difference can add up and have an impact on the performance of your application. A lot of people like to look at this top chunk of code, and the first thing they'll say is, what are you going to do? Ruby is slow. But that couldn't be further from the truth, because as we just saw, the simple line of Ruby is hundreds of times faster. In this case, Ruby is not slow; hitting the database is what's slow. Keep an eye out for situations like these in your code, where it might be making a database request you don't expect.

Now, some of you Rails folks might be looking at this thinking, I'm not exactly running code like that. Actually, I chain a bunch of scopes onto my where clause, so I have to pass the empty array, otherwise my scope chain breaks. Thankfully, even though ActiveRecord does not handle empty arrays well, it does give you an option for handling empty scopes, and that is the none scope. none is an ActiveRecord query method that allows you to return a chainable relation with zero records, and, more importantly, to do it without querying the database. So let's see this in action. We know from before that if we run that where clause with our empty array, we're going to hit MySQL, and we're going to do it with all our scopes attached. If instead we replace that where clause with the none scope: boom, we're no longer hitting the database. Be on the lookout for tools like these that will allow you to work smarter with empty datasets. And more importantly, never, ever assume your framework or gem is not making a database request when asked to process an empty dataset, because you know what they say about assuming.
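Putting both options together, here is a sketch with a hypothetical User lookup; the scope name at the end is made up for illustration:

```ruby
# Option 1: a plain Ruby guard so the query never runs for an empty array.
def process_users(user_ids)
  return if user_ids.empty? # skips the useless WHERE 1=0 query entirely

  User.where(id: user_ids).find_each do |user|
    # ... user processing ...
  end
end

# Option 2: when the relation needs to stay chainable, use none, which
# returns a zero-record relation without ever querying the database.
def users_scope(user_ids)
  return User.none if user_ids.empty?

  User.where(id: user_ids)
end

users_scope([]).recently_active # hypothetical scope; chains fine, no query
```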
Ruby has so many awesome, easily accessible libraries and gems, but their ease of use can lull you into a sense of complacency. Once again, make sure when you're working with a gem, library, or framework that you have a general understanding of how it works. One of the easiest ways to gain a better understanding of how your gems or libraries are working is logging. Set your logging to the debug level for your frameworks, your gems, and every one of your related services. Then fire up your app, load some application pages, run some workers, even jump into a console and just start running some commands. When you're done, look at the logs that are produced. Those logs are going to tell you a lot about how your application is interacting with your data stores, and some of it might not be interacting how you would think. I cannot stress enough how valuable something as simple as reading logs can be when it comes to making optimizations in an application and removing useless data store hits.

Now, this concept of preventing useless data store hits doesn't just apply to MySQL. It can apply to any data store you're working with, such as Postgres, Redis, Elasticsearch, etc. Where we found it particularly useful at Kenna was when it came to building what we call reports. Every night at Kenna, we build these beautiful, colorful reports for clients from their asset and vulnerability data. These reports start with a reporting object, which holds all the logic needed to know how to build that report. Then, every night, in order to build these reports, we have to make over 20 requests to Elasticsearch and multiple requests to Redis and MySQL. We did a lot of work to make sure these requests were fast, but even so, it was still taking us hours to build these reports every night. And soon we had so many reports in our system that we couldn't finish them all overnight. Clients were waking up in the morning and their reports weren't ready, which was a big problem.

When my team and I first started trying to figure out how to solve this problem, one of the first things we did was look at the reporting objects we had. So once again, we jumped into a console, and the first thing we looked at was, okay, how many reports do we have? Just over 25,000. That was a pretty healthy number for us. The next thing we wanted to see was how big these reports were. Our report size depends on how many assets a report contains: the more assets in a report, the longer it's going to take that report to build. So we thought maybe we could find some way to split these reports up by size, and we decided to take a look at the average asset count per report. Just over 1,600. Now, if you remember back to the beginning of the presentation, I mentioned that the average client has 60,000 assets. So that 1,600 number seemed kind of low to us. The next thing we decided to look at was how many reports had zero assets, or were blank. Hello: over 10,000. Over a third of our reports contained no assets, which means they contained no data. And if they contain no data, what's the point of running all these useless data store hits if we know they're going to return nothing? Lightbulb: don't hit the data stores if the report is empty. By skipping the reports that had no data, we took our processing time from over 10 hours down to three. Just by adding a simple line of Ruby, we were able to prevent a ton of worthless data store hits, which in turn sped up our processing tremendously. This strategy of using Ruby to prevent useless data store hits is something I like to refer to as using database guards.
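In code, a guard like this is about as small as an optimization gets; a sketch, with the model and method names assumed:

```ruby
# Database guard sketch (model and method names assumed): skip the 20+
# Elasticsearch, Redis, and MySQL requests for reports with no assets.
reports.find_each do |report|
  next if report.asset_count.zero? # empty report, nothing to build

  report.build! # the expensive data store requests only run when needed
end
```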
In practice, it's super simple, but I think it's one of the easiest things to overlook when you're writing code. Okay, we're almost there. This last optimization story I have for you actually happened within the last couple of months. Remember those Resque workers I was talking about earlier? As I also mentioned, they run with the help of Redis. And one of the primary things we use Redis for is to throttle those Resque workers. Given our sharded database setup, we only ever want a set number of workers working on a database at any given time, because what we found is that too many workers working on a database would overwhelm it. So to start, we pointed 45 workers at each one.

Now, after making all these improvements I just covered, we figured it was probably time to increase the number of workers we were running so that we could increase our processing speed. So we bumped the number of workers up to 70. After doing this, we kept a close eye on our MySQL monitoring, but it still looked great. MySQL was happy as can be. My team and I went about the rest of our day, pretty darn proud of ourselves at this point. But it didn't last for long. Because, as we learned earlier, when you put one fire out, you start one somewhere else. MySQL was happy, but then, overnight, we got Redis high-traffic alerts. And when we looked at our Redis traffic graphs, we saw that at times we were reading over 50 megabytes per second from Redis. So that 7.8 from earlier isn't looking so bad now. This load was caused by the hundreds of thousands of requests we were making to Redis trying to throttle these workers, which you can see in this Redis request graph. Basically, before any one of these workers can pick up a job, it first has to make multiple requests to Redis to figure out how many workers are already working on that database. If it's 70, it won't pick up the job. If it's less than 70, it knows it can pick up the job. All of these hundreds of thousands of requests were overwhelming Redis, and they ended up causing a lot of errors in our application, like this Redis connection error. Redis started dropping important application requests because it was so overwhelmed with all of our throttling requests.

Now, you can bet, given what we had learned from our previous experiences, our first thought was: how do we use Ruby to solve this? Could we cache the worker state in Resque? Could we cache it in ActiveRecord, maybe? Unfortunately, after pondering this problem for a few days, no one on the team came up with any great solutions. So we decided to do the easiest thing we could think of: we removed the throttling completely. And when we did, the result was dramatic. There was an immediate drop in Redis requests, which was a huge win, but more importantly, those Redis network traffic spikes we had been seeing overnight were now completely gone. As soon as the load on Redis decreased, all those errors we were seeing before resolved themselves. Following the throttling removal, we of course kept a close eye on MySQL, but it was still happy as could be. So the moral of the story is: sometimes you want to find ways to use Ruby to replace your data store hits, sometimes you want to use Ruby to prevent data store hits, and other times you just need to straight up remove the data store hits you no longer need.
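For context, the throttling we deleted worked roughly like this; a simplified sketch, not Resque's API or Kenna's actual implementation:

```ruby
# Simplified sketch of the throttling we removed: before picking up a
# job, every worker asked Redis how many workers were already on that
# shard's database. Multiple round trips per job pickup, multiplied by
# hundreds of workers, is where the 50 MB/s of Redis traffic came from.
WORKER_LIMIT = 70

def can_pick_up_job?(shard_name)
  redis.get("workers_on_#{shard_name}").to_i < WORKER_LIMIT
end
```

Deleting a check like this is a zero-line "optimization": every Redis request it was making simply stops existing.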
This is especially important for those of you who have fast-growing and evolving applications. Make sure you're periodically taking inventory of all the tools you're using and ensuring that they're still needed. And if you don't need something anymore, get rid of it, because it might save your data store a whole lot of headache. As you all are building and scaling your applications, remember these five tips, and more importantly, remember that every data store hit counts. It doesn't matter how fast it is: if you multiply it by a million, it's gonna suck for your data store. You wouldn't just throw dollar bills in the air, would you? Because a single dollar bill is cheap. Don't throw your data store hits around, no matter how fast they are. Make sure every external request you are making is absolutely necessary, and I guarantee your site reliability engineers will love you. And with that, my job here is done. Thank you guys so much for your time and attention.

Okay, so the question was: did your team have any issues with cache invalidation? The answer is yes. When we first started, we would shoot for lower caching times, and that would mitigate some of the load, and then we'd try longer and longer caching times until we found the sweet spot where it wasn't so stale for our clients, but we were still seeing performance benefits. So it's definitely something that you're probably gonna have to test, and maybe iterate over a couple of times, before you get the invalidation length of time you need. Another thing I want to say is: make sure when you're adding things to the cache, everything has an invalidation. When our site reliability team was first formed, we didn't have automatic invalidation for some of our cache items, so we had stuff sitting in our cache from, like, years before, and it added up. So make sure you're definitely invalidating everything.

Yeah, so the question was: if you are working in a load-balanced environment, are you keeping a cache on every instance? And yes, in this case, the caching is all local, so it's within the instance that is running, and like I said, we're using objects, instance variables, etc. to do it.

So the question is: is there a way you could automatically detect where you could place database guards? And the simple answer is no. That's why I included that section on logging; that's honestly one of the best ways. We use Datadog, specifically Datadog APM, which traces a lot of the hits and requests we're making, so a lot of times it's really easy: we can jump in there, look at a job, and say, oh, we're spending the majority of the time making 100 requests to MySQL. Well, what is that request we're making? So logs, and monitoring tools that track what requests you're making, are going to help you find those requests. And honestly, they're pretty easy to spot: when you have a request that's being executed over and over again, you look at your log file, and that request is all you see.

So the question was: have we considered using an event stream service when sourcing to Elasticsearch? Did I get that right? We have not, so maybe that's something we will look into. Thank you. Okay, thank you so much, guys. I'll hang out if anyone else has any questions.