So, 2:40 on the dot, let's do it. I'm Godfrey. No one is introducing me today, so I have to do the work myself. But before we start, a very important question for you: did you file your taxes? In case you didn't know, the deadline was yesterday, but if you missed it, don't worry, there will be another one next year. But more importantly, are you in the right room? Because this is a sponsored session, in case you didn't know what that means. Just making sure everyone is in the right room for the right talk, no regrets. All right, so, sponsored talk. If we all know what we're getting into, let's do it. "This video is brought to you by Blue Apron. You're busy, we get it, we're busy too, but that doesn't mean you can't have nutritious and delicious..." Ah, never mind. Yeah, I figure I should burn the evidence before I get lawyers calling me. I actually work for a small company called Skylight. I don't know if anyone actually believes us when we say we're a small company, so we decided to put the entire company on stage this year, so hopefully people will believe us. But even though we're not sponsored by Blue Apron, if you do want to sign up for Blue Apron, I have a personal promo code that you can use. Speaking of personal referral codes, this is my personal referral code on Skylight. If you sign up with this URL, we'll both be swimming in Skylight credits, so I'll give you a few minutes to copy that down. Got it? Okay. If you don't want to remember the long one, you can also do skylight.io slash r slash railsconf, in which case I won't get any money, but you will. Anyway, if you like free money, you probably are into cryptocurrency, so we're announcing the Skylight ICO as our token of appreciation. You can find the details at skylight.io slash pepecon.ico. 
So, we have been very busy working on a lot of things in the past year, but instead of telling you about all of them, today we decided to focus on just one, and that's Skylight for open source, and Vaidehi will be here to tell you all about it. Your call is very important to us. Can you hear me? Oh, how about this? All right. Hello, my name is Vaidehi, and as Godfrey mentioned, one of the things we've been working on is the Skylight for open source program, and it was born from our own experiences as open source contributors. Many of us on the Skylight team actively work on open source, and from our own experiences, we've seen that it can sometimes be pretty hard to get existing contributors to actually work on application performance issues. Only a few contributors consider working on performance problems to begin with, and even the ones that are already interested can find it hard to know where to start. And that's why we created the Skylight for open source program. This is, of course, the same Skylight that you know and love, except that open source apps will have their public dashboards accessible to everyone. That way, all your potential contributors can access this performance data easily, even if they don't have a Skylight account. We started this program because we hope that by making performance data and information more accessible, we could inspire potential contributors to tackle those really slow parts of their favorite open source apps. And by leveraging Skylight, they could finally have a good way to see if their own contributions actually helped. Because at the end of the day, every single app, large or small, open source or not, has room for improvement when it comes to performance. So, after spreading the word about the Skylight for open source program, we decided to try contributing back ourselves. We had a company-wide OSSathon. Yes, OSSathon. It's like a hackathon, but it's for open source projects, as you might have guessed. 
And we paired up to help make performance contributions and improvements to some of the apps that are participating in the Skylight for open source program. So, today, we're gonna show you three different open source apps that are running on Skylight, each of them with their own unique performance challenges, varying in complexity. To start, I'm gonna hand it off to Yehuda, Lee, and Peter, who will introduce us to the very first app. Hey, Peter, wanna see something awesome? Sure, what's awesome? Well, I'm glad you asked. OSEM stands for Open Source Event Manager. I have no idea if it's actually pronounced "awesome," but I'm going with it anyway. OSEM is an event management tool tailored to free and open source software conferences. It's a one-stop shop to set up a website for your conference, handling everything from the program, registration, and sponsorships, as well as the admin interface for editing all this content. Since OSEM is designed to be a self-hosted solution, working on performance issues could be tricky, since you'll need real data and traffic to understand how the code is performing in the real world. Luckily, one of the maintainers of OSEM is running a conference later this month, which is LinuxFest Northwest 2018. So, you can see OSEM working in action on the website, and on Skylight too. This is the Skylight dashboard for the LinuxFest website, which is powered by OSEM. On the top, you'll see a graph showing some key performance metrics over time. The blue line represents the app's typical response time. The purple line is the problem response time, and the gray part at the bottom is the number of requests per minute. Now, you're probably wondering what the typical and problem response times are. The technical names for these metrics are the median and 95th percentile, which are a pretty good proxy for what a typical user of your site would experience on a typical day versus a rainy day. 
But why do we show you two numbers? Can't we just show you the average response time here? That is a very good question. Hang on to that thought for now, as Yehuda will explain in a minute. Not yet. For now, let's turn our attention to the second half of the page, the endpoints list. This shows you all the endpoints in your app, their typical and problem response times, as well as their popularity. By default, they're ranked by agony, which helps you prioritize which performance issues are worth spending your time fixing. It takes into account both how slow the endpoint is, as well as how popular it is. So, you won't waste your time fixing a slow endpoint that's only accessed by a handful of people every day. Those of you with good eyes might also notice the red heads-up icon next to the endpoint name. This indicates potential database issues. But what kind of database issues? Yehuda will now tell you all about it. So, to find out, let's click on the first row to see the detailed endpoint page. The first thing you're probably gonna notice here is the histogram on top. What this shows you is a distribution of the response times for this endpoint. This is a pretty good time to revisit Lee's earlier question about typical and problem response times versus averages. If there's one thing you remember from your statistics class, it's probably the normal distribution, aka the bell curve. And you might expect your response time distribution to look something like that, matching the bell curve you learned in school. The normal distribution has some pretty nice properties, which is why they teach it in school. For example, if your response times actually did look like this, the average would be right in the middle, making it a pretty representative number. And if they did look like this, you would expect a lot of your visitors to experience response times clustered around this number, and that would also be pretty nice. 
However, response times in the real world do not tend to follow this distribution. They tend to follow a distribution more like this, which is skewed to one side. A lot of the requests in this distribution will be fast, but there is also a long tail of slow requests. In this case, an average just doesn't tell you anything about your fast requests. And even worse, it misleads you about your slow requests, which are usually around 10 times slower than the fast ones. Now, if you asked me to tell you where the average of this particular distribution is, I actually would have no idea where to point, which is why, instead of showing you the average and misleading you into thinking about the bell curve, we show you the typical response time, which covers about half of all of your requests, and the problem response time, which covers about 95% of your requests. Now, it's important to remember that one in 20 requests falls outside of this range, so basically all of your users are going to experience slowness on the right side of that bar pretty regularly. And so we show you both numbers to remind you of what your users experience in the real world. So that's the first half of the page. The second half of the page shows you what things your app actually spent time doing during the request. For example, as you would expect, most of the time in the request is spent here processing the controller action, which is indicated by the blue bar. However, a lot of time is spent doing other things within the action, such as database queries and rendering views, which is why parts of the blue bar are painted in light blue. The child tasks inside the light blue are also color-coded: green is for database-related stuff, and purple is for view rendering. You can click on an event to view more details about it, such as the actual SQL query being executed. And if you remember, the endpoint page we saw earlier had a heads-up icon referring to a potential database issue in this endpoint. 
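Stepping back to the averages-versus-percentiles point for a second, here's a quick sketch in plain Ruby. The sample numbers are made up, and the nearest-rank percentile calculation here is just an illustration of the idea, not Skylight's actual implementation:

```ruby
# Made-up response times (milliseconds) with a long tail of slow requests.
samples = [40, 45, 50, 50, 55, 55, 60, 60, 65, 70,
           75, 80, 90, 100, 120, 150, 400, 800, 1500, 3000]

# Nearest-rank percentile on an already-sorted array.
def percentile(sorted, pct)
  index = ((pct / 100.0) * sorted.length).ceil - 1
  sorted[index]
end

sorted  = samples.sort
average = samples.sum / samples.length.to_f
median  = percentile(sorted, 50)   # the "typical" response time
p95     = percentile(sorted, 95)   # the "problem" response time

puts "average: #{average.round}ms" # dragged way up by the slow tail
puts "median:  #{median}ms"        # what half of requests beat
puts "p95:     #{p95}ms"           # what 1 in 20 requests exceed
```

With a tail like this, the average lands in a spot that almost no real request experiences, while the median and the 95th percentile bracket the typical day and the rainy day.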
It corresponds to this red icon over here, which refers to a pretty common kind of database issue in Rails apps called N+1 queries. And Peter is going to explain what an N+1 query is and how you can fix it. Thanks, Yehuda. I imagine that probably many of you have heard of N+1 queries before, but if you haven't, don't worry, because I'm about to explain them. So let's say you built a blog in Rails. You'll probably have some code like this in the controller. Obviously, you can see we have an index action, we're selecting our posts here, and basically gonna send that off to the template. Since this is a blog, you probably also have authors, where each post belongs to an author. Finally, you might have something like this in your templates. You're gonna loop over each post and render the author's name and probably some other data about the post. This works, but however benign it looks, this code is actually hiding some important work from you. When you access post.author, Rails doesn't yet have the author data. It has only loaded the post itself, so it has to do an additional query to fetch the author from the database. Since we have to do this for each post that we render, we'll end up issuing 10 separate author queries to the database, in addition to the one we had to do to fetch the posts. This is why it's called the N+1 query problem. Because we know we are going to need the author information anyway, we can optimize this by fetching it ahead of time using just a single additional query. Rails provides an API to do this for us automatically. You just change the controller code to add this includes here. Probably many of you have done this, but if you haven't, this is a really useful thing to do. And now, when we select from the posts, we do this first query, and then we have a second one that gets all the authors in one query. And admittedly, this is only the simplest possible scenario. 
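Since the slides with the actual controller and template code aren't reproduced here, the same before-and-after can be sketched as a toy simulation in plain Ruby. A query counter stands in for the database, and all the names below are made up for illustration; this is not real ActiveRecord:

```ruby
$query_count = 0

AUTHORS = { 1 => "Ann", 2 => "Ben" }
POSTS   = (1..10).map { |i| { id: i, author_id: (i % 2) + 1 } }

# Every call here stands in for one round trip to the database.
def db_query(sql)
  $query_count += 1
  yield
end

# Lazy loading: one query for the posts, then one more per post's author.
def render_index_lazy
  posts = db_query("SELECT * FROM posts") { POSTS }
  posts.map do |post|
    db_query("SELECT * FROM authors WHERE id = #{post[:author_id]}") do
      AUTHORS[post[:author_id]]
    end
  end
end

# Eager loading (what includes(:author) arranges): one extra query up front,
# then every author lookup is served from memory.
def render_index_eager
  posts = db_query("SELECT * FROM posts") { POSTS }
  ids = posts.map { |p| p[:author_id] }.uniq
  authors = db_query("SELECT * FROM authors WHERE id IN (#{ids.join(',')})") do
    AUTHORS.slice(*ids)
  end
  posts.map { |post| authors[post[:author_id]] }
end

$query_count = 0
render_index_lazy
lazy_queries = $query_count

$query_count = 0
render_index_eager
eager_queries = $query_count

puts "lazy: #{lazy_queries} queries, eager: #{eager_queries} queries"
```

For 10 posts, the lazy version issues 11 queries (one for the posts plus one per author), while the eager version issues 2, which is exactly the shape of the fix described above.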
In the real world, you'll often find your N+1 queries are hidden deep in the code. They may follow the same pattern, but somewhere that's not quite as obvious and straightforward as this. Fortunately, Skylight does a pretty good job of detecting and highlighting these problems, pointing you in the right direction of where to investigate further. Going back to OSEM, we were able to use the information from Skylight as a starting point. We found a couple of places where N+1 queries were an issue and submitted a PR to address them. As our time is pretty limited here, I won't get into any more of the nitty-gritty details. However, if you're interested in learning more, feel free to stop by our booth in the exhibition hall. We'll be around later today and tomorrow, and we'll be happy to talk to you a bit more about how we found some of these problems and what we did to fix them. But for now, I'm gonna hand this over to Kristen and Zach to talk about the work they did on another open source app, the Odin Project. Thank you, Peter. We're gonna talk about the Odin Project. The Odin Project is an open source community and curriculum for learning web development, like a code school, basically. Students of the Odin Project build portfolio projects and complete lessons that are constantly curated and updated with the latest resources. They offer free courses like Ruby, Rails, JavaScript, HTML, and CSS. Once you're done climbing the technical ladder, they even have a course on how to go about getting a job in the industry, walking you through things like searching, interviews, and much more. One challenge of running an app like the Odin Project comes from the size of the community. As you can imagine, there's a lot of information to keep track of for each user, such as which courses they took and their progress in each course. With over 100,000 users, this adds up very quickly. 
For example, let's take a look at the work involved in rendering these beautiful badges. As you might imagine, you need to know two things to be able to render a badge: the lessons belonging to the course, and which of those lessons the student has completed. The first part is pretty straightforward, just a simple select query. The second part involves a join table called lesson_completions. It's pretty standard too. There is a lesson ID column and a student ID column. Since we are showing badges for every course on the page, in order to avoid the N+1 query problem, we are doing a single query to fetch all the completion records for the current student. However, according to Skylight, the second query takes up a noticeable amount of time on each request, not terribly slow, but definitely noticeable. So the question is, can we make it faster? Now, as you may know, databases rely on indexes to keep these kinds of queries fast. Essentially, you do a little more work when you're inserting the data in exchange for a significant performance boost when querying. In a lot of cases, the answer is simply to add the missing index. But this case is a little more nuanced than that. Because a student shouldn't be able to complete the same lesson twice, there is already a compound index on lesson ID and student ID that guarantees uniqueness. Since we're only querying on the second field, is this compound index sufficient for our query? For a long time, and for most databases, the answer would have been no. However, starting in version 8.1, Postgres is sometimes able to take advantage of the compound index for queries involving only the second column in that index. Unfortunately, even when that is possible, it's still not as efficient as having a dedicated index for that second column. We can confirm this by running an EXPLAIN ANALYZE on the queries. 
If you haven't used this before, it's a great tool in Postgres to understand what the database is doing under the hood, in order to help you optimize your queries. As you can see, if we have no indexes at all, Postgres will be forced to do a sequential scan, which could be fairly slow on a large table. Now let's add the compound index and try again. As you can see, Postgres is using the compound index here, but this is still not as fast as it should be. Finally, we'll add the dedicated index for the student ID column. As you can see, Postgres is now able to fully take advantage of the index and improve the query performance noticeably. Note that none of this applies to the first column of a compound index. If we were to query on lesson ID alone, Postgres would be able to use the compound index just as efficiently, so there is no need to add a dedicated index for that case. Here's a summary of what we have learned. First of all, Postgres is generally very smart about using multiple indexes for a single query, so you should always just start with individual indexes. However, if you already need a compound index for other reasons, it gets a little more complicated. You would want to make sure that you prioritize the field you want to query separately and put that first in the compound index. Alternatively, you can maintain a dedicated index for any additional columns by which you would like to query. I'll give you a moment to read that. We were able to apply what we learned here and submit a PR to the Odin Project to improve this query's performance. While we were working on this, we noticed a similar case, this time on the lessons table. Here we have a compound index on slug and section ID, and we wanted to query on section ID. So we thought we could add a dedicated index for section ID as well. However, when we ran EXPLAIN ANALYZE, we noticed that Postgres wasn't using the new index. As it turns out, the lessons table was fairly small, less than 200 rows in total. 
When dealing with such a small table, Postgres is able to do the sequential scan much faster than using the index. So if we had added the index, it would have just been more work at insertion time without any benefits. The moral of the story here is: don't optimize prematurely, and always check your assumptions. It's often better to just do the straightforward thing and let Skylight tell you what's slow, then spend time investigating and optimizing your code. Ultimately, we were able to figure out why the second query was slow and send a second PR to fix it. However, that's a story for another time. If you're curious, come ask us about it at our booth after the talk. For now, let's hand it over to Vaidehi and Godfrey to talk about the open source app that they helped improve. Are you ready? Click when ready. Yes. Click when ready. Let's do it. Are you ready? I can't click till you're ready. That's good. It's ready enough. All right. Thank you, Kristen. So we decided to work on one of our favorite open source apps, called CodeTriage. What makes CodeTriage special is that it's an open source app that's built to help other open source projects. As you might know, popular open source projects receive a lot of bug reports, feature requests, and pull requests every single day. And just reading through all of them could be a huge time sink for all of these projects' maintainers. And we're involved in a number of popular open source projects ourselves, so we understand this problem pretty well. CodeTriage lets you help out your favorite project maintainers by signing up to help triage GitHub issues and pull requests. Once you sign up, the app will send a random ticket to your inbox every day. That way, you can help split the workload so that everyone only has to do a little bit of work every day. You can even sign up to help triage and improve documentation on the project, receiving one method in your inbox every day. 
As you can see, CodeTriage is actually a pretty cool app, and there are already tens of thousands of users helping out thousands of repositories this way. So that's pretty cool. Wow, that's a lot of projects and users. That's right, buddy. And as we'll see, running an app at this scale will create some pretty unique and interesting challenges. Take a look at this homepage, for example. This is what you see when you go to codetriage.com, right? It lets you browse through some of the most popular projects and shows you how many open issues need to be triaged for each one. That's a lot of information to render, and so it could get a little bit slow at times. At the same time, this is also by far one of the most popular pages in the app, because this is what everyone sees when they go to the main website, right? So Skylight is understandably marking it as a high-agony endpoint, meaning that we'll probably get some pretty good bang for our buck optimizing this page. Notably, even though there's a lot of information on the homepage, most of those things actually don't change all that often, which makes this page a prime candidate for caching. But as it turns out, a lot of performance-minded people have already worked on CodeTriage, and they've done extensive work in fine-tuning everything that needs to be done. So a lot of the things that we found that should be cached already were. However, when we looked at this page on Skylight, we noticed that in order to populate the meta tag, we needed to run two queries to fetch the counts of the users and the projects on the site. Since we probably don't need these numbers to be super up to date, we can get away with caching them. So we submitted a PR to cache this information for up to an hour. And as expected, when we deployed this change, we were able to shave off a little bit of time for the homepage. 
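The idea behind that PR can be sketched with a tiny expiring cache in plain Ruby. The real change would use something like Rails.cache.fetch with an expires_in option; everything below (TinyCache, the count value) is made up for illustration:

```ruby
# A minimal expiring cache: fetch returns the cached value while it's
# fresh, and only runs the block again once the TTL has passed.
class TinyCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @store = {}
  end

  def fetch(key, expires_in:, now: Time.now)
    entry = @store[key]
    return entry.value if entry && now < entry.expires_at
    value = yield
    @store[key] = Entry.new(value, now + expires_in)
    value
  end
end

CACHE = TinyCache.new
$count_queries = 0

def user_count
  CACHE.fetch(:user_count, expires_in: 3600) do
    $count_queries += 1  # stands in for SELECT COUNT(*) FROM users
    50_000
  end
end

3.times { user_count }
puts $count_queries      # the expensive count only ran once
```

Repeated requests within the hour reuse the cached number, so the two count queries drop out of the hot path entirely.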
So other than the header, the footer, and the meta tag, the bulk of the time is spent rendering these cards on this page, which is understandable. This might look pretty simple, but actually a lot of work goes into rendering these cards. However, even the most popular open source projects only receive new issues a handful of times per day, maybe up to a handful of times per hour. So most of these cards don't actually change that much. And you're probably thinking the solution is fragment caching, because we're using Rails, right? As we mentioned before, a lot of the obvious performance things in this app have already been done, and of course these cards are fragment cached. So that seems good. But in fact, they are better than fragment cached. They are actually cached in a pretty smart way. If you just learned about fragment caching, you would probably do something like this, where you loop through each of the cards and cache each card inside the loop. This works, but it has a problem. This would result in Rails going to your cache store, sending a network request to your cache store to get the fragment for each of these cards every time it renders one. We're looping through around 50 cards here, so that would end up making 50 separate requests to your cache store, which could add up to a lot of overhead. This is kind of like the N+1 query problem for caching, right? To solve this problem, the CodeTriage maintainers took advantage of a Rails 5 feature called collection caching. So instead of doing this, you move the content into a partial, and you do render partial with collection, give it the array, and say cached: true. By doing this, Rails will be able to fetch all the fragments it needs ahead of time in a single request to the cache, cutting down on a significant amount of network overhead. 
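Here's a toy illustration of why that matters, using a fake cache store that counts network round trips. This is plain Ruby standing in for the real Rails cache API, with made-up names:

```ruby
# A pretend cache store: every read, write, or read_multi counts as one
# network round trip, no matter how many keys it carries.
class CountingStore
  attr_reader :round_trips

  def initialize
    @data = {}
    @round_trips = 0
  end

  def read(key)
    @round_trips += 1
    @data[key]
  end

  def read_multi(*keys)
    @round_trips += 1  # one request, however many keys it asks for
    keys.to_h { |k| [k, @data[k]] }
  end
end

cards = (1..50).map { |i| "card/#{i}" }

# Per-item caching inside the loop: 50 separate round trips.
per_item = CountingStore.new
cards.each { |key| per_item.read(key) }

# Collection caching: all fragments fetched up front in one request.
collection = CountingStore.new
collection.read_multi(*cards)

puts "per-item: #{per_item.round_trips}, collection: #{collection.round_trips}"
```

Per-item caching costs one round trip per card, while the collection-cached version fetches all 50 fragments in a single read_multi, which is the "N+1 for caching" fix in miniature.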
And you can see this reflected on Skylight as well, the cache read_multi there. That's the single request that Rails is sending to the cache store to get all of those fragments. So admittedly, this is pretty cool, but as we looked more closely at this, we realized that something was wrong. If you look at the logs, you will see something that looks like this. And as you can see, for each request to the homepage, we're missing more than half of the 50 cards in the cache. If this were a one-time occurrence, it probably wouldn't be a really huge deal. Rails would simply generate the missing content and write it to the cache. Then any subsequent requests would be able to take advantage of that cached content. However, the strange thing is that we were able to consistently reproduce those cache misses, even when processing multiple consecutive requests to the homepage. So perhaps the cache was way too small? However, the math doesn't really check out. We had provisioned a 100 megabyte memcached store for this app. Sure, it would have been nice to have a bigger cache, but each of these cards is around a few hundred bytes. So even at this size, it would still take at least a couple hundred thousand of these cards to fill up the entire cache. If this square is the 100 megabyte cache space, the 50 cards would probably occupy a fraction of a pixel in this one square. So this was definitely not our issue. It is possible that there's something very big in the cache that's taking up all this space. However, our mental model is that memcached is an LRU, least recently used, store, which means that new cache data will push out the data that hasn't been used in the longest time. But since we're not caching anything big on the homepage itself, our mental model would suggest that memcached will evict older items in the cache as needed in order to make room for our cards. At this point, it's pretty obvious that our mental model is wrong. 
And we need to get a better understanding of how memcached actually manages its data. Now, before we get into this, I should probably mention that we're not actually running memcached ourselves. We're using a SaaS provider that implemented a proprietary memcached-compatible store. So some of these details are specific to that implementation and may not apply to stock memcached, but most of the things here apply to memcached in some way. The first thing we learned is that our cache store is not just 100 megabytes of free space that is available for everything. Memcached actually groups the data you're trying to store into different tiers based on size, and manages the space for each tier separately. You can see this in action by running the stats items command in memcached. The output looks something like this. There's a lot of stuff here, but it's basically telling you a few things: the sizes of the different tiers, the number of items currently stored in a particular tier, as well as how many items have been evicted from that particular tier. Since all the card fragments that we're looking at on the homepage are generally around the same size, they end up being cached in the same bucket, specifically bucket number four. We can see that bucket number four is responsible for storing things up to a kilobyte, and the tier before it is responsible for storing things up to 512 bytes. So this ends up being the tier for anything that's between 512 bytes and one kilobyte, which fits all of our cards. And you can see that we currently have around 5,000 items in this tier, which works out to around five megabytes of storage, which is much smaller than the original 100 megabytes that we assumed was available to us. 
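The bucketing behavior can be sketched roughly like this. Real memcached sizes its slab classes with a configurable growth factor rather than the simple doubling used here, so treat this as a cartoon of the idea rather than the actual algorithm:

```ruby
# Toy slab classes: each tier holds items up to a fixed slot size, and
# space is managed per tier, not across the whole cache.
TIERS = [64, 128, 256, 512, 1024, 2048, 4096] # slot size in bytes

# An item lands in the smallest tier whose slot can hold it.
def tier_for(item_size)
  TIERS.find { |slot| item_size <= slot }
end

puts tier_for(800) # an ~800-byte card lands in the 1 KB tier
puts tier_for(300) # a 300-byte item lands in the 512-byte tier
```

Each item only competes for space with the other items in its own size tier, which is why a few hundred bytes' worth of cards could be evicted constantly while most of the 100 megabytes sat in other tiers.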
So the implication here is that similarly sized items in memcached might be competing for a much smaller amount of space than you expected. Okay, so, sure, we only have five megabytes to actually work with, which is a lot smaller than we originally thought. But still, five megabytes is enough space to hold 5,000 cards, and if we look at our homepage, we are only trying to cache 50 cards. So once again, the math doesn't really add up, and this isn't quite the end of our story. However, from the same stats command that Godfrey mentioned before, you can see that the eviction count is indeed very high on this bucket. We've evicted almost two million items from this one bucket alone. So this supports our observation that our cache is overflowing. It's spilling things out as fast as we put things into it. At this point, the only explanation is that there must be something other than our cards in this bucket that's taking up all this space, and somehow they're not getting evicted. It would be helpful, of course, if we could just open up this bucket and see what those things are. Unfortunately, the SaaS provider that we were working with didn't offer that ability. But based on our research, we had an educated guess. There's a feature in memcached that automatically expires cached entries, which is also known as TTL, or time to live. Our theory was that when mixing TTL data and non-TTL data within the same bucket, this cache store would always choose to evict non-TTL data before any of the TTL data was ever considered for eviction. And since our cards didn't have a TTL and our cache was already pretty full, they would always be the first ones to be evicted. So what we ended up doing was flushing the entire cache manually. And sure enough, it freed up space for our cards. This allowed us to achieve a 100% cache hit rate on the homepage, and that resulted in a pretty dramatic performance improvement. 
So another good thing that came out of this investigation was that we actually discovered some bugs in other open source projects too. If you look at the Skylight trace from before we flushed the cache, you will see that, yes, we are doing one cache read_multi, so we avoided the N+1 query problem for caching, but then we are immediately spending a lot more time in cache read and cache write. So what's going on? Looking at the logs, we'll see that this is the read_multi. It's a very long line, so I didn't fit it entirely on the screen, but it's basically fetching all 50 fragments in one request. But then immediately after that, when we're rendering the missing rows, we end up doing another read to the cache store, even though we already know there's nothing in the cache store, because the cache store didn't return anything in the original read_multi. That's the whole point of that request. In other words, the read time here, which, by the way, is longer than the original read_multi, is completely unnecessary, and we should be able to eliminate this bar entirely, which is taking up a noticeable amount of time in the page, even when we don't have a 100% hit rate, which is what happens in the real world. And if you keep reading the log, you will also see a second problem. On the top, we're doing the unnecessary read to get the non-existent data from the cache, regenerating it, and writing it back, which seems fine. But then, after the loop, we're immediately writing all of the missing rows that we just wrote to the cache again, for no apparent reason, in the same request. Which means that we should also be able to cut this time by roughly half. Finally, as we were working on this, we also discovered a bug in Rails 5.2 that is causing each cache item to take up twice as much space as necessary. And we believe all of these are bugs in either Rails or the Dalli gem. 
And before we caught our flight to RailsConf, we were working on patches to fix these issues, and hopefully we'll finish them when we get back to Portland. But notably, the particular bug about Rails 5.2 compression has already been fixed and merged on master, and you should expect to find it in a Rails 5.2.1 release very soon. So we initially started by looking for ways to help our favorite open source apps using Skylight. And I think we did accomplish that, but we also ended up accomplishing something kind of bigger than that too. Through this journey, we ended up finding ways to contribute back to upstream projects like Rails. And we hope that the Skylight for open source program, and this journey in our OSSathon, will help inspire more contributors to do the same. So speaking of journeys, I'm actually gonna be giving a talk tomorrow about the Rails router's internals, and it's going to be a wonderful journey. I'm trying to drive it home. I don't know if it's working. But I hope to see you all there. All right, so that's something to look forward to for tomorrow, but for now, since we promised to bring the entire company, here is our CEO Leah with a baby. I usually bring him on stage with me, but he was so quiet and wonderful, and I don't want to mess with it. I chose this Bitmoji because that's what CEOs do: show up at the end after you've done all the work and just take all the credit. So thank you all, that was fabulous. Yeah, this is our team. We are a small, I want to say a little family, but I feel like all companies call themselves family and it has no weight anymore. But we do also invite our actual families to work with us. For example, we have a babies at work program, and Jonas, who is my son, over here, and Kristen's daughter hang out with us at the office all the time, because why not? Turns out it's not actually impractical, and it makes everybody happy, because you walk down the hall and there's babies babbling and smiling and stuff. 
And I did write a blog post about it, about our babies at work program. So if you are trying to attract talented parents, and women specifically, mention it to your HR person, who will look at you like you've lost your mind, but then point them at my blog post, because I start out basically like: here's all the reasons I thought it was a terrible idea, and I was wrong. So we build Skylight, we work really hard on it, and we like giving back to the community. We are particularly focused on making performance a thing that everybody on your team can be involved in. A lot of people, especially towards the beginning of their programming careers, have this conception that this is the sort of thing that happens in dark basements with the super ops people that they don't know or could never be. And I guess in some cases, or using some tools, that could be true, but one of our missions is to make that not the case, to make it so anybody on your team can have a spare cycle of 15, 20 minutes, drop into Skylight, figure out what they can do, and find some low-hanging fruit. We have a lot of that low-hanging fruit outlined on our documentation site. For example, if some of the things that you saw here are not things that you know how to do, many of them are in our doc site, pointing you in the right direction to be able to fix them in your own app. So yeah, we have a booth, we're downstairs, we're hard to miss. I was supposed to make a joke about how professional we are. We're very professional. We make a lot of jokes. We make a lot of puns. If your pun game is not up to snuff, I don't know if we're a good place for you, but you can work on that. This is a skill that can be built. And yeah, this is us. We're excited to meet you. We hope you use Skylight on your paid projects, on your free open source projects, and we'd love to talk to you about it. So come on by. Thank you all. Thank you.