So hello everyone, I'm Akira from Japan, and I'm presenting something about Rails. I mean, this talk is about Rails performance, but I'm not going to talk about general application programming techniques like eliminating N+1 queries or using Rails cache or things like that. Instead I'm going to focus on how we can find real problems inside the framework and how we can approach them. So the title says Rails applications, but the main focus is not really on our application code, but more on Rails itself. Today I brought some actual hacks as Rails plugins, so I can share them with you and maybe you can check them out later. So I have a question: is your application fast enough, do you think so? My answer is no; for me, almost always no. It's like this: when you start a project it runs fast enough, but as the application grows, at some point it becomes really slow. I guess this applies to every one of you here, right? So is that essentially because Ruby is slow? I think no; Ruby is actually already doing very well. So the real problem, I think, lies in the framework architecture and some very slow components of Rails. Here's a diagram showing an actual application's performance. It's not from my own Rails application; I'm sorry, I downloaded this image from Skylight's website without asking them, so I'll delete it if they don't like it. Anyway, this diagram shows that it executes the first query and the second query and the next query in serial order. This means these are actually executed serially in the main thread. For example, while querying the database, Ruby is just waiting. In other words, these are all blocking operations in the main thread. So what if we could perform them without blocking the main thread, like in parallel, in a non-blocking manner? So here is the menu of today's presentation; I have five topics.
So to begin with, let's start with the simplest topic: API calls. What I mean by API calls here is usually something done over HTTP. It could be invoking some outside API, or microservices. When you introduce microservices to your application, it usually adds some extra network overhead and actually makes your application rather slower. So here's the real problem: calling an external API makes your application slow because of the HTTP network cost. This happens because the API call blocks the main thread while requesting another HTTP server, and the CPU does nothing while waiting for the response. So how can we make this non-blocking instead? I brought an example. It's a very simple one: the client makes three requests to a very heavy API that takes one second per call. The API looks like this; it just sleeps one second. And here's the client code; it just calls the API three times. Very simple. Here's the result: it takes three seconds. With this, I could kind of emulate the slow API problem. So how can we fix it? It's very simple Ruby code: just wrap the API call in a Thread.new block like this. Now it finishes in one second. This is kind of the future pattern. When you call Thread.new, the thread immediately starts running in the background. Then if you call .value on the thread object, it waits for the thread to finish and returns the value back to the main thread. And you can do anything else in the main thread while the API call is executing in the background. Usually we don't use the raw Thread object but wrap it in another object, but anyway, it's called the future pattern. With this basic idea in mind, let's proceed to the next topic. This was a very simple example of pushing an IO-blocking task to a child thread and doing something else in the main thread. So the next topic is boosting database queries.
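The future pattern described here can be sketched in plain Ruby like this. `fetch_api` is a hypothetical stand-in for the one-second API call from the example, not a real endpoint:

```ruby
# Hypothetical stand-in for the slow API from the example:
# each call blocks for about one second.
def fetch_api(n)
  sleep 1
  "response #{n}"
end

started = Time.now

# Thread.new starts each call running in the background immediately...
futures = (1..3).map { |n| Thread.new { fetch_api(n) } }

# ...and .value joins the thread and returns its result, so three
# one-second calls complete in roughly one second of wall time.
results = futures.map(&:value)
elapsed = Time.now - started
```

The main thread is free to do other work between creating the futures and calling `.value` on them.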
Database queries are time consuming, which is obvious. And a query is essentially just another kind of IO-blocking task: while querying the database, the main thread is just sleeping. So this is how ActiveRecord handles database connections. Basically, ActiveRecord checks out a connection from the pool per thread, including the main thread. One request uses only one connection, but the connection pool actually holds many more pooled connections. So we can probably use these extra connections from threads other than the main thread. Here's the problem: a database query blocks the main thread, and when you send a query to the database, you need to wait. And we already have a good solution for this: maybe we can apply the future pattern to this problem as well. Now let me show you an example. Consider we have a very heavy query again, like this. It's a very simple example: when it selects a user, it sleeps the same number of seconds as the user ID. It's a silly example, but you get the idea. So it actually takes three seconds to select user 1 and user 2, like this. And using threads, we can do this in two seconds. This is great. So why doesn't ActiveRecord act like this by default? Because there were some problems with this approach. As I told you before, each thread automatically grabs a new connection from the connection pool. So if you establish too many new connections, the connection pool will dry up very easily. That's the problem. But ActiveRecord actually has the with_connection API for this. Whenever you run a query inside a child thread, just wrap it in a connection_pool.with_connection block like this; it automatically checks the connection back in when the block finishes. When we print the connection pool stats on the last line, it says we have five connections in the connection pool and we used three of them for this operation.
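The check-out/check-in behavior of with_connection can be illustrated with a toy pool built on Ruby's thread-safe Queue. This is only a sketch of the idea; the real API is `ActiveRecord::Base.connection_pool.with_connection`:

```ruby
# Toy connection pool: with_connection checks a connection out for the
# duration of a block and always checks it back in afterwards, so the
# pool never dries up no matter how many threads borrow from it.
class ToyPool
  def initialize(size)
    @queue = Queue.new
    size.times { |i| @queue << "conn-#{i}" }
  end

  def with_connection
    conn = @queue.pop        # blocks if the pool is dry
    yield conn
  ensure
    @queue << conn if conn   # check the connection back in
  end

  def available
    @queue.size
  end
end

pool = ToyPool.new(5)

# Three child threads each borrow a connection, like the example above.
threads = 3.times.map do |n|
  Thread.new { pool.with_connection { |conn| sleep 0.1; "query #{n} on #{conn}" } }
end
results = threads.map(&:value)
```

After all threads join, every connection is back in the pool.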
So I actually implemented something that does this for ActiveRecord as an experimental plugin, with these two APIs. The first is ActiveRecord::Relation#future (sorry, the highlighting is broken), used like current_user.posts.future. If you call the .future method, it starts querying in the background. Then if you call .records, it returns the asynchronous query results. It's already on GitHub; the repository is here, named future_records, so you can check it out. It kind of works, but I don't think it's totally production ready, so please be very, very, very careful if you actually try it. Maybe we need to add a thread pool instead of calling Thread.new every time; I'm going to explain what a thread pool is later. By the way, there could be several other approaches to this slow query problem. The first is to share only the one connection that is used by the main thread, passing it to child threads, or to use an asynchronous connection API. The first idea is sharing the main connection: pass it to a child thread and do something else in the main thread while the child thread is querying. Maybe we could use Fiber for this; I tried, but it was so difficult that I couldn't finish. The other idea is that some database adapters, like mysql2 or pg, have an asynchronous query API. It goes like this: if you run the query in asynchronous mode, it immediately returns nil and runs the query in the background; when the query is finished, you can get the result with a separate result method. In order to use this in your application, you need to create a mechanism to detect exactly when the query has ended. I could kind of make this work locally, but it requires a super crazy hack on ActiveRecord. I needed to use EventMachine, which is so complex. I don't really recommend you try this, but if you want to, there's an existing project called em-synchrony, so check it out.
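A .future / .records style API could be sketched as a thin wrapper around a thread. This is my simplified illustration of the pattern, not the actual future_records code; `slow_query` stands in for a real relation like `current_user.posts`:

```ruby
# Hypothetical sketch of a .future / .records style API:
# the query starts running the moment the future is created.
class FutureRelation
  def initialize(&query)
    @thread = Thread.new(&query)
  end

  # Joins the background thread and returns the query results.
  def records
    @thread.value
  end
end

# Stand-in for something like current_user.posts taking 0.2s.
slow_query = -> { sleep 0.2; [{ id: 1 }, { id: 2 }] }

future = FutureRelation.new(&slow_query)
# ...the main thread can do other work here while the query runs...
posts = future.records
```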
So anyway, we're moving on to views as the next topic: Action View. We often have very slow partial templates, because render partial, of course, again blocks the main thread. In most cases a partial does not share anything with other partials, so maybe we can render them asynchronously. Maybe with Ajax. So I did this; here's my implementation. Add a remote: true option to render partial, and it fires an Ajax request and kind of renders it back over Ajax, or pjax, or something. Actually, I did this five years ago, and I realized it's a bad approach. I created this, but I don't use it in production anymore. So instead, let's think about simply threading again, the future pattern again. Here's the initial implementation of running Action View's render partial in the background: if you pass an async option to the render method, it wraps the render partial call in a future object, I mean a future pattern object. So this is how it works. Let's try it by simply adding sleep 1 to each template, like this: we have an a action, and the a action renders the b partial from a.html.erb, and the a view and the b partial each sleep one second. And the result is, sorry, I forgot to paste the result, but it kind of finishes in one second and returns the correct HTML, so it's done. I'm sorry, not done; it took two seconds actually. So we've got no performance gain. I used a thread, so why didn't this become faster? Let's see what's actually happening inside the framework. Well, Action View compiles each template and partial template into a Ruby method, so I put some debugging code into Action View. It outputs something like this in the console, and there in the middle it calls render b, and it passes the result of render b to output_buffer.append. And output_buffer.append is implemented like this.
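The "no performance gain" result can be reproduced without Action View. A string-based buffer has to stringify every fragment the moment it is appended, which joins each future right away. `StringBuffer` and `Future` here are simplified stand-ins for ActionView's OutputBuffer and the future object described here:

```ruby
# A future whose to_s joins the background thread.
class Future
  def initialize(&block)
    @thread = Thread.new(&block)
  end

  def to_s
    @thread.value
  end
end

# String-based buffer, like ActionView's OutputBuffer: it must call
# to_s on everything it appends, so each future blocks in turn.
class StringBuffer
  def initialize
    @string = +""
  end

  def <<(value)
    @string << value.to_s   # this to_s forces the join immediately
    self
  end

  def to_s
    @string
  end
end

buffer = StringBuffer.new
started = Time.now
2.times { buffer << Future.new { sleep 0.2; "x" } }
elapsed = Time.now - started
# The two 0.2-second "renders" ran serially: no speedup at all.
```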
So what I did was create a future object with render async, which then immediately appends the future object to the buffer, and the buffer does this: it calls .to_s on the given object. So output_buffer.append on the render partial result creates a future object, then immediately calls to_s on it, and that causes the background thread to join. Why does Action View call to_s inside the buffer? Because the output buffer is a String, and you need to make sure that anything added to the string is a string, or it would cause some kind of unexpected behavior. You cannot append a symbol, and if you append an integer, it's going to be treated as a codepoint. So you need to call to_s before appending anything to the string. So how can we make the future object live longer in the buffer, without calling to_s immediately? What if we store the view fragments in an array instead of a string, then concatenate all the fragments when returning the HTML response body? The idea is like that. And here's the implementation; it's called ArrayBuffer. I overrode the concatenation method to store the given fragment in an array, and then when to_s is called, it just kind of joins them all. So let's take the benchmark again. I did this, and now it returns the result in one second. But please note that real-world templates usually do more CPU-bound work than just sleeping one second; this is just a demonstration. By the way, if you're looking for the fastest template engine in the world built on a string-based buffer, there's an implementation called string_template. I'm sure it's the fastest. You can find it on my GitHub page. It compiles the whole template into one very long string literal with string interpolations, so it creates only one string instance per template. So it runs very fast, at least in some microbenchmarks. Anyway, now let's see how the array-based version of the buffer scales. Here's what's created by a generated scaffold.
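Here is a minimal sketch of that array-based buffer idea. It's deliberately simplified; the real implementation has to behave like ActionView's OutputBuffer in many more ways:

```ruby
# Array-based buffer: fragments are stored as-is, so a future keeps
# running in the background until the whole response is stringified.
class ArrayBuffer
  def initialize
    @parts = []
  end

  def <<(fragment)
    @parts << fragment        # no to_s here; futures keep running
    self
  end

  def to_s
    @parts.map(&:to_s).join   # futures are joined only at the very end
  end
end

# A future whose to_s joins the background "render" thread.
class FutureFragment
  def initialize(&block)
    @thread = Thread.new(&block)
  end

  def to_s
    @thread.value
  end
end

buffer = ArrayBuffer.new
started = Time.now
buffer << "<ul>"
3.times { |i| buffer << FutureFragment.new { sleep 0.2; "<li>#{i}</li>" } }
buffer << "</ul>"
html = buffer.to_s
elapsed = Time.now - started
# The three 0.2-second "renders" overlapped instead of running serially,
# and the fragments still come out in the correct order.
```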
First, it extracts the repeated part from index.html.erb into a partial like this. And I put something in again to sleep randomly in the partial. I registered 10 users in this scaffold and requested it from a browser. And the result goes like this. Or sometimes it returns some weird error like this, a thread error of some kind. So what the hell is happening here? Of course, you know, it's a race condition. So why does this partial cause a race condition? Can you find the answer? Because it shares something between threads: it shares an instance variable, @output_buffer, between threads, right? That causes the race condition. So we need to change the buffer object to be a local variable or a thread-local variable or something, which means we need to monkey-patch Action View. We need to monkey-patch the template handler first. I'm not going to paste the whole patch, but I've done this. I had to do so much of this, but I did it, actually. So this is how we fix, kind of fix, the ERB template handler. And let's try rendering again. Actually, with this fix, the former example works perfectly; it outputs the partials in the correct order. So now let's render something else. The next example is new.html.erb rendering the _form partial from inside. Let's try to render this with the async option. Now it renders broken HTML like this. So why does this happen? It happens because of Action View's capture helper, which calls something called capture inside, which uses the buffer, and it actually creates a new buffer and somehow throws it away. So you get only the content inside the form_for block without anything being added to the main buffer. So, I'm sorry, I fixed this somehow. I kind of made a horrible hack on Action View, and I kind of emulated this with local variables only. So with this patch, we can run as many threads as we like at once.
But if you run hundreds or thousands of threads, the response time actually gets worse, because threads are costly and switching between threads takes some time for Ruby. So in order to make it faster, we need to control the number of running threads, the number of spawned threads. To do this, we'll introduce something called a thread pool. Of course we could create our own thread pool implementation, but Rails already ships with one inside the concurrent-ruby gem, so we can just use that. So with this thread pool, I finally kind of finished implementing an async partial renderer. But we still have to monkey-patch all the other template engines, like Slim, Erubi, Haml, et cetera, et cetera. And especially, monkey-patching Haml is so horrible. I mean, I probably shouldn't say this as the maintainer of Haml, but the Haml code base is just horrible. So I couldn't finish monkey-patching Haml before today; I'm sorry, this gem is not actually finished. Anyway, the code is here again on my GitHub page, so you can check it out. Oh, and I talked about all these template engines rendering HTML files, but what about JSON renderers? We have the default JSON renderer called Jbuilder, but unfortunately it's completely not working, because it's another horrible template engine. But I suppose you're not using Jbuilder anymore, because we have a better alternative called Jb. Jb, of course, works perfectly with my asynchronous renderer, I mean ArrayBuffer, because it's implemented in a very good manner. So check out Jb; you can find it here. Anyway, let's move on to the next topic: lazy attributes. Let's look inside the view code and find what's slow there. I'm creating another realistic example like this: I scaffolded a model with 100 columns and created 1,000 records.
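A fixed-size pool keeps the number of live threads bounded. Rails bundles one via concurrent-ruby (for example Concurrent::FixedThreadPool); here is a stdlib-only toy version just to show the mechanism:

```ruby
# Toy fixed-size thread pool: N worker threads consume jobs from a
# queue, so posting 1,000 jobs never creates 1,000 threads.
class TinyPool
  def initialize(size)
    @jobs = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        while (job = @jobs.pop)   # a nil job shuts the worker down
          job.call
        end
      end
    end
  end

  def post(&job)
    @jobs << job
  end

  def shutdown
    @workers.size.times { @jobs << nil }   # one stop signal per worker
    @workers.each(&:join)
  end
end

pool = TinyPool.new(4)
results = Queue.new
10.times { |i| pool.post { results << i * i } }
pool.shutdown

squares = []
squares << results.pop until results.empty?
```

In a real application you would use the concurrent-ruby pool Rails already depends on rather than rolling your own.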
So obviously very real-world-ish, right? We can run this with curl like this, and I ran it five times and got results like this: it takes around 1.6 or 1.7 seconds per request, and you can see most of it is spent in views, right? So let's see what's happening in the views. What if we change the attribute accessors to string literals, like this, just quoting them? The result goes like this: the score improves to about 800 milliseconds. This means half of the response time was spent just reading from already-selected ActiveRecord instances, right? It should be just a method call; why does it cost that much? So I counted the number of method calls made when calling an attribute accessor. It turned out that a single attribute access actually makes 13 method calls inside. And if I access a datetime or timestamp attribute, it makes 30 method calls inside the framework. So looping over 1,000 records and accessing 100 columns makes that many method calls? Yes, it does. So this is why ActiveRecord is slow, right? It's because the framework is written to be slow. Of course, you're not doing this in your real-world application; you're going to use pagination, with something like this. But there are actually some real-world use cases, like APIs or fintech applications or something like that. In fact, we hit this real problem in our application at Money Forward. We had to render this many models, and it took like 10 seconds, really. So, in my opinion, the ActiveRecord model is designed to do too much work; it's too rich an object. It actually implements two different roles in one model class: one is a data transfer object, and the other is a form object, right? The former can be read-only, while the latter has to be a read-write object, so it needs validations and typecasting, et cetera, et cetera.
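Numbers like these can be measured with Ruby's TracePoint. `Record` below is a hypothetical stand-in whose reader fans out into a few helper calls, roughly how ActiveRecord's attribute accessors are layered internally; the exact counts in a real app will differ:

```ruby
# Count Ruby-level method calls made while running a block.
def count_calls
  count = 0
  tp = TracePoint.new(:call) { count += 1 }
  tp.enable { yield }
  count
end

# Hypothetical stand-in: one attribute reader layered over helpers,
# the way an ActiveRecord attribute access goes through the framework.
class Record
  def name
    cast(read("name"))
  end

  def read(key)
    storage[key]
  end

  def cast(value)
    value.to_s
  end

  def storage
    { "name" => "a" }
  end
end

# One attribute access already triggers several internal calls.
calls = count_calls { Record.new.name }
```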
So what we need in this read-only scenario is a more lightweight read-only object, like a data transfer object, something like an entity bean in Java, right? You could create something like that simply with a Ruby Struct or whatever, but you probably don't want to, because it's not going to play nicely with other plugins. So instead, why don't we just store the attributes as a hash instance inside the model object, like ActiveRecord used to do? Let's try this: solving this real problem not by adding more complexity but by bringing back the simplicity of ActiveRecord 2 or 3. Again, we need to monkey-patch ActiveRecord, because recent versions of ActiveRecord implement the Attribute API. It's a very good feature indeed, but it's actually very heavy. To be fair, we do use this feature at Money Forward for some columns, but I think it's a very rare case. Inside the Attribute API, it creates attribute objects per column for each ActiveRecord model instance; that is why it's so heavy. So why don't we opt out of this Attribute API feature? That is, not create those objects by default, but only for users who actually use the attribute feature. Here's my implementation: if the model declares no custom attribute, it returns an ActiveRecord 3 style object. First, I made a simple attribute set object that delegates to a hash object, then an attribute set builder object that builds that lightweight attribute set, then overrode the attributes builder to return the lightweight builder if the model has no custom attributes. And here's the result. Again, it used to take 1.6 or 1.7 seconds per request, and now the result improves like this, around 40% faster. But I'm sorry, it's still not production ready.
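The hash-backed read-only object could look roughly like this. This is a deliberately naive sketch of the data-transfer-object idea, not the actual plugin code:

```ruby
# Read-only model that answers attribute readers straight from a hash,
# the way early ActiveRecord did, with no per-instance column objects.
class LightRecord
  def initialize(attributes)
    @attributes = attributes
  end

  def [](name)
    @attributes[name.to_s]
  end

  # Turn user.name into a hash lookup instead of a generated accessor.
  def method_missing(name, *args)
    key = name.to_s
    @attributes.key?(key) ? @attributes[key] : super
  end

  def respond_to_missing?(name, include_private = false)
    @attributes.key?(name.to_s) || super
  end
end

user = LightRecord.new("id" => 1, "name" => "Akira")
```

Typecasting on read (mentioned just below) is exactly what this sketch still lacks.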
I need to implement some typecasting when reading, but I'm sure it would still be faster than the default ActiveRecord object. I'm going to skip this. I have one more thing to cover: named URL helpers. Now the model is fast enough, so what's slow next? The answer is links, these links. If we just delete these links, the result goes from this to this; it improves by about 35%. So the problem we found is that named URL helpers are slow. And the solution is already here: if the buffer is array-based, you can just make the URL helper call a future object, so it's kind of done. Or another solution is that maybe we can cache the results in memory using Rails cache or something. I actually created this two years ago, so check it out. So, conclusion. Sorry for running over time. I'm going to close my talk by revisiting what we learned today. First, we have so many slow things in our applications, and maybe we can solve some of them with Ruby threads. Maybe. You can find what's slow in your application and you can fix it; if the problem lies in the framework, go ahead and hack the framework. Next, to investigate a performance problem in the framework, perhaps you need to do a lot of monkey-patching. Good luck. Next, what I learned through preparing this talk is that thread programming is very hard; I don't want to do it anymore. So, future plans: I'm going to finish implementing these plugins, and I'd like to put them into actual production applications. And I'd probably like to improve Rails to accept more of these kinds of patches, I mean monkey-patches. That's it. Have fun hacking for performance. That's all. Thank you.