So, I'm Paolo Valente, I'm an assistant professor at the University of Modena and Reggio Emilia, but I'm here as a collaborator of Linaro, and this presentation is about work supported by Linaro. The protagonists of this presentation are the high latencies caused by I/O, and a possible solution to this problem: the BFQ I/O scheduler. To make sure that nobody lacks the information needed to follow what I'm about to show you, I'll start by spending a few words on I/O schedulers. Then we will go straight to the point: I will show you these high latencies, and I'll show you BFQ at work. In particular, I'll show you demos and numbers on the systems listed on the slide. We have run, and we are running, tests on other systems too, but I will limit this presentation to the systems listed there. This presentation might raise some questions, so I have already added the answers to two likely questions directly to the presentation. I guess these questions will become clearer after I show you what I want to show you.

So, I/O schedulers. An I/O scheduler is a component that decides the order in which I/O requests are served by a given storage device. In particular, an I/O scheduler does this when several entities — processes, groups of processes, applications — compete for the same device. This reordering, this control over the order of I/O requests, is rather important. Why? Because it is the only way to achieve several important goals. One is reaching a high throughput, because the throughput of storage devices is very sensitive to the order in which I/O requests are dispatched to the device. Another important goal is guaranteeing a low latency to tasks that need I/O to be accomplished. One example is reading frames to play back a video. Other important examples are interactive tasks, such as starting an application, reading a file, saving a file, and so on.
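To make the throughput point concrete, here is a toy sketch — not kernel code, and with made-up request positions — of why the order of requests matters so much on a seeky device: serving requests in position order, elevator style, travels far less distance than serving them in arrival order.

```python
# Toy illustration (not kernel code): why request order matters.
# We model the cost of serving requests on a seeky device as the
# distance the "head" travels between consecutive request positions.

def total_seek_distance(requests, start=0):
    """Sum of distances travelled when serving requests in the given order."""
    pos, total = start, 0
    for r in requests:
        total += abs(r - pos)
        pos = r
    return total

# Pending requests in arrival (FIFO) order; positions are hypothetical sectors.
pending = [900, 10, 880, 30, 940, 50]

fifo_cost = total_seek_distance(pending)
# An elevator-style scheduler serves them in position order instead.
sorted_cost = total_seek_distance(sorted(pending))

print(fifo_cost, sorted_cost)  # reordering cuts the travelled distance a lot
```

Real schedulers balance this reordering against the latency goals discussed next; pure throughput-oriented reordering is exactly what lets sequential I/O starve random I/O later in the talk.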
If you guarantee a low latency to interactive tasks in a system, then you are guaranteeing a high responsiveness to the applications in the system and to the overall system itself. In the mobile-systems community, responsiveness — or, more precisely, the opposite of responsiveness — is usually measured and indicated as lag. So guaranteeing a high responsiveness means guaranteeing a low lag. In this presentation, we will use the start-up time of applications as a measure of responsiveness. Starting an application is an important example of generating interactive tasks, and measuring the start-up time of applications also gives an idea of the latency that you guarantee, in general, to any generic interactive task. I/O schedulers also have other goals, but they are out of the scope of this presentation.

Finally, where are these I/O schedulers in Linux? They are in the I/O stack of the kernel. This I/O stack currently comes in two flavors. One is the legacy block layer, blk; the other is the new multi-queue block layer, blk-mq. The sets of schedulers that you find in the two versions of the block layer differ. The only scheduler in common is deadline. In particular, in legacy blk you don't have BFQ: to use BFQ you have to pick it from out of tree and somehow add it to your system. In blk-mq, fortunately, you also find BFQ. I'll stop here about I/O schedulers; I'll show you in practice what these schedulers can do and, above all, what they cannot do.

So let's introduce these tests. All the tests have been run on systems using eMMC or SD cards, both on Android and on some Linux distributions. As for Android, the kernel versions used in Android systems are too old to have blk-mq support in the eMMC subsystem. So, no blk-mq support, no BFQ.
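For reference, the schedulers available for a device show up in `/sys/block/<dev>/queue/scheduler`, with the active one in square brackets. A small helper (my own, illustrative) that parses that well-known format:

```python
# The I/O schedulers available for a device are listed in
# /sys/block/<dev>/queue/scheduler, with the active one in brackets,
# e.g. "mq-deadline kyber [bfq] none" on blk-mq,
# or   "noop deadline [cfq]" on legacy blk.

def parse_schedulers(sysfs_line):
    """Return (active_scheduler, all_schedulers) from a sysfs scheduler line."""
    active = None
    names = []
    for name in sysfs_line.split():
        if name.startswith('[') and name.endswith(']'):
            name = name[1:-1]   # strip the brackets marking the active one
            active = name
        names.append(name)
    return active, names

active, available = parse_schedulers("mq-deadline kyber [bfq] none")
print(active)  # bfq
```

Switching scheduler at runtime is done by writing the name back into that same file, e.g. `echo bfq > /sys/block/sda/queue/scheduler` as root.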
In these tests, I will use the out-of-tree version of BFQ, which is available also for legacy blk. On the opposite side, for Linux distributions, you do have blk-mq support in the eMMC subsystem from 4.16 onwards, and in my tests I have used systems with a recent enough kernel: a release candidate of 4.16. As for Android, the two systems tested are the HiKey board and the Google Pixel 2 phone. For Linux distributions: again the HiKey board, then the PogoPlug Series 4, and a laptop with some SD cards.

OK, I think I have already told you all I had to, so we can start with the first demo. Actually, I'll show you a trimmed version of this demo — the full demo is longer and there are other parts; I will focus only on the parts about lag. In this demo, we measure lag by starting the Facebook app in two different conditions: first with nothing in the background, so no I/O workload in the background, and then while some updates are in progress. These updates are performed through scripts, in an automatic way, and in particular they are run while controlling the download speed, so as to test a wide range of scenarios. In the test I'm about to show you, the download speed is around 15 megabytes per second. At this speed, starting Facebook starts to be a real problem. I'll show you results only with noop, but they are the same with CFQ and deadline — I remind you that here we have the legacy block layer.

OK, I think I told you everything, so we can start. On the left side you will see the start-up of Facebook when the system, the board, is idle, while on the right-hand side the same exact task, but this time with updates in the background. It takes just six seconds to start Facebook, while in the other case you have to wait much more. So here we have these updates at a controlled rate, 15 megabytes per second.
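The measurement pattern behind such demos can be sketched as follows. This is a hypothetical simplification, not the actual demo scripts: it just times how long a command takes, the way a responsiveness benchmark times an application start with and without a background workload.

```python
import subprocess
import sys
import time

def startup_time(cmd):
    """Time how long a command takes to start and complete, in seconds.

    A real responsiveness benchmark (like the ones behind these demos)
    would first drop the page cache, e.g.:
        sync && echo 3 > /proc/sys/vm/drop_caches
    so that the application really does I/O; we skip that here.
    """
    t0 = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - t0

# Hypothetical usage: measure once with the device idle, then repeat while
# a background writer (e.g. a large file copy) keeps the storage busy.
idle = startup_time([sys.executable, "-c", "pass"])
print(round(idle, 3))
```

The interesting number is the ratio between the loaded and the idle measurement: that ratio is what explodes with the other schedulers and stays near one with BFQ.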
If we increase the download speed, then the time to start Facebook becomes infinite: you have to wait forever, and until the workload stops, Facebook does not start. There are some considerations about how realistic these download speeds are, but I don't want to go into that now, because I want to show you what happens with BFQ. Same identical test, but this time we make BFQ control the I/O. Left side, again, device idle; right-hand side, noop; in the middle you will see what happens with BFQ. It will take, if I remember well, ten seconds. So it's four more seconds with respect to the idle case, and that is due to blocking issues in the virtual-memory subsystem. This is out of the control of BFQ; it can be solved, but that is a different story, not for this presentation.

Then, in this demo, we went on trying heavier workloads, and we saw that with BFQ you can increase the download speed and the start-up time of Facebook remains the same, while with the other schedulers you wait forever. In the end we saw that one of the heaviest workloads is just transferring a file to the board over a fast connection, so we repeated the test with that workload too; in the full video you can see that test as well. I don't want to show you that too; I want to move directly to another test: video playback. What happens with a soft real-time application if you have this heavy workload in the background? Now the workload for BFQ is much heavier, because there are two file copies in parallel — and believe me, they are much heavier than just transferring one file. So, again: on the left side, playback of the movie when there is nothing in the background; in the middle, BFQ; on the right-hand side, the other scheduler.
The video starts and is played back smoothly in the idle case. With BFQ it takes a little bit more to start, for the reasons that I already told you, but then, once it starts, the playback is perfectly smooth. With noop — guess what — you will wait a lot. I don't want to make you waste your time waiting, so we will skip directly to when the video finally starts. After waiting for so long, the video freezes — and it will freeze again in a moment, there — and so on. So we have a problem, a latency problem, and BFQ seems to solve it.

Here is the first objection one may raise during this presentation: OK, this happens because the board is a slow one, the storage is slow; if we move to a much faster device, these problems will just disappear. I tried it, with a Google Pixel 2, which I bought not only for this. This time I wanted to run a natural test: no scripts, no instrumentation of the code, no hacking of the device — especially because the device is mine and it's not so cheap. So I looked for a fast network to do the same test: I found a simple way to trigger updates, and I tried, but unfortunately in Italy I couldn't find a fast enough network. Which, in a sense, is good: if you want a vacation without lag issues caused by updates, consider Italy. In my case, though, I needed that problem, so I resorted to something else, a lighter workload that I could generate myself: just some file copies, one after the other. This is lighter than an update — fewer hardware and software resources, only storage — but it has in common with updates the key property that causes that high lag: intense I/O. And I showed you that the problem is I/O, because BFQ solves the problem and BFQ works only on I/O. So, if we get a high lag with file copies, then we get a high lag also with updates on a fast network. Let's see what happens. Again, I started Facebook twice: first while the phone was idle, and then with some music files being copied. In both cases I
terminated Facebook and cleared its data beforehand, for two reasons: first, to make sure that Facebook actually must do I/O to start — otherwise this wouldn't be an I/O test; second, to put the phone in exactly the same condition in both cases. I forgot to tell you that, of course, I did the same also in the tests on the HiKey. OK, so first we start Facebook, then we terminate Facebook and clear its data. Details on the phone: a brand-new Pixel 2, no change made anywhere — so no BFQ; this part only shows the problem, not a possible solution. So: terminating Facebook, clearing data, restarting Facebook — and it starts very quickly. [Answering a question] No — I mean, as far as I know, I don't think so; it's only Facebook's data.

Then the same exact test, but this time I first start some copies. How much data? Enough to make sure that the copies go on for the whole start-up of Facebook — actually much more, because I exaggerated. Then we do the same again: we terminate Facebook, clear its data and restart it, and it will start very quickly... actually, no. Let's take a coffee now, because this is much worse than the other test. The reason is that the I/O of the copies is sequential and greedy, and this is what the system likes most, because it is what gives you the peak rate. So both the operating system and the internal scheduler of the storage device like that I/O, and they prefer it over the mostly random I/O needed to start Facebook. Seven times as long as you waited before. The same considerations that I already made apply here, plus one extra bit of information: you get the same exact problem if you do a file transfer while you are starting Facebook or, vice versa, you start Facebook while doing a file transfer. So I hope I made my point here.

Second objection: OK, this happens because it's Android. So, let's try with a Linux distribution. [Answering a question] You mean the throughput of the device? I have a slide on that; just give me a few minutes and I'll show you. Someone made the same objection to me before, and that's a good point. That's another good point.
[Answering a question] Given how the file copies work, I think there was no particular memory pressure — probably; I didn't check, but I guess not. Anyway, I was interested in causing a bottleneck on the I/O without saturating the CPU.

OK, tests on a Linux distribution. I used the S benchmark suite for these tests, and I think I can show you the demo directly and talk over it. Debian with a 4.16-rc2 kernel, as you are about to see from the video, if it is running. This time the application under test was xterm. Unfortunately, with this release candidate I had problems with the X server, so I couldn't run exactly that application; instead I used a feature of the S benchmark suite which allows you to replay the I/O of the application that you want to test — and believe me, the results are the same; try it yourself with the suite, it's public. So we start. We begin by starting xterm with nothing in the background. We will try this, I guess, twice, just to have an idea of the average. As usual, the script clears the caches before starting the application under test — we want to test I/O, not caches. So, trying twice, the average should be something like 0.32 seconds, more or less. After this we retry, but this time with one file being read and one file being written in the background, so we are emulating a copy. We leave none as the I/O scheduler — now we are in blk-mq, and none is one of its schedulers. So: one run, one reader, one writer, none as the I/O scheduler. And guess why we try only once: you can take another coffee and call a friend, because this time the waiting is much, much longer. Again, the reason is that the background reads and writes are sequential and greedy; everything in the system loves that type of I/O, so it prefers it, and that kind of I/O goes on cutting in front of the mostly random I/O needed to start xterm. So xterm waits — not forever, but almost. I have to say, it's long, but in a few seconds it starts. OK, let me anticipate what is about to
happen: 48 seconds, against a bit more than 0.3. Then with BFQ, with a much heavier workload this time: 5 readers and 5 writers. We repeated the run a few times, three repetitions, to have an idea of the variability. 5 readers, 5 writers, BFQ — and this is the result: 0.5 seconds. This is a lucky iteration; on average it should be something like 0.8 or 0.9. So we have a problem also with a Linux distribution.

At this point I need to be more brief with the demos — it would take too long to show you a full tour — so I will switch to graphs now. Only a few graphs, don't worry: graphs of start-up times and of throughput. About start-up times: I tried with several applications, in particular, for this last test, with xterm, gnome-terminal and LibreOffice Writer; results are about the same with all of them. These are the results. On the x-axis there are the two workloads for which I repeated this test: 10 sequential readers, and 5 sequential readers plus 5 sequential writers, which, for the reasons that I already gave you, is the nastiest workload for latency. On the y-axis, the start-up times. There is a bar for each scheduler, and finally the red line shows the reference start-up time, the one measured when the device is idle. There is an X in case the application didn't start at all, which means that it took more than 120 seconds, a timeout fired, and the test was stopped. As you can see, for xterm BFQ is a little bit better than the other schedulers: the start-up time is about the same as if the device were idle, so you don't even realize that your system is under load. This is the nice thing. With a heavier application like gnome-terminal, the situation is very, very bad for the other schedulers, because the application doesn't start at all (note that the time scale in the second graph is rather smaller than in the first graph), while with BFQ the start-up time is only a little more than what you get when the system is idle.

And finally the question: OK, this seems to be wonderful, but what do you make me pay in terms
of throughput? I tested it, and the answer is: apparently, nothing. This test is with the same workloads, the sequential ones, plus random workloads to widen the coverage of the test. As you can see, performance is about the same for all schedulers. There are some differences; I don't want to go into those details, I just wanted to show you that you don't pay anything. There are some X's here too: basically, the system became so unresponsive that the script didn't manage to stop the workload, so the results were unreliable. What else — yes: this holds with these steady workloads; with other, more complex workloads there are some regressions, in some cases around 20 percent of throughput loss, and this is something we are working on, trying to understand exactly where the problem is.

OK, now I'll show you results on one more system, then on one last system, and then it will be over. This system is interesting. Credits: Linus Walleij ran these tests, on this PogoPlug with an SD card plugged into it, so the test was on that SD card — again, flash storage. This case is interesting because BFQ is not that good here. I will show you this with only one graph, because with the other applications the results, and the information they carry, are about the same. As you can see, it takes time also with BFQ — a lot of time with respect to the idle case; as usual BFQ is much better than the others, but anyway you have to wait. Fortunately, the problem was somewhere else: the bottleneck was not, or at least not only, the storage, but also the CPU. And I will show you tests with that exact SD card, but this time plugged into a much faster system, a laptop. Again tests run by Linus, who ran them with several cards; performance is about the same across cards, and it's the same with gnome-terminal or LibreOffice Writer, so I'll show you results only for that SD card, and for xterm and gnome-terminal. To sum up, the situation is just much worse for all the schedulers but BFQ. With BFQ, again, for xterm you have about the same start-up time as if
the device was idle. Some other scheduler makes it with one of the two workloads, but if we move to gnome-terminal, it's a disaster, as you can see.

For those of you who are still awake: the first part is done, and now I want to move to the second one, trying to anticipate some possible doubts that you may have. The first question: I showed you a disaster, but where is it in real life? The second question: why does only BFQ work? As for the first question: I showed you these terrible results, but we use our personal devices every day and they are perfectly responsive, almost always. Why? Just because storage is almost always underutilized — that's the only reason. But occasionally we do have trouble: if the system happens to be reading or writing some large file, doing some large compilation, running some updates, and so on. And if your use case is above this average, then you do have problems — probably some of you know it — because the behavior of the system changes dramatically, and exactly what I showed you starts to happen: in some cases it becomes almost impossible to use the system. And there is no practical solution, nothing ready, apart from BFQ — if you have BFQ in your system, and if you know that you have it. Our usual perception of this is that it is somehow normal, inevitable: we just wait, do something else, and wait for the storm to go away. But it is not inevitable, and some companies have turned this into a business opportunity, because they have proposed systems that guarantee the absence, or a very low frequency, of these lag issues, and so give the user a much better experience.

OK, this was for personal use. About services: again, services implemented by good service providers don't usually suffer from high latency, but the systems that implement these services may well be subject to intense I/O. So why does everything work? Just because the typical solution adopted by the engineers on the other side is over-provisioning: the idea is just to
guarantee that the system is always, or almost always, underutilized. It works; it has only one problem: it is very costly, in terms of extra resources, energy and, ultimately, money. In Italy we have a saying, which I found translated into English as something like: he who has not a good brain ought to have good legs. So, in this case: make your legs stronger, or add more legs, if you don't have a brain. In some cases you can add a little bit of brain, because, as an additional solution, there is ad hoc scheduling or tuning that improves things; but usually this is very rigid, and it just fails if the workload changes with respect to what that solution was tightly tailored for.

And the very last point: why do all the other schedulers fail miserably while only BFQ survives? Because BFQ implements a combination of three techniques. First, it performs proportional sharing of the throughput, in a rather accurate way: every process, or group of processes, is associated with a weight, and receives a share of the throughput proportional to its weight. Second, detection of the I/O to privilege. The limited resource here is bandwidth: bandwidth is limited, so if you want to guarantee that an application takes the same time to start as when the device is idle, you have to give that application the same throughput as when the device is idle — you have to privilege that application. To do that, you must first discover which application it is, which process must receive much more throughput than the others, and then you have to actually guarantee a higher throughput to that process. Detection and enforcement are done, basically, by just raising the weight of that process. Third, plugging of I/O dispatch, also known as device idling: if the process in service momentarily has no pending I/O, then BFQ does not dispatch I/O from the other competing processes. BFQ does so if the process in service does sync I/O — the typical victim of this problem, because such processes do I/O, then wait for
the completion of that I/O, and then do new I/O; so they frequently have short time intervals during which they are deceptively idle. This plugging prevents low-weight I/O from cutting in front of high-weight I/O — that's the idea. You have a problem when the low-weight I/O is a type of I/O that the rest of the system likes, such as sequential, greedy I/O; that's the extreme case, but there are other cases in the middle. In that case, while your high-weight process is idle, that other I/O would just go on cutting in front of the I/O of your high-weight process, because the rest of the system wants that I/O to go first, so as to achieve a higher throughput. The only way to prevent this is to plug the dispatch. Of course, exactly because of what I just said — you are not feeding the storage device that sequential I/O — you risk lowering the throughput; that's one reason. The other reason is that modern storage devices, to reach their peak rate, want their internal queues to be constantly non-empty; if you plug I/O, you tend to lower the filling of those queues, and so, again, the throughput. So the actual challenge here is doing this plugging for as little time as possible. But I won't tell you more about this, because our time is finished. So, thank you. If you have questions, I'll try to answer — avoiding the tougher ones.

[Q&A] The easy part is that there is just a flag, set somewhere else, marking that I/O as sync; but there are cases where that flag is misleading. In most cases it is not. — No, not yet, but yeah, why not. — Absolutely not, I didn't run those tests here. — The question was whether I tested any other media besides SD and eMMC. You mean besides that combination, or other media in general? Oh yeah, absolutely: any rotational device, RAIDs, SSDs — whatever I found that could store something, I tried, and it was roughly the same. Actually, the new tests are the ones with MMC, because the support for blk-mq is a new thing in the MMC subsystem. The what? Yes:
blk-mq, the new version of the block layer, is supported by the MMC subsystem — that is, the subsystem that handles eMMC, SD cards and all the rest. This support is available only from 4.16 onwards; that's why we have run these tests only recently, three weeks ago. OK, thank you.
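As a closing illustration, the proportional-sharing idea described above can be sketched as follows. This is a toy weight-proportional dispatcher, not BFQ's actual B-WF2Q+ algorithm, and the queue names, weights and numbers are made up; it just shows how raising a process's weight translates into a proportionally larger share of the dispatched requests.

```python
# Toy sketch of weight-proportional dispatch, in the spirit of BFQ's
# proportional sharing.  BFQ really uses B-WF2Q+, an accurate fair-queueing
# algorithm; here we only convey the idea: always serve the queue whose
# normalized service (service / weight) is smallest.

def dispatch(queues, weights, budget):
    """Dispatch `budget` requests from `queues`, proportionally to `weights`.

    queues:  dict name -> number of pending requests
    weights: dict name -> scheduling weight
    Returns a dict name -> requests actually served.
    """
    served = {name: 0 for name in queues}
    for _ in range(budget):
        # pick the backlogged queue with the least normalized service so far
        name = min((q for q in queues if queues[q] > 0),
                   key=lambda q: served[q] / weights[q])
        queues[name] -= 1
        served[name] += 1
    return served

# An interactive app (its weight raised, as BFQ does on detection)
# competing against a background copy.
queues = {"app": 100, "copy": 100}
weights = {"app": 9, "copy": 1}
print(dispatch(queues, weights, 20))  # the app gets ~9x the service of the copy
```

With equal weights the two queues would split the 20 dispatches evenly; with the 9:1 weights the app receives nine times the service, which is exactly the mechanism by which a detected interactive process gets the throughput it would have on an idle device.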