The first part is the background. This part is about multicore trends. These days, multicore processors are widely integrated into embedded devices, and the number of cores keeps increasing. This figure shows the multicore trend across server and desktop CPUs as well as application processors (APs): multicore processors now dominate the embedded device market, including the AP market, and they can deliver the strong performance that rich applications demand, even on personal computing devices.

A representative example is Tegra 3, a multicore processor made by NVIDIA. Its architecture pairs four Cortex-A9 cores with one extra companion core, also a Cortex-A9 design, clocked at 500 MHz. Why does Tegra 3 adopt this architecture? A multicore processor can provide high performance, but most tasks on smart devices, such as audio playback, background web syncing, and active standby mode, do not need that much speed. By running these light workloads on the low-power companion core instead of the main cores, Tegra 3 can save a lot of power while still providing full performance when it is needed.

Next, let me review how runnable tasks have been managed in the Linux scheduler. The earliest schedulers used round-robin scheduling, which wasn't complex, but was simple and fast at that time. Linux kernel version 2.2 introduced the idea of scheduling classes, permitting separate scheduling policies for real-time tasks, non-preemptible tasks, and non-real-time tasks; the 2.2 scheduler also added support for symmetric multiprocessing (SMP). However, through version 2.4 the scheduler still used a single runqueue protected by a global scheduler lock, so the cost of picking the next task grew with the number of runnable tasks, and it did not scale well on SMP systems. Kernel 2.6 then introduced the O(1) scheduler, which keeps a per-CPU runqueue with two priority arrays, active and expired, so that selecting the next task takes constant time regardless of how many tasks are runnable.
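The O(1) scheduler's active/expired arrays can be sketched with a toy model. This is our own illustrative Python, not kernel code: the real scheduler keeps 140 priority lists with a bitmap per array, while this sketch uses one queue per array and ignores priorities entirely.

```python
from collections import deque

class O1Runqueue:
    """Toy model of the O(1) scheduler's per-CPU active/expired arrays."""

    def __init__(self):
        self.active = deque()   # tasks that still have timeslice this epoch
        self.expired = deque()  # tasks that have used up their timeslice

    def enqueue(self, task):
        self.active.append(task)

    def pick_next(self):
        # When every task has used its timeslice, swap the two arrays.
        # This swap is the trick that makes the scheduler O(1): no
        # per-task work is needed to start a new epoch.
        if not self.active:
            self.active, self.expired = self.expired, self.active
        task = self.active.popleft()
        self.expired.append(task)  # model: one pick = one used timeslice
        return task

rq = O1Runqueue()
for t in ("A", "B", "C"):
    rq.enqueue(t)
print([rq.pick_next() for _ in range(6)])  # → ['A', 'B', 'C', 'A', 'B', 'C']
```

The array swap at the epoch boundary is the same idea DWRR later reuses for its round-active and round-expired queues.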
Finally, version 2.6.23 introduced the Completely Fair Scheduler, CFS, which replaced the priority arrays with a red-black tree, so task insertion and removal take O(log n) time. CFS allocates CPU time in proportion to each task's weight, and it is considered the best scheduling policy so far. This means, implicitly, that the fairness of CFS is superior to the other schedulers on a single core. So then, how does CFS do in a multi-core environment? Of course, since CFS has many good features for an SMP processor, CFS is still considered smart, but despite such advantages, Linux can fall short of expectations when it comes to fair-share multi-core scheduling. So in the next part, I'll talk about the details of CFS in the multi-core environment.

To analyze CFS in the multi-core environment, first let's consider its load-balancing mechanism. This slide shows CFS load balancing. In CFS, every core has its own runqueue, and load balancing starts under certain conditions: the first condition is that the periodic balancing interval expires, and the second condition is that a core's runqueue becomes empty. Even then, CFS does not blindly equalize the queues; because task migration is not free, the scheduler first estimates whether moving tasks would actually improve the balance. Migration may involve some overhead, and if the balancing period is shorter, the overhead can be larger. In fact, in CFS, overall performance is more important than global fairness: that is, CFS avoids migrating tasks, in order to reduce migration cost as much as possible, according to these conditions.

This is an example of CFS load balancing. In this example there are five tasks, T1 to T5. T1's weight is 1024, and the weight of each of T2 to T5 is 335, so T1's weight is about three times bigger than the others'. T1 is running on Core 1, and the others are running on Core 2. According to the previous formulas, the scheduler will decide whether to perform load balancing or not.
But in this example, the scheduler decides not to perform load balancing, because migrating any single task would not actually reduce the imbalance. As a result, although T1's weight is only about three times bigger than the others', the runtime of T1 is four times longer than the other tasks'. Therefore, in this case CFS fails to achieve fairness, and even if exactly the same amount of load is given to each core, this can happen again. Thus, our team has researched how to remedy this situation, and we carefully concluded that it is time to rethink the scheduler for embedded Linux systems.

But we face the question: is global fairness the most important factor in multi-core? It is hard to answer easily, so we performed some experiments to find the answer. Literally, perfect load balancing ensures that all cores have the same amount of work, so the execution time on each core is almost the same. In other words, a superior load-balancing scheduler has outstanding global fairness. Therefore, the first requirement of an efficient multi-core embedded Linux scheduler is, of course, global fairness. But as you know, task migration costs some overhead, and it may also increase the cache miss rate. So the second consideration of a multi-core scheduler is to maximize cache effectiveness. However, estimating cache effectiveness is very difficult, because it is influenced by the workload, the CPU architecture, and so on. But in the case of embedded devices like smartphones, we can find some common features like these. So we think it is possible to design a new scheduler that efficiently supports multi-core embedded devices. But as you know, that takes sufficient time to develop, and it is very hard work. So among those requirements, we decided to research improving global fairness above all.

From this slide, I'll talk about our research results from last year. Last year, our team tried to analyze the current scheduler and to improve its global fairness.
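The decision in the T1–T5 example can be mimicked with a toy model. The 1.25 factor below stands in for CFS's default 25% imbalance threshold (`imbalance_pct = 125`), and the "would the move just flip the imbalance" rule is our simplification of the kernel's actual `load_balance()` logic, not a quote of it:

```python
# Toy model of a CFS-style load-balance decision for the T1..T5 example.
# Assumptions (ours, for illustration): a 25% imbalance threshold, and a
# rule that refuses a migration that would merely reverse the imbalance.

def should_migrate(src_load, dst_load, task_weight, imbalance_pct=1.25):
    """Would moving one task of `task_weight` from src to dst help?"""
    if src_load < dst_load * imbalance_pct:
        return False  # imbalance is too small to bother balancing
    # Refuse moves that leave the cores at least as unbalanced as before.
    new_gap = abs((src_load - task_weight) - (dst_load + task_weight))
    return new_gap < abs(src_load - dst_load)

core1 = [1024]                # T1
core2 = [335, 335, 335, 335]  # T2..T5
src, dst = sum(core2), sum(core1)  # the busier core is the source

# 1340 vs 1024 exceeds the threshold, but moving any 335-weight task
# would give 1005 vs 1359 -- a *wider* gap -- so nothing is migrated.
print(should_migrate(src, dst, 335))  # → False
```

Under these assumptions the balancer stays idle, matching the slide: T1 keeps Core 1 to itself while T2–T5 share Core 2, so T1 runs four times longer than each of the others despite having only three times the weight.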
So we adapted a previous research result called distributed weighted round-robin, shortly called the DWRR algorithm, to the CFS scheduler, and compared CFS and DWRR by conducting various experiments. So in this part, first I'll explain the DWRR algorithm, and in the next part I'll show the comparison experiment results.

DWRR is a very intuitive algorithm, and its main goal is, simply and obviously, to achieve global fairness on multi-core. The DWRR algorithm was proposed by Li as a scalable multiprocessor fair-share scheduling algorithm. It schedules tasks via weighted round-robin on each core, and of course it performs round balancing to ensure that all tasks go through the same number of rounds. This is the main architecture of DWRR. As you can see, each core has two run queues: one is the round-active queue, and the other is the round-expired queue, like the O(1) scheduler's arrays; and each run queue is implemented with a red-black tree.

There are three important definitions in DWRR. First, the round slice of a thread is defined to be w times B, where w is the thread's weight and B is a system-wide constant called the round slice unit. The second concept is the round: a round is the shortest time period during which every thread in the system completes at least one of its round slices. The round slice of a thread therefore determines the total CPU time that the thread is allowed to receive in each round. When a thread uses up its round slice, we say that the thread has finished the round. DWRR removes it from the CPU's round-active queue to prevent it from running again, and the thread is moved to the round-expired queue; this operation maintains local fairness. When all threads on a CPU have finished the current round, DWRR searches other CPUs for threads that have not finished it and moves them over; this action is round balancing. If none is found, the CPU increments its round number and allows all local threads to advance to the next round with a full round slice.
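The round-slice and round-balancing bookkeeping described above might be sketched as follows. The class and function names are ours for illustration, not from Li's implementation; we use plain lists instead of red-black trees, and migration cost is ignored, as in the talk's own example.

```python
# Toy sketch of DWRR bookkeeping: round slice = w * B, round-active and
# round-expired queues per CPU, and round balancing when a CPU runs dry.

B = 1.0  # system-wide round slice unit (assumed value)

class CPU:
    def __init__(self):
        self.round = 0
        self.active = []   # list of (name, weight, remaining_slice)
        self.expired = []  # threads that finished the current round

    def charge(self, name, time):
        """Charge `time` units of CPU to thread `name`; expire it
        when its round slice (w * B) is fully used."""
        for i, (n, w, left) in enumerate(self.active):
            if n == name:
                left -= time
                if left <= 0:
                    self.expired.append((n, w, w * B))  # refill for next round
                    del self.active[i]
                else:
                    self.active[i] = (n, w, left)
                return

def round_balance(cpu, others):
    """When cpu's round-active queue empties, steal a thread that has
    not finished the round from another CPU; otherwise advance the round."""
    if cpu.active:
        return
    for other in others:
        if len(other.active) > 1:  # leave the currently running thread alone
            cpu.active.append(other.active.pop())
            return
    cpu.round += 1
    cpu.active, cpu.expired = cpu.expired, []

# Mirror of the upcoming slide example: A, B on CPU 0; C on CPU 1.
cpu0, cpu1 = CPU(), CPU()
cpu0.active = [("A", 1, 1.0), ("B", 1, 1.0)]
cpu1.active = [("C", 1, 1.0)]
cpu1.charge("C", 1.0)        # C uses its whole round slice first
round_balance(cpu1, [cpu0])  # CPU 1 steals B instead of going idle
print([t[0] for t in cpu1.active])  # → ['B']
```

The key property is visible even in this sketch: a CPU never idles while some thread elsewhere is behind on the current round, which is exactly what the formal round definition demands.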
These operations maintain global fairness. In the next slide, I'll show you an example of DWRR operation.

This shows the DWRR operation. First, assume two CPUs and three threads A, B, and C, each with weight one and a round slice of one time unit. One important assumption is that task migration cost is ignored. At time 0, A and B are in the round-active queue of CPU 0, and C is in the round-active queue of CPU 1. At time 1, both A and B have run half a time unit, and C has run one time unit, so C has used up its round slice. C moves to round-expired on CPU 1, and since CPU 1's round-active queue becomes empty, CPU 1 performs round balancing and moves B over; it cannot take A, because A is currently running. At time 1.5, both A and B have run one full time unit, so they move to the round-expired queues of their CPUs. Both CPUs then perform round balancing but find no thread to move, so they switch round-active and round-expired and advance to round one. Therefore, with DWRR, we can finish the job at time 1.5.

But how about CFS load balancing? In the case of the CFS algorithm, load balancing will not be performed, because all the weights are the same and moving a task would not reduce the imbalance, so the condition to perform load balancing is not met. Thus A and B run only on CPU 0 and C runs on CPU 1, and the total time needed is 2 time units. Therefore, DWRR is able to solve the previous CFS problem.

Q. Basically it moves B over to CPU 1, right? After round 0, it looks like it balanced B over. I'm confused: in the first figure, A and B are on the same CPU, correct, and C is on CPU 1, and then you migrate B over to CPU 1. If this continued, would B actually bounce back to CPU 0 after this whole slide? If A is running and you have C and B together, isn't it possible you could be ping-ponging B between CPU 0 and CPU 1?
A. I'm sorry, maybe I don't understand what you're talking about.
Q. The second figure.
A. Where is the figure?
Q. Let's say, where the arrow is starting: A and B are sharing a CPU, C finishes its round or whatever, but C is still in the running state, correct?
A. Yeah.
Q. But you're saying it just finished a round, you do a load balance, and you say, oh, I can move B onto that CPU. So A gets more CPU, and A finishes before C and B. Now you finish off A on its own CPU, and on CPU 1 you have C and B. Now, if we were to continue the rounds, A would finish its round, and then you have C and B sharing a CPU. What if, as this continued, the balancer moved B again, back to CPU 0? So basically at the end of the slide you could replace A and B with C and B, and C with A, and what if that causes a ping-pong? Maybe I don't understand what's going on.
A. But we assumed that A, B, and C each just need one round slice of one time unit.
Q. Right, I understand you're trying to give everyone more fairness or whatever. But the problem is that this could cause a lot more ping-ponging of tasks, and you're going to hit cache problems, so you may actually slow things down.
A. What you're trying to ask is about the cost of switching from one CPU to the other?
Q. Yes. Are you taking into account the cost of the migration itself?
A. Yeah, obviously you're right, but we did not consider that, because this research is the first step of our project, so we just improved the global fairness. The cache miss rate and some other things were left for later.
Q. There's a reason why CFS does not try to be truly balanced. There's actually a reason for it. That's why I'm saying you can make it truly balanced, but you might not actually get what you expect.

So, is DWRR the best solution for a multi-core system? We think DWRR obviously has pros, but because of its O(1)-like architecture, its assumptions are too critical to be neglected. DWRR also has obvious weak points, like not considering migration cost, a higher cache miss rate, and so on. So we have to retain its pros, global fairness, while improving its limitations, and our team has tried to do so continuously.

So far, I've explained the DWRR algorithm, its key ideas, and its limitations. Now I'll show you the comparison results between CFS and DWRR. The test environment is like this: our target board has a quad-core Cortex-A9 CPU, and we used Linaro Ubuntu 12.04 with kernel version 3.1.41. The compared schedulers are, of course, CFS and CFS with DWRR.

This is the first, practical experiment comparing CFS and DWRR. We ran 5 tasks on 4 cores; the tasks just perform an infinite loop, and we used the top utility to monitor CPU allocation. This figure shows the snapshots of top for each scheduler. As you can see, the CPU utilization of the tasks under DWRR achieved nearly perfect fairness. This is a natural result, on reflection of DWRR's operation, so we needed to perform more experiments.

The next benchmark shows the performance comparison when the CPU cores are busy. In this case we used video playback; the running time of the video is 150 seconds, and the video decoding times for each case are shown here. As you can see, in this case DWRR also has the best performance.

Another test compares performance with a database benchmark. Recently, most smart devices have their own database engine in their platform, and the use of databases keeps growing. So we tested a database benchmark using sysbench in OLTP test mode, which does a lot of file I/O; the I/O pattern is random, and the data size is small. I think this pattern is the common case on smart devices. As you can see, in this case DWRR's transactions per second is the second largest among them. An interesting point is that CFS with its scheduling granularity set to 1.25 milliseconds, which is equal to DWRR's round slice unit, shows the best performance. Consequently, for a database workload with many small random accesses, a shorter time slice gives better performance, but DWRR adds some overhead on top of that, I think.

The final benchmark is a JavaScript benchmark. Web applications are in the spotlight today because of their independence from any platform and the powerful HTML5 functionality based on JavaScript. So it's important to benchmark HTML5, especially the JavaScript engine, and we tested it using SunSpider, which is a famous benchmark suite. In this benchmark, DWRR has good performance compared with CFS. But I think this result is actually not so meaningful, because the JavaScript engine does not use multiple threads. Anyway, in this case DWRR is quite good, I think.

Q. Are you also measuring the fairness of the different schedulers?
A. No, it's very hard to measure fairness directly. Actually, in the first benchmark we showed that DWRR has the best fairness; with the other experiments, we just wanted to know whether DWRR is really good in the common, practical cases.

Okay, the conclusion. As previously mentioned, a weight-based algorithm on multi-core like CFS is not always able to achieve fairness in practice, even if exactly the same amount of load is given to each core. DWRR can be a new trial to improve scheduling in multi-core systems, but DWRR also has several problems, so it is not the best solution. Consequently, our conclusion is that rethinking the multi-core scheduler is worthwhile now, and we hope to develop a new scheduler which can fulfill the requirements of embedded multi-core systems; this is our team's future work. Okay, thank you. Any questions?

Q. I just want to be sure. It sounds like you're trying one big task and spreading its work across the processors, and of course your scheduler is going to work better for that one big task. But what about the problem where a typical system is doing 20 things at once, and some of them have different priorities and whatever? How do you solve that problem with this model?
A. Okay. Actually, our organization's sponsor is our government, so our project is focused on a special case, and we just tried scenarios like this. But we have to run many more experiments in many more environments and cases, and when we get some results, we will share them.
Q. It sounds like it needs to be more of a policy that could be adjusted in the kernel, rather than changing the core of the scheduler.
A. Okay, okay. Another question? Thank you for listening.