Today, I would like to talk about the interrupt storm detection feature. This is the agenda. First, I will talk about the background of the development of the interrupt storm detection feature. After that, I will talk about the interrupt storm detection feature itself.

Let me introduce myself. I am Kento Kobayashi from Sony Corporation. I work on the Linux kernel and device drivers for Sony products such as aibo, digital cameras, Blu-ray Disc recorders, etc. Today I will introduce our interrupt storm detection feature for the Linux kernel.

Now let me begin my presentation. First, I will talk about interrupt storms. I think most Linux kernel engineers have seen this problem at least once. When an interrupt storm occurs, an interrupt handler is executed by the CPU continuously in a short time, like in this figure. An interrupt storm causes the system to hang up, because the interrupt handler uses the CPU continuously, so other tasks can't use the CPU. So it is hard to debug, because the console is not responding. To debug an interrupt storm, we need to identify which IRQ causes the interrupt storm so that we can check the device driver and hardware. But it is very hard to identify.

Interrupt storms mainly occur in two cases. Case 1 is an unhandled interrupt, known as a spurious interrupt. Case 2 is a high-frequency handled interrupt.

An unhandled interrupt is where an interrupt handler does not handle the hardware interrupt. The interrupt handler should clear the interrupt status and do the necessary post-processing for each device driver when an interrupt is raised. But in this case, the interrupt handler does not do those actions. This case is caused by problems in the device drivers. Interrupt handlers do nothing if the raised interrupt does not belong to them. So the interrupt is not cleared and will be raised again and again. When an IRQ is shared by multiple device drivers, this case can occur. Generally, with an IRQ shared between multiple device drivers, each device driver should ignore interrupts that do not belong to it. But
if the interrupt handler incorrectly ignores the interrupt, it will not be handled by anyone. So an interrupt storm will occur. The other case is when the device driver forgets to register an interrupt handler. Then the raised interrupt is also not handled, so an interrupt storm will also occur.

To debug a case 1 interrupt storm, we can use the spurious interrupt handling kernel feature. It has been implemented since the 2.6.10 kernel. When an interrupt storm is detected by this feature, it disables the interrupt and prints the IRQ number that raised spurious interrupts 99,000 times. When this feature detects 99,000 spurious interrupts, this message is displayed. So we can clarify which IRQ number causes the interrupt storm. After that, as you know, we can work out which interrupt handler is registered on each IRQ number with the /proc interface. In this case, the foo interrupt handler is registered. So we can investigate which device driver registers the foo interrupt handler by checking the source code. Then we can work out the cause of the interrupt storm.

Next, I will briefly explain the mechanism of this spurious interrupt handling feature. This is a simplified explanation of spurious interrupt handling. First, the first spurious interrupt occurs at T0. If the next spurious interrupt occurs before T0 plus 100ms, the counter is incremented. If the next spurious interrupt occurs before T1 plus 100ms, the counter is also incremented. If this situation continues and the counter reaches 99,000, the spurious interrupt handling feature will disable the IRQ and print the IRQ number. If a spurious interrupt does not occur within 100ms of the previous spurious interrupt, the counter is cleared to 0. That was the mechanism of the spurious interrupt handling feature.

Next, I will talk about the case 2 interrupt storm. This is the high-frequency handled interrupt case. In this case, the interrupt is raised continuously even though the interrupt handler has handled the interrupt. That means the interrupt handler does the necessary post-processing of the interrupt, for example, clears the interrupt, but the interrupt is still raised
continuously. This happens due to hardware or device driver problems. For example, a hardware design mistake or design change; this usually occurs at the start phase of development. And if we misconfigure the interrupt trigger, an interrupt storm can occur. For example, if we set a level-low trigger for a device when the correct setting is an edge-high trigger, the interrupt will be raised continuously, whether or not the interrupt status is cleared. In addition, if the device driver forgets to clear the interrupt cause, of course, an interrupt storm occurs.

To debug a high-frequency handled interrupt, we can use several approaches. First, we can use the non-maskable interrupt (NMI) feature. NMI can interrupt the CPU and dump CPU registers and a backtrace even if an interrupt storm is happening. But NMI has some problems. NMI is not always usable on your development board. For example, NMI may be used for another purpose, so it can't be used for debugging. Or NMI may not be connected on the development board. Also, NMI can dump CPU registers and a backtrace, but it can't detect an interrupt storm. So it takes a long time to recognize the problem as an interrupt storm.

Another way to debug a high-frequency handled interrupt is to use JTAG. I think engineers listening to this presentation know about JTAG. With JTAG we can check CPU registers and memory contents. And with this, we can work out which interrupt handler is running often. But JTAG also has some problems. Some development boards don't have JTAG interfaces. And when they do, we need to prepare config files, and that can take a long time. In addition, JTAG equipment is a little bit expensive.

Another way to debug a high-frequency handled interrupt is to use pstore ftrace. pstore is a framework to save data into persistent memory. The contents of persistent memory remain after reboot. ftrace is a function tracer. ftrace can record function call history into a buffer. So pstore ftrace can record function call history into persistent memory. If you use pstore ftrace, you need persistent memory on your development board. You also need to enable the kernel
config for pstore ftrace and build the kernel. After booting up with the pstore ftrace enabled kernel, enable pstore ftrace with this command before the interrupt storm occurs. After the interrupt storm is reproduced, reboot your board by pressing the reset button or similar to force a reboot. After reboot, you can check the function history from just before the reboot in the /sys/fs/pstore directory. Then you can debug the high-frequency handled interrupt. But pstore ftrace also has some problems. Persistent memory is not always available on your development board. And ftrace affects performance, because ftrace records each function call in memory.

I have introduced some ways to debug case 2, the high-frequency handled interrupt, but each of them has some problems when used for debugging an interrupt storm. In contrast, case 1, the unhandled interrupt, is easy to debug because we can use the spurious interrupt handling feature. So we decided to develop an interrupt storm detection feature to resolve the problems of each of these debugging approaches.

Now I will talk about our solution, the interrupt storm detection feature. This feature detects an interrupt storm if the number of interrupts exceeds a threshold per 100ms, like in this figure. If a high-frequency interrupt happens in this 100ms, it is detected as an interrupt storm. After detecting an interrupt storm, this feature will print the IRQ number to the kernel log. The user can set the threshold value at which an interrupt storm is detected. The user can select whether or not to disable the IRQ after an interrupt storm is detected. The user can also select whether or not to invoke a kernel panic after an interrupt storm is detected.

Now I will explain how it works in detail. In this case, the user sets 1000 as the threshold. The first handled interrupt occurs at T0. The feature increments the counter and records the time of the first interrupt as the start of the measurement. If the next interrupt occurs within 100ms, the counter is incremented. If the counter doesn't reach the threshold, in this case 1000 times in 100ms, an interrupt storm has not occurred, so this feature will not detect an interrupt storm. In this
case, interrupts occur 20 times per 100ms, which does not reach the threshold. So this case is not an interrupt storm. After that, an interrupt occurs again after T0 plus 100ms. This T21 becomes the next start time. The feature then sets the counter to 1 and saves T21 as the first interrupt time. If the counter reaches the threshold value within 100ms, an interrupt storm is detected. That is the detailed mechanism of our interrupt storm detection feature.

Now I will talk about how to use this interrupt storm detection feature. These are the kernel configs. The first setting is whether or not to enable this feature. The default is n, which means the feature is disabled. If you want to use this feature, please change it to y, which means the feature is enabled. The next setting is the threshold value at which this feature detects an interrupt storm. The default value is 100,000 times per 100ms. This config applies to all IRQs.

The threshold setting must be considered carefully. If you set it to a small value, the feature may report an interrupt storm even though the interrupts are not actually a storm that causes a hang-up or exception. But if you set it to a big value, a hang-up may occur before an interrupt storm is detected. On the next page, I will talk about setting the threshold.

The user can also set the threshold for each IRQ number by using the /proc interface after system boot with this command. The threshold is different for each system and each device. A high-frequency handled interrupt that does not cause a hang-up is not a problem. So we allow the user to set the threshold for each IRQ number. The threshold value for detecting an interrupt storm is different for each system and device, so we need to think about a suitable value for each system or device.

One example of setting the threshold is to first set a value in the kernel config that is large enough not to detect an interrupt storm, for example, 50,000. After booting up, check how many interrupts were raised in the last 100ms with this command. For example, if the counter of IRQ 15 showed 2,500, then set 10,000 for each
problematic IRQ number. This is one example of setting the threshold.

Next, I will talk about how to debug with this feature. After an interrupt storm is detected, this feature displays this message. Then we know the problematic IRQ number. After that, check /proc/interrupts to find out which interrupt handler is registered on the problematic IRQ number. After that, you can debug the device driver and hardware, because you know the interrupt handler's name.

In addition, with this interrupt storm detection feature, the user can decide what happens after an interrupt storm is detected. The user can decide whether or not to disable the IRQ after an interrupt storm is detected. This is very useful for debugging, because the console may stay responsive when the interrupt storm is stopped before the system hangs up. So we can check /proc/interrupts after the interrupt storm is detected. After the interrupt is disabled, the system may not work properly. But in an urgent situation, the user should enable this feature. The user can also decide to invoke a kernel panic after an interrupt storm is detected. This is useful for debugging because it stops the system and shows a call trace.

But there are some notes for these features. These features will disable the interrupt or stop the system, so they have a big impact on the system. Especially with the disable interrupt feature, the system appears to be working fine but the IRQ is disabled, so some other problem may occur. So you should disable those features after you have identified the IRQ number.

This is how to use these other features. We can set these features using the kernel config and the /proc interface. The kernel config sets the default for all IRQs. So if you set Y, which means enabling those features, it applies to all IRQs. With the /proc interface, the user can set each IRQ number individually. So if you want to enable a feature only for a specific IRQ number, you set N in the kernel config and after that you enable it through the /proc interface.

In addition, the user can access some useful information for debugging. We can see this information for each IRQ. We can see the threshold value, the number of interrupts per unit
time currently observed, the setting to disable the interrupt after an interrupt storm is detected, the setting to invoke a kernel panic after an interrupt storm is detected, and the maximum number of interrupts detected per unit time.

This is the sequence diagram for when an interrupt is raised from hardware, before adding our feature, on the arm64 architecture. The spurious interrupt handling feature runs here. The noirqdebug flag can be set by a module parameter. That flag is false by default. So if you don't set the noirqdebug flag by the module parameter, the spurious interrupt handling feature works. That place is after the device driver's interrupt handler has been executed. So at that place, we can know whether this interrupt was handled or not. It happens in the architecture-independent part.

And this is the sequence diagram for when an interrupt is raised from hardware after adding the interrupt storm detection feature. It is executed after the spurious interrupt handling feature, because at this place we can know whether this interrupt was handled or not. This is also implemented in the architecture-independent part.

Now I will talk about an example on an actual product with the interrupt storm detection feature. The problem was an exception caused by a soft lockup at do_softirq on our development board for a product. Let me give the conclusion first: this problem was caused by an interrupt storm.

To debug this actual problem, I first debugged the soft lockup. To debug a soft lockup, we first use CONFIG_LOCKUP_DETECTOR and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC. After building the kernel with those kernel configs, boot up the system with the built kernel. After the system is booted up, enable softlockup_panic under /proc/sys/kernel with this command. After that, when the problem is reproduced, the call trace of the lockup is shown. So, check the call trace. In this case, a soft lockup at do_softirq was displayed many times. There are many possible causes for this result. For example, a deadlock with a spinlock, a non-preemptible task running for a long time, an interrupt storm, and so on. To narrow down the problem, I tried the interrupt storm detection feature. I enabled the
kernel configs, set the threshold to 10,000 times per 100ms, and built the kernel. Then this message was shown with the interrupt storm detection enabled kernel. So I could see the IRQ number that caused the interrupt storm. Next, I checked which device driver's interrupt handler was registered on IRQ 387 with /proc/interrupts. In this case, the PCIe device driver's interrupt handler was registered on IRQ 387. So we investigated the PCIe device driver, the PCIe hardware, and the devices connected via PCIe. As a result, we worked out that this issue was caused by the firmware of an FPGA connected through PCIe. After we fixed the FPGA firmware, the problem no longer occurred.

Finally, I will talk about the limitations of the interrupt storm detection feature. This feature can't identify the device driver when drivers use a shared interrupt. We can only know the IRQ number when an interrupt storm is detected. So if that IRQ is shared, we need to do additional investigation to identify the device driver. One more limitation is that this feature can't detect an interrupt handler that occupies the CPU for a long time. In that case, the interrupt is not high frequency, so this feature can't detect it as an interrupt storm.

In the future, I will submit this feature to the Linux kernel mainline. Thank you for listening to my presentation.
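
Editor's note: the windowed counting described in the talk (record the time of the first handled interrupt, count interrupts for 100ms, start a new window once 100ms has passed, report a storm when the count reaches the threshold) can be sketched in user-space C roughly as below. This is only an illustration under stated assumptions, not the actual patch: the struct, the function names, and the millisecond timestamps passed in by the caller are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define STORM_WINDOW_MS 100u /* measurement window mentioned in the talk */

/* Hypothetical per-IRQ bookkeeping; field names are illustrative only. */
struct storm_state {
    uint64_t window_start_ms; /* time of the first interrupt in this window */
    uint32_t count;           /* handled interrupts seen in this window */
    uint32_t threshold;       /* count per window that is treated as a storm */
    uint32_t max_count;       /* largest per-window count observed so far */
};

static void storm_init(struct storm_state *s, uint32_t threshold)
{
    s->window_start_ms = 0;
    s->count = 0;
    s->threshold = threshold;
    s->max_count = 0;
}

/*
 * Call once per handled interrupt with the current time in milliseconds.
 * Returns true when the count within the current window reaches the
 * threshold, i.e. when the sketch would report an interrupt storm.
 */
static bool storm_note_irq(struct storm_state *s, uint64_t now_ms)
{
    if (s->count == 0 || now_ms - s->window_start_ms >= STORM_WINDOW_MS) {
        /* First interrupt ever, or the window expired: start a new window. */
        s->window_start_ms = now_ms;
        s->count = 1;
    } else {
        s->count++;
    }

    if (s->count > s->max_count)
        s->max_count = s->count;

    return s->count >= s->threshold;
}
```

With a threshold of 1000, twenty calls spread over less than 100ms leave the counter at 20 and report nothing; a call at or after the 100ms mark resets the counter to 1, matching the T21 step in the talk. What to do on detection (print the IRQ number, disable the IRQ, or panic) is left to the caller in this sketch.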
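
Editor's note: for comparison, the simplified spurious interrupt counting from earlier in the talk (increment while each unhandled interrupt arrives within 100ms of the previous one, clear to 0 otherwise, act at 99,000) might look like the following. It follows the talk's simplified description only and is not the kernel's real implementation; all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPURIOUS_GAP_MS 100u   /* maximum gap between consecutive spurious interrupts */
#define SPURIOUS_LIMIT  99000u /* counter value quoted in the talk */

/* Hypothetical per-IRQ state for the simplified scheme. */
struct spurious_state {
    uint64_t last_ms;  /* time of the previous unhandled interrupt */
    uint32_t counter;  /* consecutive "close together" unhandled interrupts */
    bool     seen;     /* whether any unhandled interrupt has been seen yet */
};

/*
 * Call once per unhandled interrupt. Returns true when the counter reaches
 * the limit, i.e. when the sketch would disable the IRQ and print its number.
 */
static bool spurious_note_unhandled(struct spurious_state *s, uint64_t now_ms)
{
    if (s->seen && now_ms - s->last_ms < SPURIOUS_GAP_MS)
        s->counter++;   /* arrived within 100ms of the previous one */
    else
        s->counter = 0; /* first one, or the gap was too long: clear */

    s->seen = true;
    s->last_ms = now_ms;
    return s->counter >= SPURIOUS_LIMIT;
}
```

The difference from the storm detection sketch is the time reference: here each interrupt is compared with the previous one, while the storm detection feature compares against the start of a fixed 100ms window and uses a user-configurable threshold.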