So I'm sounding way better than before. So what I want to pick your brains about, or talk to you a little bit about, is something we are working on which is not related to the RT patch set. We are thinking about how you can help users or developers actually find problems in systems. So I thought about pattern-based things, and that involves big data. Now let a kernel developer think about big data; what could go wrong? Actually, everything. Because I started with a very naive approach: just create a gazillion traces and then reduce the space. I was able to reduce the space, but I got roughly a gazillion unique patterns back.

What I wanted to achieve in the long run is having pattern recognition in some form and then documenting what each pattern means: what is this, what are you looking at? And no, we're not going to start documenting a gazillion different patterns. I estimated we would be done with the set I had somewhere in 2019; I might be off by a year or two, but that's roughly the time frame it takes. So I had to go back and revisit my approach. I actually talked to people who understand how big data mining works. I really was naive: you have to split it into sub-patterns, and you have to tell the data miner what you're actually looking for. That makes it way, way easier. So I think that documentation of the sub-patterns is possible. And when you look at something, you probably always see a combination of the sub-patterns, and then you need tools to stitch them together so you get coherent information out of it.

So basically what we're trying to do is come up with abstract descriptions of what's happening. First, when the task is woken up, why is it not getting onto the CPU? We're just looking between the wakeup and the context switch. There might be something else which is higher priority, there might be an interrupt interfering, there might be preemption disabled for a very long period of time, and things like this. You can see that from traces. And then if you go further down, from the context switch to going out to user space, what can go wrong? You can get preempted once more, you can run into a lock and block on something. There are a lot of variants of that scheme, but we can reduce it to a comparably small number of patterns, and then go and document the patterns and say: OK, this happened, you might look into that. It's not going to tell you where exactly the problem lies, but it tries to help you understand what the problem is.

So that's what I really wanted to talk to you about and pick your brains on. Do you think that's something worthwhile to pursue? Or would that be something you wished you had two weeks ago when you were staring at a trace and couldn't figure out what the hell it meant?

Another idea I have for what it could do is a kind of warning system. If you just have stuff running in your test lab and you watch and analyze the traces you generate, and you know there's something you really want to avoid on the way back to user space, like lock contention and things like that, you can use it as a kind of: oh, it's not failing, but this shouldn't be there, so I have to look at why this is happening. So that could be a failure prediction mode as well. But yeah, I have nothing working right now; I have some horrible scripts for that.
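To make the wakeup-to-context-switch classification described above a bit more concrete, here is a minimal sketch in Python. It assumes trace events have already been parsed into dicts with made-up field names, and the preempt_disable_long event is purely hypothetical; treat it as an illustration of the sub-pattern idea, not a working tool.

```python
# Minimal sketch: classify why a woken task did not get onto the CPU right away.
# Events are assumed to be dicts with fields like "ts", "event", "pid", "prio",
# "next_pid", "next_prio"; "preempt_disable_long" is a made-up event name.

def classify_wakeup_delay(events, pid):
    """Collect coarse sub-patterns between sched_wakeup and the switch-in of pid."""
    patterns = []
    waiting = False
    wake_prio = None
    for ev in events:
        if ev["event"] == "sched_wakeup" and ev["pid"] == pid:
            waiting = True
            wake_prio = ev["prio"]
        elif not waiting:
            continue
        elif ev["event"] == "sched_switch" and ev.get("next_pid") == pid:
            break                                              # task finally got the CPU
        elif ev["event"] == "sched_switch" and ev.get("next_prio", 999) < wake_prio:
            patterns.append(("higher_prio_task", ev["ts"]))    # lower number = higher priority
        elif ev["event"] == "irq_handler_entry":
            patterns.append(("interrupt", ev["ts"]))
        elif ev["event"] == "preempt_disable_long":            # hypothetical event
            patterns.append(("preemption_disabled", ev["ts"]))
    return patterns
```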
So we are at the state where we are trying to come up with a proper storage format for the patterns, so that we can actually identify the sub-patterns and find the places in the trace where we have the explanation. And then there is the other part, where we search in a large trace for something. That's always going to be something like: where is the longest latency from wakeup back to user space for a particular task, and then you get the information about what happened on the way from the wakeup to that point. Or we can even extend it up to going back into kernel space, so to when your computation finally ends and you go back and wait for the next period, for example. This kind of search can be done.

So that's where I wanted to pick your brains. Is that useful? Would that be useful? Would you have wished that it had been there before?

Yes, especially for post-mortem analysis. After certain things have crashed, having something automated to look at traces, at least to rule out the common cases, would help. I spend pretty much 90% of my time looking at traces only to realize that this is something I've seen before, because it has already been fixed. So I have some scripts now, which...

Yeah, I guess everybody who has to deal with traces has a gazillion scripts.

Yeah, so basically you throw some regexes at it and you do some poorly implemented token-based, whatever, analysis. Having something more comprehensible, with a bit more finesse than what my Python skills allow, would be a tremendous help. So I would be very much interested in helping with that.

So the idea, what we want to do, is basically: once we have established the way we are doing this, create that pattern database and make it open, so people can actually look at it, help with the documentation and things like that, and extend it, of course.

Basically, internally, for the energy-aware work, we've developed tools that work from traces. We parse the trace data and convert it into Pandas data frames. The Python stuff basically creates a database of what's happening in the trace, and then we use that data to do different sorts of things. For example, you can plot any signal, for example the utilization signal of a task, or you can plot your frequencies. And on top of that we also built a behavior analysis tool, so there is a way to specify, I don't know, that I want to check that this task didn't exceed 10% of utilization across the trace. So you can actually use it for regression testing. I don't know if that's something you would be interested in.

Yeah, sure. I mean, it's related, or kind of related. We certainly want to look at that. How are you doing that? I'll send you a link. Yeah, great.

So, any other ideas people have on how we could tackle that? Well, by the way, you said we'd get back to that. Yeah, I'll find it. Or I'll get back to you. I know how to find you.
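As a rough illustration of the Pandas-frame approach just described, a sketch of what such a behavioural check could look like; the column names, the task name and the 10% limit are assumptions for illustration, not the actual tool's API.

```python
import pandas as pd

# Sketch of the Pandas-frame idea: trace samples end up in a DataFrame and a
# behavioural property is asserted over it, e.g. a utilization limit per task.

def check_utilization(df, task, limit_pct=10.0):
    """Fail if the given task ever exceeds limit_pct utilization in the trace."""
    util = df.loc[df["task"] == task, "util_pct"]
    peak = util.max() if not util.empty else 0.0
    assert peak <= limit_pct, f"regression: {task} hit {peak:.1f}% utilization"

# Tiny usage example with made-up samples
frame = pd.DataFrame({"ts": [0.0, 0.1, 0.2],
                      "task": ["my_rt_task"] * 3,
                      "util_pct": [4.2, 7.9, 6.1]})
check_utilization(frame, "my_rt_task")   # passes; would raise if any sample > 10%
```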
So Paul, you had something.

Depending on what you're trying to do, there have been some attempts in various places to try to map from code patterns to all sorts of things. A lot of them have been less sophisticated and have kind of had a shelf life: they've worked for a while, and then people have invented smarter bugs that avoid the patterns but are still bugs. Right. So is this intended to be kind of an ongoing thing?

Yeah, I mean, once it's out there, you get something where, if you look at an incident where we have this long latency thing, the pattern matching algorithm says: hey, there is something I do not even know about. So you draw a blank there. But then the tool should be able to create an abstract description of it which you can match the next time. You integrate it into the database and somebody has to fill in the documentation for it, of course, somebody who understands it. Right. And that's the whole idea: that we can extend the database over time, because I'm not going to find all the patterns just by running it on a gazillion machines.

No, no, you can't find all the patterns. Well, you might ask Watson for it. It can find all the patterns; they just won't be useful at all. At what point would you be interested in me pointing various random people at it, some of whom may be useful and some of them not?

Yeah, at the point where we actually have something which does not only consist of a bunch of totally ugly Python scripts and shell scripts and whatever. You don't have to go easy on them. No, I mean, if I stare at the stuff and it makes my eyes bleed, I don't want to expose it to other people. OK, let me know when you do that. I mean, I don't want to give other people an excuse to show me back their kernel patches which make my eyes bleed again. No, no, no, I'm not going there. Well, let me know when you're at that point. Yeah, yeah, it's going to be public.

So any other questions or ideas? Anyone with experience in that area, doing data mining?

Thanks. In our experience, the typical thing is that first you want to spot the problem, and very often it will be that you have a missed deadline. If you have a trigger there, then you can get some more tracing; you can even sometimes have an Intel PT hardware trace running in the background, and then you trigger a snapshot and take a copy of that. Once you have that, there are various tools. Pattern matching is one of the tools. But if you have a hardware trace, you can go into the debugger and see exactly what happened. But it could be a long stretch.

Another thing that is extremely interesting: from the time where my period starts to the point where I actually exceeded my deadline and triggered a snapshot, if you look at your trace of scheduling switches and priority changes and everything and you ask for a critical path analysis, and we have that in Trace Compass, the results usually are extremely interesting. You see exactly what happened and what caused what in terms of blockings, and you have just that chain of events that made it take so much time between the time where you were ready to run and the time where you didn't finish in time. Right.

Then we have also experimented with pattern matching, and pattern matching indeed does provide some interesting results. But the patterns that we have usually are relatively simple things like: you should never be preempted when it's your main real-time task, you should be using all of the CPU. There are a number of simple patterns like that. And then for the violations, if you can pinpoint them to specific events or very short regions, we've had some success with that as well. But it's more recent work, and we have to work more on that. But indeed, I believe that everyone is telling us that the traces are huge and they don't want to look at every event by hand. So you need some interesting tools to pinpoint the most important stuff.
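One of the simple patterns mentioned above, the main real-time task never being preempted, could be checked roughly as in the sketch below; the event dict layout is an assumption, not the output of any real parser.

```python
# Hedged sketch of a simple pattern check: the main real-time task should never
# be preempted. In ftrace terms, prev_state "R" at a sched_switch means the task
# was still runnable when it was switched out, i.e. preempted rather than blocking.

def find_preemptions(events, rt_pid):
    """Yield timestamps where rt_pid was switched out while still runnable."""
    for ev in events:
        if (ev["event"] == "sched_switch"
                and ev["prev_pid"] == rt_pid
                and ev["prev_state"] == "R"):
            yield ev["ts"]
```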
I mean, most people I know have tons of horrible Perl scripts, Python scripts, sed and awk. I just make them up as I go along. Yeah, of course, because you're mostly looking at the text-based traces, so you always have to do the extra regex scraping for everything. Yeah, we're trying to do that on the CTF format, because doing that on the text-based representation is just horrible.

A graphical view of the critical path is very nice. I know that scripts are fine and so on, but if you have a graphical interface with a graphical view of the critical path, you can zoom and look. Yeah, I mean, that's nice, but that's a lot. The idea behind this is that I want to give people, without having to install and figure out how to use Trace Compass, a very quick, at least, hint at what they are dealing with. And then we can extend the pattern space and make it big enough that it also covers the more complex cases, because the more complex cases are basically a combination of simple patterns. So you have those: you run into a lock contention five times in a row, you get migrated, preempted, run into a lock contention, whatever, any combination of all of those and some more. And if we are able to have these simple patterns identifiable by a machine, then we can start looking at combinations, because then we can say: OK, look for any combination of those and stitch them together, and you still get out a very clear description of what happens from here to here. So here you got preempted, here you ran into a contended lock, then the interrupt came in and the handler took over an hour, and then you finally reached user space, where you spent a gazillion amount of time, exceeding your deadline anyway, and then came back. These things, I think, can be done simply for a quick look; if you have to dig deeper, that's what the traces are going to tell you, and then you're lost anyway if you don't have detailed information like hardware traces or function traces or whatever.

But what we really want to look at, especially if you do flight recording and you just keep streaming out data in a permanent way, is that you can actually find certain things which you do not expect to happen, or which you say shouldn't happen, but maybe they happen once a month. So that's more on the failure prediction side, I think, where we can make use of that as well. Because that's going to give you the quick hint: oh, there is actually lock contention in the way. It only happens once in a blue moon, but it is there, so go look for it. And then you have to instrument deeper and actually look at what kind of lock it is and where the contention comes from and things like that. It's not going to magically make the stuff very, very easy, like, oh yes, it's file X, code line number 17, and here is the description of the race. Great, yeah, I would love that, but it's not going to happen. But it's meant to aid things, to make things a little bit easier for people.
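The stitching step described above, combining independently detected sub-patterns into one chronological story between two points in the trace, might look roughly like this; the (start, end, label) interval format is an assumption for illustration.

```python
# Rough sketch of the stitching step: merge sub-patterns found by separate
# matchers into one time-ordered description between two points in the trace.

def stitch(patterns, window_start, window_end):
    """Return a time-ordered, human-readable summary of sub-patterns in the window."""
    inside = [p for p in patterns
              if p[0] >= window_start and p[1] <= window_end]
    return "\n".join(f"{start:12.6f} .. {end:<12.6f}  {label}"
                     for start, end, label in sorted(inside))

# Example with three made-up sub-patterns between a wakeup and the return to user space
print(stitch([(1.10, 1.15, "preempted by higher-priority task"),
              (1.15, 1.30, "blocked on contended lock"),
              (1.30, 1.32, "interrupt handler ran")],
             window_start=1.0, window_end=2.0))
```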
OK, any other opinions on that? Let me look at the schedule, where we are. I don't know. And what does that tell me? That tells me we're ahead of the coffee break. The coffee break is supposed to be in 20 minutes, but if there are no more questions on that, I'm happy to let you out for your well-earned coffee break. And then we're back at 4:10 looking at something else which might fry your brain, at least it fries my brain on a regular basis if I think about it: discussing problems with futexes we've seen, what we still have to fix on the glibc side, and what the kernel needs to do to make glibc's life easier. See you later.