Okay. Well, I think that's close enough to 11:30. I'd like to welcome everybody to the last session before lunch; everybody will be on edge and ready to pounce. My name is Stan Shebs. I work for Mentor Graphics, in the CodeSourcery division of Mentor — Mark Mitchell's company, which was bought a couple of years ago and absorbed into Mentor's embedded software division, along with the Accelerated Technology guys and a bunch of other absorbed companies whose names I don't remember. Actually, most recently we picked up MontaVista's automotive Linux guys, taking them off Cavium's hands, because Cavium is really a networking company and didn't want to do automotive. Anyway, I'll talk a little bit about GDB work that's been ongoing, progressing gradually from things that actually work to raw dough that your mother warned you not to eat.

Just as background: GDB, the GNU debugger, has been part of the GNU system since around 1986 or so, when it first appeared as a subdirectory in the GNU Emacs distribution. We don't actually have much history going back that far, because nobody had source control or kept much track, or even said much about what they were doing — it just showed up. It was originally set up to be a native debugger for Unix-type systems, was extended early on for cross debugging, and has since gone through a number of major redesigns and rewrites: the target vector, which I'll talk about more in a bit, is an abstraction for the back end (there's a little sketch of the pattern below); there are objects to represent stack frames; and gdbarch represents target architectures as objects rather than as piles of macros, which is still familiar even today to people who work on GCC. In fact, just as an amusement, a couple of times I've done a diff of the lines from the earliest version of GDB to nowadays, and there are literally just a couple of pages' worth of lines that are still the same — pretty much everything has been worked over.

It's been the default debugger for Linux since forever. There is work on an LLVM debugger called LLDB, which is a possible successor, but it needs regular patching to be made to work on Linux. Apple uses it as their default debugger now, as part of Xcode, but unsurprisingly Apple is not quite as energetic about maintaining a Linux port.

So, a quick review of the big picture of how it's put together. At the top we have the different user interfaces: the command line interface; the machine interface, which is a more heavily sugared version of the command line interface that Eclipse uses; GDBtk, a GUI; and the TUI, a terminal user interface, which is curses based. On one side we have the so-called symbol side, which is all the symbol table handling. BFD is the main intake library — it's what actually knows how the object files look — and it dumps blocks of raw symbol data into handlers for the different file formats. At this point ELF is really the only live one; there used to be a lot more. Then there are add-ons for the symbolic debug information, which is kind of an M-by-N matrix against the file formats, though not entirely. So in theory you could have a.out format with DWARF information and weird combinations like that; in practice, what people mostly do nowadays is ELF and DWARF. On the other side we have the specifics of how you interact with targets. So, for instance, linux-nat is the code that does Linux native debugging specifically, and exec is for getting information out of the executable.
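To make the target-vector idea concrete, here's a minimal C sketch of the pattern — a struct of function pointers that each back end fills in. The field names are illustrative, not GDB's actual ones; GDB's real version is struct target_ops, with far more entries.

    /* A minimal sketch of the "target vector" pattern: the core of the
       debugger calls through these function pointers, and each back end
       (native, remote, core file, ...) supplies its own implementations. */
    #include <stddef.h>
    #include <stdint.h>

    struct target_vector {
        const char *name;                  /* e.g. "remote", "linux-nat" */
        int (*read_memory)(uint64_t addr, void *buf, size_t len);
        int (*write_memory)(uint64_t addr, const void *buf, size_t len);
        int (*fetch_registers)(int regnum, void *buf);
        int (*resume)(int step);           /* step != 0: single-step */
        int (*wait_for_stop)(void);        /* block until the target stops */
    };

    /* The core uses whichever vector is currently selected, so the
       debugging algorithms never care what kind of target is underneath. */
    extern struct target_vector *current_target;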
So for instance, if you want to read an initialized variable before running the program, you can: you just print the value, and GDB digs it out of the executable, treating it as if it were a target system. Similar for core files. There's a whole bunch of other protocols for cross debugging, but the star of the show is remote.c, which has never been given a proper name — people will say "the remote protocol" or "RSP" or whatever. A lot of the time I just call it the remote.c protocol, because that's the file it comes from. It's the default remote debugging protocol, and it has a lower level that can go over TCP or a serial line or whatever. Then on the target side, if you're not talking to the target directly somehow, there's gdbserver, which is very often used in the embedded Linux case: gdbserver sits on the target and does the target side of the remote protocol.

The other pieces: we have a collection of languages — the language support does parsing and that sort of thing — and then the gdbarch objects, which define things like: what's the byte pattern of the trap instruction? How do you take apart a prologue? How do you interpret a stack frame? And so forth. Finally, in the middle is the execution control, and this is what actually implements the algorithms that are at the heart of debugging.

To take an example: single stepping. Stepping at the instruction level basically goes straight through. You say stepi (or si), GDB sends a single command to the target, and the target does whatever is target-specific to single step; for Linux native, that amounts to a single ptrace call. But to single step by source line, there's actually an iterative process: single step by instruction, read the PC back, go to the symbol table and ask, is the PC now pointing at a different source line? If it is, we're done single stepping. (There's a sketch of that loop below.) That's actually one of the reasons single stepping optimized code is funny: the compiler has taken a single source line and scattered bits of it all over the program, and the GDB algorithm just blindly says "oh, it's a different source line," not realizing it's actually the same source line you ran a moment ago and the code just jumped you back there. So that's what execution control is all about: those kinds of algorithms.

On target system trends, as we saw yesterday: lots and lots and lots of Linux targets — probably over a billion out there now. It goes from the low end (the phones, the tablets, the gadgets), through the middle (these automotive infotainment systems, which are more or less desktop-sized but aren't actual desktops), up to the high end (the backbone switches and compute farms and so on, which are pretty big iron). And desktops — yes, all the enthusiasts have desktop Linux, and nobody else does.

Coming down to the multi-core aspect of it: right now, two to sixteen cores is pretty commonplace. Different people have different numbers of cores; go around the room and look at everybody's laptops and there's anywhere from two to eight, and desktops will have more. In the lab, though — Adapteva, Tilera, and so forth — there are quite a few higher core counts being experimented with.
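Here's a simplified sketch, in C, of that step-by-source-line loop. The helper names are illustrative; the real logic lives in GDB's execution control (infrun.c) and handles far more cases.

    #include <stdint.h>
    #include <string.h>

    struct line_info { const char *file; int line; };

    extern void target_single_step(void);             /* e.g. one ptrace call */
    extern uint64_t read_pc(void);
    extern struct line_info pc_to_line(uint64_t pc);  /* symbol-table lookup */

    /* Keep single-stepping instructions until the PC maps to a
       different source line than the one we started on. */
    void step_source_line(void)
    {
        struct line_info start = pc_to_line(read_pc());
        for (;;) {
            target_single_step();
            struct line_info now = pc_to_line(read_pc());
            /* The optimized-code flaw: if the compiler scattered one
               line's instructions around, a "different" line here may
               really be the same source line, reached via a jump. */
            if (now.line != start.line || strcmp(now.file, start.file) != 0)
                break;
        }
    }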
And there are actually some thousand-core systems that have been kicked around, although some of them cheat: they're hyperthreaded up to a thousand threads, so there aren't quite that many physical cores. This has some implications for debugging, because there's a lot more going on in the target, and a lot more going on simultaneously. There are a number of approaches we could take to this.

One approach is to do nothing. You say: the kernel conceals the fact that there are multiple cores, it allocates threads to cores however it feels like, GDB already knows how to do threads, and ptrace does the hard work of connecting the threads to the debugger. So you're good. This isn't necessarily a great approach; it starts to break down as the number of cores increases. One thing that can happen is an application with core affinities, where threads are actually allocated to specific cores (there's a small example of that kind of pinning below). You could have a situation where there's a bug in the system, one core is behaving well and another core is being flaky, and the fact that the threads are on particular cores is the root cause of the problem. There are also designs — we saw a couple put up on the screen yesterday — with heterogeneous cores, where they may all be ARMs but you have, say, A4s as a sort of compute fabric and then an A9 as the main processor.

Another thing that starts to happen is that with many more cores, you can have many more threads than before. So a second approach, which I'll talk about in a little bit, is to generalize the commands to operate on sets of threads or sets of cores rather than one at a time. And finally, in the most extreme case, we can start shifting work to the target: instead of having the host GDB do everything, with an immense amount of traffic going to and from it, we get that work off the network and onto the target somehow. I'll go into what all that means in a little bit.

As an illustration of what's going on here: in a single-core system with a bunch of threads, the debugger doesn't get overwhelmed even if you have a lot of threads, because ultimately everything runs one at a time. The X's here indicate breakpoint hits: even if all the threads hit the breakpoint, it's really still only being hit one at a time, with a little interval between one hit and the next — and the intervals can even be long if there's other activity going on. Plus, if GDB is running natively on that system, some of the gaps between the threads running are where GDB itself is being swapped in, so GDB is essentially throttling the rate at which the breakpoints are hit. But on a four-core system, those breakpoints could all be hit literally simultaneously, so now you have four requests for breakpoint handling arriving at GDB at exactly the same moment. And what do you do about that? You can end up with a congested system, with GDB as the bottleneck.

So the first thing — and this is in the category of things that actually already work — is to only operate on one thread at a time. The old GDB behavior was to stop the entire system, and if you think back to 1986, there wasn't a whole lot of threaded programming going on.
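For what it's worth, the core-affinity situation is ordinary application code — something like this Linux-specific sketch, where a worker thread is pinned to one core (the core number is arbitrary; compile with -pthread):

    /* Illustrative application code, not GDB code: pin a worker thread to
       core 3 with the Linux-specific pthread_setaffinity_np, the kind of
       setup where "which core is this thread on" becomes part of the bug. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        (void)arg;
        /* ... work that only misbehaves on the flaky core ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        cpu_set_t set;

        if (pthread_create(&t, NULL, worker, NULL) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }

        CPU_ZERO(&set);
        CPU_SET(3, &set);                       /* core 3 only */
        if (pthread_setaffinity_np(t, sizeof set, &set) != 0)
            fprintf(stderr, "pthread_setaffinity_np failed\n");

        pthread_join(t, NULL);
        return 0;
    }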
So the entire process was brought to a halt, you stepped, and then it was all brought to a halt again. This generally works for threads — except for all the real-time threads that time out because you're sitting at a breakpoint. So several years ago we introduced non-stop mode, where you can leave a number of the threads running and start and stop an individual thread. You can still ask to interrupt all the threads, or stop just one thread and then continue all the others, and so forth. This has been available for some time. I don't know how many people actually use it or have used it. Yeah, I had a feeling — it's not been widely advertised.

The way it works is that you can't do the usual dance of removing the breakpoint's trap instruction and reinserting it, because even if it's only one instruction you're single-stepping past, any of the other threads running through could escape and get past the breakpoint at that moment. So we added a back-end facility to GDB to do so-called displaced stepping (there's a sketch of it below). The way that works is that we leave the trap instruction in place, but the instruction that was overwritten gets copied to a different location. We modify it if it's at all position-dependent — for instance, a PC-relative memory reference — and afterwards there's a jump back to the following instruction, if necessary. If the instruction being overwritten is itself a jump, we just modify the target of the jump. This allows the trap instructions to stay in place continuously and catch all of the hits.

The user interface for this multiple-thread situation is still basically a one-thread-at-a-time kind of operation. Even in non-stop mode, where you can continue all the threads or stop all the threads, you still have to say either all the threads or just one of them. And if you want to step one thread and then another, you actually have to switch to a thread using the thread command, step it, switch to a different thread, step it, switch again, step, and so forth. If the threads are at all organized or coordinated, you get a kind of chaotic situation where the program eventually makes forward progress, but its behavior is not at all like how it works in real life. If everything is interlocked in a way that has no timing dependencies, this kind of sort of works; if there are any interdependencies, you'll get so many timeouts that it essentially becomes unusable. It's the kind of thing that works fine for a worker and a consumer thread, maybe a couple of worker threads, but anything past that and it really breaks down.

So the proposed change is to extend the GDB commands to work on sets. I'm being a little bit vague about what the sets are actually called, because we've had a couple of different versions of the name and have managed to generate all kinds of confusion without even finishing the implementation. But the general theory is that a set consists of processes.threads at cores. And this is a fairly general framework: you can explicitly name the threads or processes, you can specify them by ranges, you can use predicates, you can do set operations of the usual sorts, and you can define named sets and then use those in commands.
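Here's a rough C sketch of the displaced-stepping idea. All the helper names and the instruction type are stand-ins; the real implementation is per-architecture code inside GDB and handles many more instruction forms.

    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t insn_t;            /* stand-in instruction encoding */

    extern uint64_t scratch_area(void);                 /* free target memory */
    extern insn_t original_instruction(uint64_t addr);  /* saved at insert time */
    extern insn_t adjust_for_new_address(insn_t insn, uint64_t from, uint64_t to);
    extern void write_target_memory(uint64_t addr, const void *buf, size_t len);
    extern void set_pc(uint64_t pc);
    extern void target_single_step(void);
    extern int is_jump(insn_t insn);
    extern size_t insn_length(insn_t insn);

    /* The trap instruction stays planted at the breakpoint address; the
       displaced copy of the original instruction runs from scratch space. */
    void displaced_step(uint64_t bp_addr)
    {
        uint64_t scratch = scratch_area();
        insn_t insn = original_instruction(bp_addr);

        /* Fix up anything position-dependent (e.g. a PC-relative
           reference); if the instruction is itself a jump, retarget it. */
        insn_t fixed = adjust_for_new_address(insn, bp_addr, scratch);

        write_target_memory(scratch, &fixed, sizeof fixed);
        set_pc(scratch);
        target_single_step();           /* execute the displaced copy */

        /* A non-jump resumes at the instruction after the breakpoint;
           a jump has already put the PC where it needs to go. */
        if (!is_jump(insn))
            set_pc(bp_addr + insn_length(insn));
    }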
So, just to give a couple of examples: if you want to say all threads of a particular PID, say 4563, that's 4563.*. You can tie it down to a specific subset of cores — say you have 40 cores and only want to look at the first 20 — by adding an at-sign notation. In case anybody's wondering why these are in different colors: the at-sign wasn't a great choice for PowerPoint-type presentations, because all the software thinks it's an email address. In fact it's horribly difficult to edit, because it really, really wants to treat the whole thing as a single object. Thank you, presentation software. Anyway, there's no deeper significance to that. And for some reason LibreOffice doesn't really know about quoting either; otherwise I'd have put some backslashes in or something.

Another possibility is that if you had named threads, you could say something like *.signalshaper, and that would designate the thread of that name across a whole set of processes. So if you fork several processes and they all have generally the same threads within them, you could set a breakpoint that would apply to all the processes — by a mechanism as yet undetermined. Yeah, there's a bit of a theoretical aspect here. There's a kind of POSIX thing where you can apply a name to a thread, and nobody uses that, that's true. This is why I say it's getting a little closer to the half-baked range. There are targets — RTOSes and so forth — that do have named threads built in as part of their design. Another part of the set notation is that you can just omit things: if you just want to refer to the current process, you can omit the process specification and just say 1-20.

Now for some command examples. The original idea I had was actually to use a kind of square-bracket notation as a prefix to the commands, and it turns out the Swedes didn't like that at all. Why the Swedes? Well, if you've ever seen a Swedish keyboard, typing a square bracket is not easy, so they didn't like the idea of having to type square brackets in front of every command just to do multi-core commands. At the moment the final syntax has been left a little vague, so for concreteness here I'm using the set as an argument to the commands, since that's something we can fit into the GDB syntax.

Our first example: single step all threads of all processes that are currently attached to core six — that is, any process that has its affinity set to core six, so that's the core they'd be using. Similarly, you could do a continue specifying a specific set of threads, including both numbers and names — useful if you can't remember what number the thread named "worker" was given, which can differ depending on how the thread allocation works out. This particular continue command solves the problem of having to start individual threads one at a time and then getting into the ordering difficulties.

We have an existing modification to the break command where you can give a function and specify a thread, but right now it's limited: the breakpoint can only be attached to one particular thread. That's not very easy to use if you want the breakpoint to trigger on several threads but not all of them. So this set notation gives us a way to fix that long-standing problem.
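Collecting those examples in one place — this is the provisional syntax sketched in the talk, not anything a released GDB accepts, and the thread names, numbers, and command spellings here are all made up for illustration:

    4563.*           # all threads of process 4563
    4563.*@1-20      # the same, restricted to cores 1 through 20
    *.signalshaper   # the thread named "signalshaper" in every process
    1-20             # process omitted: threads 1-20 of the current process

    stepi *.*@6      # single step every thread attached to core 6
    continue 3,7,worker                  # threads 3 and 7, plus "worker"
    break dispatch thread-set 4563.1-4   # trigger on several threads, not all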
Now, the whole set idea is actually not original with anybody in the GDB world. It came from TotalView, and it was a quasi-standard from the High Performance Debugging Forum, which was briefly active in the mid-to-late 90s. They actually produced a sort of standard, and then the whole project kind of disappeared. You can still find the HPDF spec out on the net, and I think we made a copy onto the GDB wiki just for documentation purposes. In any case, to see a real live example, you can go look at TotalView's manual: they have a very similar notation worked out in great detail.

I originally put this idea up for GDB a couple of years ago, first calling them process/thread sets, as in HPDF. Then there was interest in doing this for cores, mostly in an embedded context where the cores were not all homogeneous, so we extended that, calling them PTC sets: process/thread/core sets. And then the implementers said, well, in GDB we'd generalize this to inferiors. In general we call the things that GDB is operating on "inferiors." The traditional belief is that they're called that because the child of a forked process is called an inferior process. But actually, in the debugger world, we call them inferior processes because, well, they're inferior, right? They have bugs. If they didn't have bugs, we wouldn't be running them under the debugger. So by definition they're all inferior compared to GDB. So anyway, the idea was to call them inferior/thread sets, or IT sets for short. In retrospect that probably wasn't a great choice either, because an embedded, sort of bare-metal target is an inferior in the GDB sense but isn't really a process. So it's all a little bit confused, and it hasn't been decided what we're really going to call them.

In any case, Pedro Alves worked hard on this and got the work in progress pushed onto a GDB branch, and then he couldn't take it anymore, escaped from Mentor to go to Red Hat, and left the work behind. There's been a little bit of work on the infrastructure since then. While the data structure for the set operations is relatively straightforward, a lot of the old execution-control code in GDB still has implicit assumptions about a single thread of control, and that essentially needs to be rewritten to accommodate the possibility of multiple threads running simultaneously. So that's been going on, though it's not been real speedy. We've got an updated patch that Yao Qi has been working on; he's been holding it off until 7.6 has been released, so everybody doesn't get distracted. So check out the branch. Okay, that's the set stuff, and hopefully we'll see it before the end of the new millennium.

Anyway, I made reference before to the host as a bottleneck, and GDB is really like a duck: calm on the surface, paddling madly underneath. When you do a single step command, there's a lot that goes on behind the scenes — getting a register, and so on. If you ever turn on any of the remote debugging flags, you can see just how much traffic goes back and forth. There's a lot of reading pieces of memory, looking at a piece of memory and saying, well, I should dereference that because it's actually a stack pointer, going to get a piece of the stack, finding that it refers to another variable — oh, I need to go get that too. And it pretty much does everything one at a time.
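Those debugging flags are real, current GDB commands, if you want to watch the chatter yourself:

    (gdb) set debug remote 1
    (gdb) stepi
    # Every remote-protocol packet sent and received is now echoed;
    # even a one-instruction step produces a noticeable burst of traffic.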
And if you have a single-core target, even with multiple threads, then as the diagram earlier showed, only one thread at a time really needs attention — all the other threads are suspended and don't generate this traffic back and forth. So there's a nice self-throttling effect: even if there's a lot of traffic, your single-step performance isn't really degraded. But again, on a multi-core system, if the same thing happens, you're going to have hundreds of threads hitting that same breakpoint, all wanting the same kind of attention. Now you have a huge number of packets or ptrace calls. And even though a ptrace call is native, the OS still has to switch context, dig stuff out of the other process, and switch context back. So even without network traffic, you still get the same bottleneck situation.

So the idea is to get the host out of the critical path by moving work to the target somehow. It's a little bit of a strange idea. It was partly inspired by the tracepoint work we did several years ago, in which you can essentially collect data without stopping the program: GDB sets up the tracepoints and installs them on the target, the target does all the work, and later on you go back and pick up the data that was collected. The way this works is that you have some kind of an agent. The original stub is just "get a byte, return it," but the agent can do more work. In fact, with the tracepoints it got as far as designing a small bytecode engine and sticking it in the agent, because you had, say, complicated variables where you had to calculate an offset based on the data type of the variable — say, an array — and you needed some way to do that computation on the target side without getting GDB involved.

So, ways to build this target agent that does more work. You can do a static link. We did this for one customer whose program had seven threads: we added an eighth thread, which was actually a debugging stub, and it controlled the other seven threads with signals. It worked really well. It's unfortunate we couldn't give it out to the world — it was very customized to their application — but it was a nice little scheme where the program didn't require a gdbserver and no context switching was required. (...Oh, you've got to be kidding me. That Linux software. We'll see how long this lasts.) Okay, so anyway, one way to do this is statically linking in a dedicated thread. Another way — and this is what gdbserver actually does now — is to have a copy of the agent, or more exactly two copies of the agent: one sits in gdbserver and a second sits in the application as a dynamically loaded library, and it all gets managed behind the scenes.

This is actually not a totally new idea. Some time ago, about five or ten years, we introduced the Z packets to the remote protocol for setting breakpoints, and what they did was move the breakpoint trap management to the target. Instead of going back and forth — reading the instruction underneath the trap instruction, inserting the trap instruction, reinserting the original instruction — you send a single packet to the target and it does the rest of the work. More recently, we took this further, to breakpoint conditions. So you have something that looks like: break foo.c, line 45, but only if some global variable is more than 92.
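For reference, the Z packets are real remote-protocol syntax. A software-breakpoint insert looks roughly like this (packet framing and checksums omitted; the address is a made-up placeholder):

    Z0,4005d0,1
    z0,4005d0,1
    # Z0 inserts a software breakpoint, z0 removes it; "kind" (the 1) is
    # a target-specific size code.  A target-side condition rides along
    # as compiled agent-expression bytecodes:
    Z0,4005d0,1;X<len>,<bytecodes>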
And the way this works by default is that the breakpoint is always hit and control comes back to GDB. GDB evaluates the comparison, which means it's pulling over the data for the global variable, and if the condition is false, GDB continues the program. So again, duck style: all this chatter goes back and forth, and the ultimate result is just that the program continues — well, continues after having had its time wasted for some fraction of a second. Coming back to the hundred threads: you have a hundred threads all marking time waiting for their conditions to be evaluated. It goes from an unnoticeable nuisance to everything being dead in the water for a noticeable amount of time.

So we introduced these target-side breakpoint conditions. We borrowed the tracepoint mechanism: GDB compiles the conditional code to bytecodes and sends it on down, and with the help of a suitable target agent — which includes gdbserver — the target can evaluate the condition all by itself (the commands are shown below). It can't do everything: for instance, if you have a reference to a convenience variable, which is only maintained inside GDB, it still has to come back to the host for that. But for a lot of kinds of real-life conditions it works pretty well, and it's actually in now — it was in the last release, GDB 7.5.

So anyway, now we're getting to the really unbaked material. A lot of this stuff is based around trying to hack up the remote protocol and redesign how things work. Now, this protocol was really designed for 68000 debugging over a serial line running at 1200 baud, where you could only assume there were a few kilobytes of memory on the target even available for you to work with. If you've ever looked at the example stubs in the GDB sources, you'll see some effort has been made to make them as small as possible and do as little as possible. Literally, it's brought down to the single-letter command g to get all registers. That's all fine for a 68000 or an x86; it's really not so good for a PowerPC or an ARM with a whole bunch of optional register units all loaded up, or for, say, an Xtensa, which has 2,000-bit registers, and lots of them — and the g packet has to return everything. So, to reiterate: there are lots of packets even to do a simple backtrace, we've got all the per-packet overheads, and then we multiply that by the number of cores. It's a situation where a thin protocol doesn't really seem like such a great idea anymore.

As part of that, the Multicore Association — the originators of MCAPI, and then the Common Trace Format — has had a working group for about a year now, developing a general services framework aimed mainly at debugging and tracing and that sort of thing. And it's not a very open group; that is, anybody can join and participate, but they don't put anything out on the net, which I don't entirely agree with. So this is just me speaking: there's really been no actual standard put forth, it's still at the discussion stage.

The general requirements — I see you all smirking out there; I know what you're thinking, and I'm going to get to you. Anyway, the general expectations are that it would be cross-platform and cross-technology, and there's a desire to have discovery. That is, instead of having to know what the target is ahead of time, you can essentially go out on the net and find targets that are amenable to debugging. And of course this is also part of the secret plan for world domination, because lots of Cisco routers have GDB backdoors in them already.
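As actual GDB commands — foo.c and "global" are stand-ins from the example, but the condition-evaluation knob is the real setting that arrived with this work in GDB 7.5:

    (gdb) break foo.c:45 if global > 92
    (gdb) set breakpoint condition-evaluation target
    # "target" pushes the compiled condition bytecodes down to the agent;
    # "host" forces the old round-trip behavior; "auto" picks target-side
    # evaluation whenever the stub supports it.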
It'd be handy to, like, discover them all at once. The protocol needs to have high performance, and that's not so much for debugging — because as much as I talk about the traffic, we're still talking about several tens or maybe hundreds of kilobytes of data — the real clients are the tracing tools, which can dump multiple gigabytes in a real hurry. Another part of this is that we want to coordinate different host tools. Right now there's a bit of a difficulty if, say, GDB is trying to control the target while something like LTTng is trying to return trace data: if the target is stopped because you said stop in GDB, LTTng isn't getting its data.

In the GDB context, to the extent that we've figured out what this is going to look like, one thing I think we'd like to do is extend the target command to take URIs (there's an example below). Here we'd say mcsf as the protocol, the slashes act as routing to an actual target, and then a PTC set can be an argument coming at the end of the URI — in this case, process 456 and all its threads.

The thing is service-oriented, so there are a number of services, and the services are defined by the target. For debugging, it seems to boil down to about four main services that you need. The first two are to get state and to set it, then execution control, and then instrumentation, which pretty much subsumes everything that's going to run independently: breakpoints, watchpoints, and so forth. The key thing that makes it different from the old remote protocol is that something like the get is allowed to have an arbitrarily complex set of arguments. In the example I threw out: get registers zero through seven from cores one through six, plus memory location two, plus sixteen bytes after memory location three. All of that is essentially a single packet, and the target does all the work to compose the result, however large it is, and returns it all at once. So we go from a situation with dozens of packets to just one. That's the theory — this actually hasn't been implemented.

So here's the artist's conception, in the way that hypothetical things always have the amazing artist's conception that makes you think you know how it's really going to work. We have two host tools: GDB, and Mentor's Sourcery Analyzer tool, which is really an Eclipse wrapper for a trace tool like LTTng. They can talk to proxies, which are essentially just communication forwarders to other targets — this subsumes things like JTAG probes and so forth. Then, looking deeper into a target, the target can have an agent, and the agent has a bytecode engine. The cores talk to the bytecode engine — that is, they're not interacting with GDB — and then hopefully, just from time to time, a core actually needs to interact with GDB, and it connects through to the host. So yes, this is a more complicated agent.

So anyway, that's the general picture, and now we're really at the raw stuff, so I'll take this opportunity to solicit some reactions.
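To make the proposal concrete — every bit of this is hypothetical syntax for the unimplemented framework described above; the mcsf scheme name, the routing form, and the request shape are all provisional:

    # Attach via a URI; the trailing "456.*" is a PTC set meaning
    # process 456 and all of its threads:
    (gdb) target mcsf://lab-gateway/board7/456.*

    # A compound "get": registers 0-7 on cores 1-6, plus the contents of
    # one address and 16 bytes following another, composed by the target
    # and returned as one reply:
    get reg:0-7@1-6, mem:<addr2>, mem:<addr3>+16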