So next up is Jeremy Kerr, with a talk on the Power architecture. Thank you very much. My name is Jeremy Kerr, I work for the IBM Linux Technology Centre, and before we get started I have a bit of a confession: I've actually spent the last year working on less kernel and more firmware. I'm allowed to say the F word, right? Firmware doesn't typically have a great reputation amongst system implementers and kernel developers. It's typically not open, not modifiable, not verifiable. Generally you get a binary blob of firmware that does things like introduce interesting bugs into your environment. So what I'm going to talk about today is how we're trying to address this in OpenPower, along with a general overview of what OpenPower is. And I guess we'll start there, with a broad overview of OpenPower. It's three things, basically. Firstly, a system architecture based on the IBM Power designs. Secondly, a collaborative development of hardware, systems and firmware between the OpenPower Foundation's member parties. And thirdly, a Linux-based platform, generally targeted at the cloud, the server market, and HPC as well. We'll go into how the lineage defines that in a little bit. So, three essential things. As I mentioned, there is an OpenPower Foundation, at openpowerfoundation.org. The website there is where they track the members, and basically all the designs and things end up in the OpenPower Foundation, hopefully available on the website. I mentioned earlier that OpenPower is an architecture as well, the core of which is Power8. There have been a few talks on Power8, at both this LCA and previous ones. If you haven't seen Paul Mackerras's one from earlier this week, do go and download it. The first OpenPower machines are based on Power8 processors. They're quite interesting from a CPU architecture point of view.
We have up to 96 threads per socket, some interesting transactional memory capabilities, and a new feature called CAPI, where devices can participate in the coherency protocol between memory and CPU. So some interesting things there. But we're concentrating on the Linux side now: what does Power8 mean for the software stack? To find out, we'll step back in time a little bit here, back to the pre-Power4 days, when the Earth's magma was still cooling. The architecture defines what we call the machine state register. All Power machines have this machine state register, or MSR; it's described by the Power architecture definition, and it defines what state the machine is in, funnily enough. Things like whether the floating point unit is available, or whether external exceptions will interrupt the processor. One of the bits in there is called the PR bit, for problem state. If that bit is one, you're in problem state mode, which is generally user space; if it's zero, you're in supervisor mode, which is the operating system. This bit controls whether you can do certain things, like execute privileged instructions or access certain bits of memory. If it's zero, you can access memory with translation turned off, and hence access all of memory. If it's one, you're under the usual constraints of running a user space process: you have only a certain set of translations, and you can't execute arbitrary instructions, you have a smaller set to execute. So that's the pre-Power4 days. Sometime around Power4, the hardware folks introduced a new bit into the MSR, called HV, for hypervisor. And like the PR bit, it controlled whether you have access to certain resources, certain instructions, or certain translation mappings. So again, in the table here: if you're running with HV equal to one, you're in hypervisor mode.
If you're running with HV equal to zero, you're in supervisor mode; and if PR is one, you're in user space. However, the hardware folks didn't really tell anyone outside of IBM at the time, so we didn't really have this HV-equals-one mode available. Well, the bit was listed in the Power architecture, but just as a reserved bit; there was no mention of hypervisors or anything like that. So it wasn't really available for general use. Now, this wasn't a huge problem. This was way before we had any open source hypervisor code to run, before anything like KVM was out, and before hypervisors were a huge thing in our world. So essentially this meant that Linux running on Power would always run in what would be a guest environment. And if Linux, or your operating system, isn't using it, what's it for? We have this thing called PowerVM. Well, it's called PowerVM now; it may have been called something else back in those days: the Power Hypervisor. Right. And what's it doing? Originally we had a very light hypervisor, not necessarily visible. Again, because we didn't see this thing described in the architecture, it wasn't really visible. Essentially, we were running on a system with the usual PR-equals-one, PR-equals-zero modes, except we had this one bit in the register which was marked reserved, always set to zero, because we were always running in non-HV mode. This meant that all the benchmarks we'd run on Power hardware were always running with this very light hypervisor underneath. And no one really objected to it: there wasn't a huge effect on the system, and we got quite good performance even with this tiny hypervisor running, running our normal OS in what would be a guest mode. Over the generations of Power systems, the hypervisor grew more functionality and became more of a complete virtualization solution.
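To make that privilege table concrete, here is a toy decode of how the two MSR bits combine. This is a sketch for illustration only: the real Power ISA assigns specific architectural bit positions to PR and HV, and the masks used here are placeholders, not the real values.

```python
# Toy decode of the two MSR bits discussed above. MSR_HV and MSR_PR are
# illustrative masks only; the Power ISA defines the real bit positions.
MSR_PR = 1 << 0  # problem state (user space) bit
MSR_HV = 1 << 1  # hypervisor state bit (introduced around Power4)

def privilege_mode(msr):
    """Map the PR/HV combination to the privilege level it selects."""
    if msr & MSR_PR:
        # PR=1 always means problem state, regardless of HV
        return "problem (user space)"
    if msr & MSR_HV:
        return "hypervisor"
    return "supervisor (operating system)"
```

Before OpenPower, Linux always ran with HV equal to zero, so only the supervisor and problem rows of this decode were ever reachable by the OS; on Power8 OpenPower machines, Linux boots with HV equal to one.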
So you could buy one of these Power machines and it was pretty much ready to go for running guest operating systems; you could put multiple OSes on it, running at the same time. But of course this hypervisor was IBM-internal, which meant that you weren't writing code for the hypervisor, you were writing code for the guests all the time. The management systems for this were proprietary. You'd get these great bits of software where you could drag and drop guest OSes around and that sort of thing, but it was all proprietary. And there was no way to run any Linux code, or any code of your own, in this HV-equals-one mode. Again, that wasn't a huge problem at first; there wasn't a lot of code that you'd want to run in HV-equals-one mode. But over the generations of Power processors, more and more open source virtualization solutions like KVM were coming out, and we'd like to run those. We have a system that's capable of this hypervisor mode; we have a hypervisor; we'd like to put the two together and make a nice system. So the big change for Power8 is that we no longer have this hypervisor layer between Linux and the hardware. There's nothing separating the OS from the hardware, no higher-privileged code than the OS, nothing that's going to steal your cycles, and no bits of hardware that you can't access directly from the OS. That's a big thing for us. We like having access to the entire machine; we want to know what's running on the machine, and we want complete control of what our systems are running. Using this, we also have access to all the functionality that comes with hypervisor mode, so we can use KVM and the facilities of the architecture to implement a Linux-based hypervisor. Well, it's not exactly like this. We do need a little bit of firmware.
But in this case, the firmware is less a separation of OS from hardware and more just a bit of a hardware abstraction layer. I'll go into that in a little bit. So I guess it's similar to what you'd think of as an x86 machine, where you don't have a giant bit of software sitting between the OS and the hardware. Except now on x86 we're seeing exactly that happen, where there is a higher-privileged layer of code in the BIOS doing things like handling certain interrupts that never go to the OS. It will receive an SMI, a system management interrupt, go into system management mode, execute some magic code that you have no idea about, and then possibly return to the OS, or elsewhere. So we've gone to this model where we no longer have a hypervisor: we have Linux running in HV-equals-one mode, and a bit of firmware to help that out. But the important thing now is that it's all open source. The entire stack that you run on one of these machines is verifiable, it's editable. You can see what software your machine is running from the first instruction onwards, which is really important for us, especially after what Matthew told us at last LCA about the importance of knowing what you're running, being able to modify it, and making the system do what you want it to do. So again: you buy an OpenPower system, and the first instruction run from boot onwards is yours. As well as being open source, it's available for others to implement. We have the source repositories on GitHub, and part of the OpenPower architecture being licensable is that others can implement OpenPower hardware, which is also a big thing for us at IBM. The first example here is Google's announcement, which I think is getting on a bit now, of their custom OpenPower board based on two Power8 CPUs. So they designed the board.
They also modified the OpenPower firmware to accommodate the hardware that they'd designed, and built an OpenPower system in what we think is pretty quick time. And because they were able to modify both the hardware and the firmware, they customized the hardware itself to their workload, which I assume they're pretty happy about. Another design based on this is the Tyan OpenPower reference system. This is something you can buy online at the moment, I believe; I haven't tried, but it's there. It's probably less expensive than traditional IBM Power hardware. It is listed as experimental, so there will be bugs, and there will be things that aren't implemented yet, because the firmware is still being developed. One of the interesting things about this platform is that it's based on a standard ATX motherboard. So in theory you could pull it out of that server, put it into your desk-side system, and have something really noisy next to you. Maybe Tyan will sell the motherboard separately, I'm not sure, but interestingly we're moving towards a bit more of a commodity part in the Power design, which was unheard of in the Power7 days. It's also interesting for us that this is the first time we've had non-IBM companies releasing Power hardware, which is great. So, a bit about the implementation: what we've been doing over the last year or so, and what our firmware looks like. Firstly, the bits we need to build. When you turn on your Power machine, the first thing that starts executing (well, there are some little bits before it, but the main thing) is Hostboot. This is responsible for early hardware initialization, specifically bringing up the caches, the memory, the clocks, the timing systems and whatnot. It's a fairly self-contained thing: it executes entirely, and then passes control entirely on to this new project called Skiboot.
Hostboot has finished executing entirely by the time Skiboot starts, so the system has been handed over to Skiboot at that point. Skiboot does further machine initialization, some of the things Hostboot didn't do: basically more of the I/O side of things, some initialization for PCI devices. It also provides the runtime interface for these machines. So again, we have a workload, we have an OS, and we have this firmware off to the side; that runtime firmware is Skiboot. And it's less a firmware than just a hardware abstraction layer, in that it doesn't run at any extra privilege level above the kernel, and it doesn't receive any interrupts. All interrupts are routed directly to Linux, and if the firmware needs to handle an interrupt, Linux passes control back into the firmware to do what it needs to do. So its only entry point is a call from Linux. And that entry point is defined by a new firmware-OS interface. It's not an x86 PC BIOS, it's not a UEFI interface; it's basically just a set of function calls with a defined API. So it's really no different from having a little library that your operating system can call for certain hardware-dependent features. As fairly regular PowerPC firmware, it is device-tree-based. A device tree allows us to describe the running hardware to Linux; Linux parses that device tree to find out what the hardware looks like, and probes the hardware based on the information in it. Skiboot is very small. Our plan is for it to be as compact as possible and to steal as few cycles as possible. Or rather, it doesn't steal cycles at all: you give it cycles, because you call it from Linux. In that vein, we haven't implemented an operating system loader.
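To make the device-tree idea concrete, here is a minimal sketch of firmware describing hardware as a tree of named nodes with properties, which the OS then walks to decide what drivers to probe. The node names, properties, and "compatible" strings below are invented for illustration, not taken from a real OpenPower device tree.

```python
# A device tree is just a tree of named nodes with properties: the
# firmware builds it, the OS walks it. Node names and "compatible"
# strings here are made up for illustration.
device_tree = {
    "cpus": {
        "cpu@0": {"compatible": "ibm,power8", "threads": 8},
    },
    "memory@0": {"reg": (0x0, 0x2_0000_0000)},  # (base, size)
    "pci@800": {"compatible": "ibm,ioda2-phb"},
}

def find_compatible(tree, compat, path=""):
    """Collect the paths of all nodes whose 'compatible' matches,
    the way an OS matches drivers against device-tree nodes."""
    matches = []
    for name, node in tree.items():
        if not isinstance(node, dict):
            continue  # a property, not a child node
        if node.get("compatible") == compat:
            matches.append(f"{path}/{name}")
        children = {k: v for k, v in node.items() if isinstance(v, dict)}
        matches += find_compatible(children, compat, f"{path}/{name}")
    return matches
```

The real flattened device tree format and OF-style bindings are richer than this, but the OS-side pattern is the same: walk the tree, match "compatible" strings, bind drivers.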
So Skiboot doesn't have any functionality to execute your GRUB or your U-Boot or whatever else, because that would require reading GRUB from disk, or implementing a network stack to get a loader onto the system. Instead of implementing all that in the firmware, what we've done is use Linux as our operating system loader. Basically, the first thing Skiboot executes when it's finished its initialization is a little embedded kernel that's flashed onto the hardware itself. This kernel contains a little initramfs with some user space utilities that perform the functions of a boot loader, doing what you'd expect to be done to pass control to an operating system. That means we use a standard Linux kernel for our boot loader. We have the standard Linux network stack, so we don't have to write our own stack in firmware. We have all the standard Linux device drivers, so we don't have to re-implement device drivers in firmware for booting. We just have a little Linux kernel, plus a little bit of user space glue, that loads the final operating system kernel and then kexecs into it: into your actual Ubuntu or Fedora or whatever. So essentially our BIOS interface is a user space application. The UI that you use to set your boot order or your network boot or whatever is just implemented as an ncurses application running on the Linux embedded in your flash. In that regard, our firmware doesn't need a lot of functionality of its own; it's provided by these bits. One of the justifications for doing this was bring-up time: there's no way we could write an entire firmware with a network stack and device drivers for everything you'd want to boot from. We can just use what's there in Linux and go straight from there. Another one of the components we've worked on is the platform port: the PowerNV platform port, NV for non-virtualized.
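The boot flow just described, a small kernel in flash whose user space finds the real OS kernel and kexecs into it, can be sketched roughly as follows. The device descriptions, the probe order, and the `kexec` stub are all stand-ins for illustration; the real implementation is Petitboot's parsers plus the actual kexec system call.

```python
# Rough sketch of the Petitboot-style boot flow: the embedded Linux's
# user space scans devices for a bootable kernel, then kexecs into it.
# Device contents and the kexec() stub are invented for illustration.

def kexec(kernel):
    """Stand-in for the real kexec handoff: boot the chosen kernel."""
    return f"booted {kernel}"

def find_boot_kernel(devices):
    """Return the first kernel found, scanning devices in probe order,
    the way the boot-loader user space walks disks and network sources."""
    for dev in devices:
        if dev.get("kernel"):
            return dev["kernel"]
    return None

def boot(devices):
    kernel = find_boot_kernel(devices)
    if kernel is None:
        # Nothing bootable found: fall back to the interactive menu
        return "dropped to boot menu"
    return kexec(kernel)
```

The point of the design is visible even in the sketch: everything here is ordinary user space code running on a stock kernel, so disk, network and device support come from Linux for free rather than being reimplemented in firmware.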
And this is the bit of Linux that interacts with our new firmware API provided by Skiboot. The API itself is called OPAL, the OpenPower Abstraction Layer, and it's the base for the PowerKVM hypervisor: KVM running on Linux uses these OPAL calls to interact with the hardware. The platform port is quite small. It wasn't a huge amount of code, and it's not hugely different from the other PowerPC platform code. Another bit of software that makes up an OpenPower firmware is the On-Chip Controller, or OCC, code. Also open source, this runs on a little CPU on your CPU that's responsible for thermal and electrical power management, and provides the hardware-specific back-end for our cpufreq driver: we basically set some registers, the OCC reads those, and it updates the system frequency accordingly. Another part of any server-class machine is some sort of management controller. We use a BMC from another company, and this provides the out-of-band management you'd expect: being able to turn your machine on and off over the network, get a console over the network, and all the sorts of things you'd expect from a server platform. It's fairly standard in the x86 world to be able to send IPMI commands to your server to start and stop it. It has a separate network interface with a separate MAC address from the rest of the system, so you can define a management network alongside your server's network. I think Matthew's got a talk coming up about IPMI, and he'll dig into this. So we have a BMC, which is what you'd expect on a server machine. And to tie it all together, we have some build infrastructure called op-build. This is responsible for getting all the bits of source we need to produce an OpenPower firmware, compiling them with a toolchain that it builds itself, and packaging it all into a single flash image that you can then burn to your machine.
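The cpufreq/OCC split described above, where the kernel's only job is to write a requested pstate into a register and a separate little controller applies it asynchronously, could be modelled like this. The register layout, pstate table, and class names are invented for illustration; the real interface is defined by the OCC firmware and the powernv cpufreq driver.

```python
# Toy model of the cpufreq back-end / OCC split: the kernel writes a
# desired pstate into a shared register; the OCC, a separate little CPU,
# reads it and actually changes the clocks. All names invented.

class PstateRegister:
    """Shared mailbox register between the kernel and the OCC."""
    def __init__(self):
        self.requested = None

class OCC:
    def __init__(self, pstate_table):
        self.pstate_table = pstate_table  # pstate -> frequency in MHz
        self.current_mhz = None

    def poll(self, reg):
        """OCC control loop: apply whatever the kernel last requested."""
        if reg.requested in self.pstate_table:
            self.current_mhz = self.pstate_table[reg.requested]

def cpufreq_set(reg, pstate):
    """Kernel cpufreq back-end: just set the register and return;
    the frequency change happens later, when the OCC notices."""
    reg.requested = pstate
```

Notice that `cpufreq_set` returns before the frequency actually changes; the asynchrony is the whole point of pushing the mechanism onto a dedicated on-chip controller.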
It's based on the existing Buildroot project, again an open source package-building system. And it constructs a few things: the Petitboot user space environment, which is our boot loader; a kernel to execute that; and all the firmware components that also need to go into flash. So it's all built in one step. Well, download and build. This is how you would build an OpenPower firmware, and it's all fairly conventional: you clone it from the repo, change into the directory, and then it has a menuconfig-style interface. We just load up the Palmetto defconfig for a Palmetto machine and then build; op-build is basically just an alias for make. And there we go. Some notes on the open source side of things here. We have ways to customize the build. We could, say, build a different version of Skiboot. I think 2.1 is completely imaginary at this point, but this is how you would specify a particular version of a particular package. Sorry? We started at 4, so don't do this. The idea here is that when you're building OpenPower firmware, you have complete control over which packages go into it. You also have complete control over where those packages are sourced from: you can point the op-build system at an existing tree and have it incorporate those sources rather than what we've defined already. There are many other ways to control the build, too. It's a standard sort of Buildroot environment; we've tacked some bits onto it, but there's nothing particularly special about it, so you can define lots of variables to, say, use this version, or use that tree, or use a custom set of patches you've provided, or just completely ignore the version control and build something you've hacked up yourself. So there's a lot in there. There's a bit of documentation in the GitHub pages about how you can go about doing that.
So yeah, basically what we have now is a firmware that you have complete control of, and a system that, if you like to hack on hardware, you could build yourself. That gives us a bit of a new way to build machines for custom data centres and custom cloud deployments, without having to redefine the computing industry. So I think we're pretty happy about what's happened so far. There's still a long way to go on polish, enabling new CPU features, and getting new hardware out there, but hopefully we've all learned to like firmware just a little bit more. I've got plenty of time for questions. Usually there are a few feature requests and things, so please ask away. Wow. Just a question on Petitboot: that used to have a graphical user interface. Have you kept that for some scenarios, or is it all just curses now? It's still in the tree; I haven't built it for a long time now. A bit of background here: Petitboot was originally built for the PS3, the PlayStation 3, a PowerPC machine with a little bit of flash and similar constraints. We wanted to be able to burn our kernel into flash and then boot the proper operating system from that. And we had a nice PS3-style UI that you could operate with a controller, based on Keith Packard's libtwin. We've only recently got more than a text console on these Power machines, so there hasn't been a lot of motivation to reinstate it. But it could well be reinstated. It's probably not the best thing in a server environment, where you want a low-bandwidth link to your management, but it could definitely be done if necessary. If you can implement a 3D one, that would be even cooler. Just curious: you mentioned that you're quite happy to have the full stack and to own the bare metal. Were customers asking for that a lot, or was it mostly developers saying "we want this because it would make our jobs easier"?
I might have to let Anton come in and correct me if I'm making anything up here, but that was a big feature in why people wanted to use something OpenPower-based. They can control the hardware, they can control the entire software stack, and they can not only customize it to what they want to do but know exactly what they're running. And latency is a huge thing in large workloads: your cloud, or whatever it is, is only going to operate as well as the worst-latency system you have. So the idea is that if there's something in the firmware that's adding latency, you can see what it's doing. You can't measure it otherwise? Yeah, if it's a black box, you're less able to measure, less able to verify where your latency comes from and why you're getting jitter on the OS. And also, just knowing what the hell is happening on your machines is a big thing. Anton, does that cover it? Yeah, I was just going to explain that what Jeremy's been talking about is the OpenPower machines that will be produced by vendors other than IBM, and perhaps by IBM in the future. The current Power8 machine that you can go and buy is not exactly like Jeremy described. It still has the OPAL firmware, but it has a service processor instead of the BMC, and the service processor will only flash an image that's been signed by IBM. So you don't have that amount of freedom to modify things on the IBM Power8 systems. However, the OPAL and Skiboot stuff is still open source, so you can get the source and see what's there. And it's still true that you're running like this picture, with the OS running directly on the bare hardware and this OPAL firmware layer providing a library of services, rather than being a layer that gets between you and the hardware.
Yeah, I just need to clarify or expand on something Paul said there: OpenPower machines don't encompass the entire set of Power8 machines. There is Power8 OpenPower and Power8 IBM Power. The OpenPower ones are implementable by anyone, but only IBM can sell an IBM Power machine. So what I've been describing here is the OpenPower side of things. IBM Power is still based on the open source components we've been talking about, but it comes with a slightly different management architecture that uses some more IBM components than these OpenPower machines would. Yeah, thank you. So, traditionally Power hardware has been tightly coupled with AIX. I'm curious, with this new firmware architecture, is AIX also running on it, or is there going to be a proprietary stack for those machines? So you're asking whether AIX runs on OpenPower machines? I haven't seen it running, and I don't know of any plans to make it run. The OpenPower Foundation is focused on Linux, so, speaking as Jeremy Kerr the conference attendee rather than Jeremy Kerr the IBM employee, I'd be very surprised if that were to happen. So that's probably a good point to make clear. AIX only runs as a guest: it requires a hypervisor, and only PowerVM is supported, so AIX is only supported on PowerVM at the moment. And PowerVM does not run on the OpenPower machines. On the IBM Power8 machines, the Linux-only ones, you currently get a selection in the service processor's web menu of whether you want to run OPAL and all the stuff we've been describing, or alternatively PowerVM: you can power on the machine with PowerVM and then you're just in the PowerVM world, like we have been for years. So that's the dichotomy. And if you're an AIX customer, then you're over on the PowerVM side.
Yeah, so the focus of OpenPower has very much been massive-scale Linux machines; it's your cathedral-versus-bazaar sort of thing. You have your cathedral of traditional IBM AIX and PowerVM machines, versus your bazaar of all the things we can implement on OpenPower using different hardware designs, different software designs, and things that you can control yourself. That's about the gist of it, as I understand it. If I wanted to build an OpenPower machine, do I have to have a whole bunch of IBM components, or is it fairly limited? I'm not in the hardware business myself, but I would say that you would at least need an IBM CPU, and then you could probably build the rest yourself. So... The question was whether Mikey has access to a foundry, and if so, can we come and visit? And are we selling licenses to implement a new Power8-compliant CPU in a non-IBM foundry? Yeah. The entire core is licensable. You can produce your own Mikey Power8 and go from there. I'm waiting for it with bated breath. I've got a few resources in the slides; the slides will be up with the conference agenda after the conference, I believe. So: our GitHub repo, our mailing list, and our little home in the kernel tree under platforms/powernv. Okay, no further questions. Thank you very much. I'll give you your obligatory... Thank you.