Welcome to another edition of RCE. This is Brock Palen, and again with me is Jeff Squyres from Cisco Systems and the Open MPI project. Jeff, welcome to the show.

Hey, Brock. How's it going?

All right. We have today a request we got a long time ago, when we first started this show, which was to talk about the OpenFabrics Enterprise Distribution, otherwise known as OFED. And we have with us two... are they release managers for OFED, or what exactly is their position in OFED?

I think I will defer to them on what their exact positions are in OpenFabrics and the OpenFabrics Enterprise Distribution. But yeah, I apologize to our listeners. It's taken us a long time to get to this one, and I don't think we even have a good excuse. It's just kind of taken us a long time to get here, these kinds of things. I even mention OpenFabrics on my blog periodically as well, to get the mandatory mention of the blog in there.

Well, I know we heavily rely on OpenFabrics to support several different vendors' equipment at the site I work at. But let's go ahead and introduce them. I'm going to have trouble with some of the names here, but we have Tziporet Koren from Mellanox, and she's actually located in Israel at Mellanox HQ. And we also have Betsy Zeller, who represents QLogic. Tziporet, how about you go ahead and introduce yourself a little bit and tell us a little bit about your position in OFED?

Hi, I'm Tziporet Koren, a senior director at Mellanox Technologies. As part of my responsibility at Mellanox, I'm responsible for the Linux driver in general and the OFED stack in particular. I've actually been doing the release coordination of OFED since '06, and since last year I'm also the co-chair of the EWG working group. So what I'm actually doing in OFED is managing the releases, and of course one of the people working with me a lot on this software stack is Betsy Zeller from QLogic.

Betsy, why don't you go ahead and introduce yourself also, and how you are involved with OFED?
I'm out here in Mountain View, California. I'm the director of software at QLogic, responsible for the InfiniBand software stack for our HCA product. I've been there for seven years, since we were originally PathScale and then were acquired by QLogic. Tziporet and I have been working together on OFED since pretty much the first meetings of it, when the InfiniBand community decided that what was needed was a shared, common, open-source stack for InfiniBand. That was back, I think, in 2005. Tziporet, do you remember?

Well, I'm not sure if it was '05 or '06, but I think the first release was in '06.

You know, I think you're right, it was '06. So it's been quite a while that we've been working together.

Before we continue: Tziporet, I apologize for butchering your name. You've told me a number of times and I still can't quite get it out right. But thanks again for bearing with us.

Well, as someone from Israel, I'm used to it.

Could you guys go ahead and explain what OFED stands for?

It's the OpenFabrics Enterprise Distribution, and the idea behind it is to take the code that is developed by different companies as an open-source effort and make it a distribution that will be good for the enterprise.
That was the main idea behind OFED, and I think it actually became a real success story in a way, because today the OFED acronym is used widely, and, what's more important, the software is used a lot, in many clusters and by many users.

Back, I guess, in Sonoma in 2006, there was an open-source OpenFabrics stack, but every customer was drawing from a different version of it, because there was no organized way to present the software. So a developer would be working on, oh, let's pick, you know, uDAPL. Somebody would want uDAPL and they would pull down that night's build, and then a different customer would want it on a different day and would pull down a different nightly build. uDAPL is maybe a bad example, because it always did very well at doing releases, but other software stacks were being pulled down by different people at different random times, and they would run into bugs. It wasn't a very organized situation. Our original goal was to deliver one set of software where all the pieces had actually been tested together and that was delivered as a unit. So that was the purpose of starting out with this, and as Tziporet said, it's actually had enormous success, maybe more than either of us thought it was going to have. People, you know, have really latched on to the open distributions.

So this is actually kind of fascinating: OpenFabrics as an organization ties together a bunch of different other organizations, for example Mellanox and QLogic, and Cisco is even a member as well. I have to say, as a little aside here, this is one of those interviews where I'm both an interviewer and an interviewee, because I've been working in OpenFabrics, and with Betsy and Tziporet, for quite a long time now. But for the most part I'll try to wear my interviewer hat here. I wonder if you could give us a little bit of the background, because it's a fascinating
blend of open source and competitors: Mellanox and QLogic are competitors in the InfiniBand space, yet you guys work together on a common code base. So could you tell us a little bit about how that works and how that dynamic generally works?

Sure, I'll take this one to start, Tziporet. It is a really interesting situation, because we are arch rivals in the InfiniBand space, and the short of it is that it works out incredibly well. Very, very early on, when we were working together, Tziporet and I said to one another that, you know, our companies are competitors, but our job in this instance is to help make the OpenFabrics Enterprise Distribution releases be successful. And so we've been working together on that, and it's actually been great. Tziporet and I try to get together at any conference that we're both at, to compare notes and talk about what we need to do for the project and for the software stack, and to try to back each other up and fill in for each other. But it was, I think, really important that we both decided that what mattered was for us to make the project successful.

Yeah, I totally agree here with Betsy. In the first meetings, and also when we had our first face-to-face meetings, we said that working on the software stack would always be a collaborative effort. We try to help one another, because our target here is that we have a common, good software stack. Most of the components of the software stack are actually the same, but of course there are different parts for each hardware vendor, and for those everybody tries to do the best they can. But we put the competition behind us when we talk about the software here.

Okay, so let me ask a follow-up question on this. OpenFabrics is more than just OFED itself. There's a whole pile of working groups and so on, and Tziporet, I think you made reference to one earlier.
You said EWG. Could you define what EWG is and what they do?

Well, EWG is the Enterprise Working Group, and actually the main thing they are doing is the Linux OFED stack. By the way, there is another working group for Windows that is providing the OFED for Windows software stack as well. There are also other groups, like marketing, and there is interop, which works on interoperability for the OpenFabrics software stack and of course for the hardware as well. And I don't remember, I think there's also a legal group, but I'm not sure about this.

For the rest of the show, I think we're going to focus on the OFED stack and EWG, which I think is of the most interest to our listener group. If not, hopefully they'll get a hold of us through the nomination form and let us know that they want to hear about some other things in the future. But going on from here: you two represent QLogic and Mellanox. What are some of the other members of the OFED working group currently?

Intel is certainly there. I'm going to go blank on this. Intel's there, IBM's there. You know, other companies are there and I can't remember their names offhand, but maybe Tziporet can.

Chelsio.

Oh, yeah, Chelsio.

Intel actually has, you can say, two hats in the EWG: there is a group working on uDAPL and, in general, management utilities, and now, with iWARP, they have their own low-level iWARP driver. There is Voltaire, who is also participating in several components and in testing. As Betsy mentioned, IBM. And of course there is Cisco, which is mainly Jeff with Open MPI. And also, from time to time, OSU, with the MVAPICH implementation, actually also participates in the EWG today.

So you mentioned a couple of network driver manufacturers. I guess it's not impossible, but most people, when they think of OFED, think of InfiniBand, and a lot of the people you mentioned there don't actually make any InfiniBand equipment.
What is their interest? Or does OFED support equipment besides InfiniBand?

So there are the iWARP vendors. Chelsio and Intel currently supply iWARP, and their interest is actually that RDMA technology will be available on their hardware as well, since OFED is not only InfiniBand: it provides an RDMA interface for applications or for libraries to work with. This is one kind of interest. There are other companies that are interested in helping with qualification of the OFED software stack, since it's part of their product, so this is another kind of participant. So these are the main companies that are actively participating in the EWG effort of producing this release.

So you mentioned RDMA technologies. Can you define what RDMA is and, you know, how OFED implements it? Maybe go into a little bit about operating system bypass and these kinds of things.

So in general, RDMA is remote DMA. It means that you can copy memory from one system to another system without actually interfering with the CPU working on each system. Instead, the NIC, or what in InfiniBand is called the HCA, is doing this work. The first thing it provides, I think, is CPU offload: instead of the CPU dealing with copying all the buffers from user space to the kernel, and later on from the kernel to the NIC that is going to copy them to the other side, everything is done by the HCA itself. This is the first thing that is important. The other thing that is important is the kernel bypass, meaning a user-space application can work directly with the HCA without doing any context switch, without the need to go to the kernel as in, for instance, the TCP/IP stack. And I think the main advantage of InfiniBand or iWARP, or RDMA technology in general, is this kernel bypass. Of course, in order to do it, the application must obey some prerequisites: doing memory registration and using a specific API that provides the RDMA technology. Or, in the case of MPI, actually the MPI layer implements
this RDMA technology, and the applications above are not aware of the low-level RDMA that is being done underneath MPI. But I'm sure Jeff can explain this in a better way.

Maybe I know a little bit about that, yes. But you stole my question from me. I was going to ask how some of the upper layers actually benefit from this. Why do we write MPIs to these lower layers? There is a particular API called verbs, so it's not just plain vanilla sockets. How are some of the benefits of verbs evident to things like MPI and other upper layers?

One thing I must say is that everybody can agree that verbs are not easy to write with. I think this is something that was not improved enough, and maybe something that will happen is that we will have some, I don't know, easier socket-like RDMA API. For now, what we have are the verbs. The idea of the verbs was defined in the IB spec, and the verbs provide a means for a user-space or kernel-space application to run InfiniBand or iWARP using the card. They enable the application to have, for instance, reliable connectivity. One of the main things is that the reliability is done here by the protocol and by the hardware and not by the software. So, assuming MPI or any other ULP needs reliable connectivity, it can open a connection that will be reliable, and unless it gets an error, it is ensured that anything that is sent from one side will get to the other side. So one of the benefits that the verbs layer provides, I think, is this reliability. Another thing is that it provides very low latency: on the one hand, we are talking about even sub-microsecond latency for small messages to get from one side to the other, and on the other hand, we have very high bandwidth for larger messages. What the latest numbers are today, I don't remember, actually, Jeff.
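For readers following along, the RDMA ideas described above (register memory first, then let the adapter place data directly into the remote buffer) can be pictured with a toy model. This is only a sketch in plain Python, not the verbs API: the `ToyHCA` class, its `register` and `rdma_write` methods, and the integer key are hypothetical names invented for illustration, and nothing here touches real hardware.

```python
class ToyHCA:
    """Stand-in for an adapter: only it reads or writes registered memory."""

    def __init__(self):
        self._regions = {}  # key -> bytearray of registered memory
        self._next_key = 1

    def register(self, size):
        """Register a buffer and return (key, buffer) for it."""
        key = self._next_key
        self._next_key += 1
        self._regions[key] = bytearray(size)
        return key, self._regions[key]

    def _place(self, key, offset, data):
        """Adapter-side placement of data into a registered buffer."""
        buf = self._regions.get(key)
        if buf is None:
            raise PermissionError("no memory registered under this key")
        if offset < 0 or offset + len(data) > len(buf):
            raise ValueError("write falls outside the registered buffer")
        buf[offset:offset + len(data)] = data

    def rdma_write(self, remote, remote_key, offset, data):
        """One-sided write: the remote application runs no code here."""
        remote._place(remote_key, offset, bytes(data))


# The receiver registers memory once and hands the key to the sender.
sender, receiver = ToyHCA(), ToyHCA()
key, buf = receiver.register(16)

# The sender's adapter then places data straight into that buffer.
sender.rdma_write(receiver, key, 0, b"hello")
print(bytes(buf[:5]))
```

The point of the model is the shape of the interaction: the receiving application does nothing at write time, and a write to memory that was never registered, or outside the registered range, is refused.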
Maybe you should say what's the best performance we get today.

The number, yeah... Suffice it to say that you can max out whatever your PCI bus is, because with QDR InfiniBand I think you can get a pretty high rate. I have not tested QDR InfiniBand myself, so I don't know exactly what those numbers are either. Embarrassing.

Yeah. Oh, I'd have to look. I guess I have numbers on... well.

Well, you made an interesting comment in there, saying that it's a hardware offload of everything. But both Mellanox makes an HCA and QLogic makes an HCA, and Betsy, QLogic kind of went the other direction with that. Could you describe what the technological difference is between your HCAs?

Sure. Way back when we started PathScale, we were working on this back in 2003. This was before there was much software available in open source, and what we knew people wanted to do was to run MPI really, really fast, with really low latency. We took a different approach. We designed a chip that was specifically targeted toward running MPI rather than verbs, and that honored the physical layers, but, you know, the hardware itself didn't handle... it didn't violate, but it didn't handle, various aspects of the IBTA spec. So we wrote something.
So we wrote something We implemented this So that we had a user-level software stack that talked directly to the chip We've got about 11 patents on the on the chip itself to run MPI through really fast um eventually Verbs, you know caught up to this you're asking about performance numbers on For bandwidth, I think either verbs or MPI over PSM and up with about 3.6 gigabytes per second unidirectional Bandwidth and both of them are around one my second latency zero packet zero packet size And But we took our approach to to target directly into high performance and also to avoid some of the issues with Scaling on infinite band Because the number requirement of number of qp's increases exponentially as the Um, Betsy you were describing that qlogic went the route of making a almost an MPI specific Uh infinite band adapter now what kind of equipment so I know of people running say luster On iB where luster supports infinite band and uses OFED for moving the data there So that's an example of a non MPI Use of OFED what some are some of the other non MPI Maybe hpc related maybe not hpc related where you see OFED being used So I should say at this point that we do have a full verb a full iBTA compliant verbs implementation On our product now it's but it's done in software So that you can do things like run luster run ip over iB and run all those other good things that one gets from From the open fabric stacks And other things include um Other things that run over verbs in our of interest include srp, which is a storage protocol Um, not many people are using sdp anymore now that ip over iB connected mode Um, you know get to the good bandwidth. 
So IP over IB is used as a replacement for TCP. You can communicate over that. uDAPL, which is used by Intel MPI: uDAPL also runs on top of verbs. Let's see, what else is there?

Oh, there is RDS from Oracle, which gives what they call reliable datagram sockets. This is used today in Oracle's Exadata and, as far as I know, gives about 10x performance versus other Oracle databases, and I think this is actually the first EDC-market application that is using OFED and the verbs layer. By the way, also regarding uDAPL: IBM DB2 is also using uDAPL, which is also something that is not HPC related. So there are some non-HPC applications starting to use OFED. I also know, from some customers or from the mailing list, that sometimes we can see people for whom low latency is important, or very high bandwidth, or especially both, who even write their own applications for verbs. Sometimes we get questions from customers in different companies, or in universities for research, who are writing their own applications using the verbs.

Let me ask a logical-progression question from that, then. Verbs has been around for a couple of years now. It's getting pretty mature, but it's also growing and changing. So let me ask each of you. Betsy, let's start with you: how do you guys see verbs changing, or OFED changing? I know there have been some MPI-specific additions and requests and whatnot, and MPI is a very obvious target, but it also seems like OFED wants to grow into more than just the HPC and MPI market. Betsy, what do you see
OFED and/or verbs doing to grow and expand and adapt to changing needs in these times?

You know, there does seem to be a real interest in InfiniBand for the enterprise market, and the things that matter for the enterprise market are, you know, reliability, failover, the ability to keep your system up and running but to maintain that high bandwidth and low latency. We've all heard about Wall Street as kind of a new field of high-performance computing, and it will be quite interesting to see how Wall Street evolves its use of the OpenFabrics software and the InfiniBand software. You know, they really, really, really care about that last 50 nanoseconds of latency, especially on IP over IB. What they used to be interested in was SDP, but that's no longer so interesting.

So just a clarification for everybody who is listening: Betsy, you really did say 50 nanoseconds, right?

I really did say it. You know, the number of times one of the engineers on my team has heard me tell them to go back and get me back that last 50 nanoseconds of latency... We finally have just put a time at the end of the release where we make sure that whatever was required to get that back is done. Low latency is really important, as is high bandwidth, and as is a good shape to the performance curve.

So, Tziporet, let me throw the same question at you.
Where do you see OFED going, and verbs growing, even in MPI kinds of realms? Because I know you guys have done some interesting things there.

Yeah, so I agree with Betsy regarding the financial market, and I'm sure more and different types of markets will follow. And I want to say that, in a way, we went in a different direction from QLogic. QLogic started from MPI only and then expanded to support all verbs. We started from supporting all verbs, and then we realized that there are some special MPI capabilities that we should provide to help MPI be more effective. For instance, things like collective offload, or maybe other tasks that are currently done by the MPI layer or done by the CPU, are something that we can now offload to our hardware. Another market, which is actually also related to HPC, is using the new GPUs. I think everybody knows that in big clusters the new graphics accelerators are actually being used to do some computations. So we have to learn how to use the GPUs as well, you know, how we can combine them with InfiniBand. This is also a new direction that is still related to the HPC market, and we are approaching it as well. And another thing that is interesting: maybe, I don't know, one day it will come that we will have some simple API to use RDMA. We have heard this request for a long time. We haven't started working on it, but it is something that we think about, so maybe one day it will happen.

Tziporet's right, the GPU technology is extremely exciting, and customers are very interested in that.

So, I must say, moving on to some other stuff: we install OFED using what Red Hat provides us, and OFED's pretty complicated, somewhat confusing compared to dealing with Ethernet and other networks that we've worked with before. Is there a definitive source of documentation, or just an OFED
for Dummies book out there somewhere, that kind of describes low-level driver, user-space driver, kernel driver?

There is not that kind of documentation out there on OFED. I think that's a great opportunity for somebody. There is documentation as part of the OFED download, in the docs directory, that talks about how to do the installation and talks about the various components in the software. I don't know, when you get it directly from Red Hat, whether you see that documentation or not.

So what is the recommended way to use OFED, then? Should we stick to using what Red Hat provides us, or should we be trying to use what comes off the OpenFabrics website?

It really depends on what your needs are. Taking it from Red Hat has the great advantage that Red Hat has packaged it up. You just install your system, and you're able to honor your support license with Red Hat because you're not making any changes to it. However, sometimes the Red Hat installations are rather behind the software that is available from the OpenFabrics website, because obviously Red Hat has to wait until we're done with it and then get the right release vehicle to put the changes into their system. So either one will work. If you want the newest stuff, you want to get it from the OpenFabrics website or from your vendor's site. But if you want it packaged up, and you don't mind the fact that maybe there are bug fixes available that you don't necessarily have in your Red Hat release, because it's based on an earlier version of the software, then Red Hat is probably a good solution for you.

Yeah, I agree with Betsy.
I also want to mention that Novell, with SLES, also has inbox support for the OpenFabrics software. And I must say that each vendor, I know QLogic has, and we have, and Voltaire has, its own, you can say, distribution of OFED that is already provided, for instance, as binary RPMs. One of the things with OFED is that we provide source RPMs, which means you have to compile everything yourself, and you must have the kernel sources installed, and the compilers, and all this, which I understand is not always easy for the end user, for the IT user. So we provide on our website binaries for selected distributions, the latest ones that customers actually need. Because, as Betsy mentioned, if what's inside the distro is good for you, this is actually the best. But maybe there is new hardware, or a new major feature, or a critical bug fix that is not there. Maybe it will be in the latest and greatest of the distro, but you are still using an older version of the operating system. I think one of the major benefits of OFED is the support for older versions of the distros. We have what we call the backports, a way to support older versions. For instance, today RHEL 5 Update 5 is the latest from Red Hat, but we still support RHEL 5 Update 3. So if someone wishes to stay with the older operating system but take the latest and greatest features and fixes, it's possible to do with OFED.

So that's a good lead into the next question here.
So I see on the OpenFabrics website that OFED 1.5.1 is the latest current stable release and 1.5.2 is in the works, and I've seen a couple of emails discussing what we should be doing for 1.6 and so on. Could you tell us what some of the new features are that are coming in future releases of OpenFabrics?

At least in 1.5.1, I think one of the important new features is the RoCE technology, which I think today may go by a slightly different name; I'm not sure what I should call it. But there is a spec called RoCE that enables the regular InfiniBand verbs to work over regular Ethernet. So this is one of the new features in 1.5.1, and this technology is these days starting to be integrated by Roland. We haven't mentioned him, but he's the RDMA subsystem maintainer in the Linux kernel, whom we work with a lot, and I think we should give him some credit in this talk as well. So this new technology is now being integrated into the Linux kernel. So this is for 1.5.1, along, of course, with the bug fixes and the changes that we are doing and support for more operating systems. In 1.5.2, at least from what we are working on, we are working very hard now to improve SDP. We have added SDP zero copy, which is very good for CPU utilization and large messages, and we're working to improve SDP stability. This is something that should be in 1.5.2. And as usual, we will support the latest distros, like RHEL 5 Update 5, and whenever SLES 11 Service Pack 1 comes out, we will support these two.

Tziporet, a quick clarification, because you're saying there were some nice improvements to SDP. Could you say exactly what SDP is?

Oh, SDP is the Sockets Direct Protocol.
It's a protocol that is defined in the InfiniBand spec for how to provide a TCP socket implementation in the kernel that actually bypasses the kernel TCP stack and does the work using RDMA technology. Betsy, maybe you wish to add something. I know you also had some new things from QLogic in the coming releases.

That's right. We're submitting our PSM layer, the Performance Scaled Messaging layer, which is the user-level stack that gives us our highly accelerated MPI performance, and we are releasing that to OpenFabrics with the 1.5.2 release, and we're quite excited about that. The other thing that's kind of interesting is that RDMA over Ethernet, if I'm correct on this, is actually based on the software implementation that QLogic did to implement verbs in software, and I think that ended up migrating to be used as the basis for RDMA over Ethernet. Is that accurate, Tziporet?

Well, there are some things there that can also be used. This is what we called, in a way, soft RoCE, which is if you want to use RoCE with any hardware. The RoCE implementation that at least Mellanox provided is actually running on our hardware; the hardware is doing it.

But, you know, the other was just an interesting example of how open source works: this software was released for one purpose and ended up migrating to kind of a different purpose for a different company. And that's one of the good examples of open source.

Yeah. And I think if we look more to the future, for things like 1.6, I guess we will start to see things like KVM support with SR-IOV, which is a new standard for supporting a lot of virtual machines. This is something that we currently don't have in OFED. Maybe more scalability solutions. Today the CPUs are multi-core.
Sorry. Yeah, CPUs are multi-core, and we see systems with more and more cores, and we have to see how to address these kinds of new technologies that are coming. Maybe improving things in the management, especially for the next, I don't remember how many, petaflop clusters they want there.

But the more the better, I think.

Yeah, yeah, but the large scale of the clusters also gives us some challenges in InfiniBand and in how to manage this network.

So what are some of the explicit uses of OFED that may be high-profile? We know that we build MPI stacks on top of it, and storage transport. What have been some of the feathers in your hat, sort of, that OFED enabled?

Let's see. Well, OFED, InfiniBand, whichever way you want to look at it: the government labs use InfiniBand, and they do interesting government lab work on it. Climate. We get a kick out of having Red Bull as a partner. I had engineers on the team who would listen to their races and then talk with the Red Bull people on Monday to help tune things up for them. Oil and gas is another interesting use, doing exploration for oil. Climate: there is one institution that is attempting to analyze all of the weather data that was gathered before Hurricane Katrina, to see if they could have actually predicted Hurricane Katrina beforehand, or its intensity. Obviously they predicted it, but, you know, predicted its intensity and what happened with it. So one of my favorite things about working in this space is that there are such interesting problems being worked on and solved.
And also, I think we can mention that if you look at the Top500 list of the fastest clusters, you can see more and more InfiniBand clusters there, especially if you look at the top 10 or top 100. There are many, and more are coming. Some names I remember are things like Roadrunner, or, I don't know if Jeff remembers the names of more very big clusters. Today, I think, we are working in China on a very interesting, very large cluster called Dawning. So I think you can see from the Top500 that most, or I think maybe all, of the InfiniBand clusters, but I'm not sure, are actually running the OFED software stack, coming from the distro or coming from OFED.

Okay, so that's a pretty broad spectrum of usages there. So then what is the licensing situation with OpenFabrics? I think it's kind of interesting. Why don't we spend a minute or two on that?

Okay, so when we started the OpenFabrics alliance, we decided that the code should have a dual license, which is a choice of GPL and BSD. The reason for it is that, on the one hand, the GPL was important in order to get into the Linux kernel, which was a requirement, and to be part of the distros. On the other hand, some companies want to take the software and change it and enhance it for their own internal usage, and not provide the code back as required by the GPL. Then they can choose to use it under BSD and only keep the copyright notice, as required.

Okay, well, thank you, both of you. Could you give some information about where to track down OFED? Website, mailing lists?

www.openfabrics.org. And there are user mailing lists there, where anyone who's trying to use OFED can get help.

Okay, well, thank you, both of you. This show will be up soon on rce-cast.com. Thank you.

Thanks.