Welcome to another edition of RCE. I am Brock Palen. You can find the podcast online at rce-cast.com. We're also in the iTunes feed, and there's an RSS feed where you can find all the old shows as well. I have again Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, thanks again for your time.

Hey Brock, how's it going?

Good. Yeah, so again, we're doing another December recording here in the doldrums for the HPC world, since it's after Supercomputing and before the holidays, so we're squeezing another one in before the end of the year. And all of our usual contact information: you can follow me on Twitter at brockpalen, and you can find that linked off of the RCE web page, as well as my new blog and Jeff's blog, which is actually what got today's topic started.

Yes, it is. I saw an article pointed to by one of the IBM Twitter feeds, the Softtalk blog, an article comparing, well, it wasn't really so much comparing, it was talking about MCAPI, the Multicore Association Communication... I got that all wrong, but we'll get it right in a minute. And MPI. So of course that tweaked my interest. I went and read it; it was a very interesting article. It had slightly different takes on things than I did, so I actually wrote a response to it, and some good discussion came out of that. And I said, you know what? Why don't I just contact these guys? We should talk, because there are some interesting possibilities here between MCAPI and MPI. And we said, actually, why don't we follow that up with a podcast? So here we are; we've got two multicore guys on Skype with us. So Sven, I wonder if you could give yourself an introduction.

Sure, thank you. I'm Sven Brehmer. I'm the president and CEO of PolyCore Software, and we provide multicore solutions. I'm also a founding member of the Multicore Association as well as the chairman of the MCAPI working group.

Hi, I'm Ted. I'm vice president of sales at PolyCore Software, and I run sales and business development for PolyCore, and I've been with Sven many, many years at different companies. So we're interested in talking to you about MCAPI and MPI; what would you like us to talk about here?

Well, actually, first: what exactly is the Multicore Association, and what is PolyCore's involvement with it?

Sure, I can take that. The Multicore Association is an industry group that was created to enable the multicore ecosystem. I think it started in 2005. Markus Levy and I invited people to a meeting to talk about multicore, and the turnout was way beyond our expectation. That's how it started, and then we discussed what we could do to actually help the multicore community, and we decided to focus on standards and guidelines. That's about it.

So who all is involved in the MCA, the Multicore Association? What are your members like?

It's actually a broad group; it spans across the industries. We have semiconductor vendors, we have software vendors, we have EDA vendors, and we also have some of what I would call the consumers, not consumers like you and I, but companies that would actually be the beneficiaries of using the standards. And then I think we have about seven universities. So it spans the industry and it spans academia.

So you mentioned that one of your output products is standards. What kind of standards do you make, software standards or hardware standards?
It's primarily software, but I think it touches on the hardware in multiple places. The first one that we defined was MCAPI, which stands for Multicore Communications API, and it's now out in its second version. We also defined something called MRAPI, which stands for Multicore Resource Management API, and we are currently working on MTAPI, which stands for Multicore Task Management API. Then there is also a working group on virtualization, another one on tools infrastructure, and we also have one working on a multicore best-practices programming guide. If you look at MCAPI, MRAPI, and MTAPI, they are basically the foundation standards of the Multicore Association. You can use them individually or in combination, and if you use them in combination, some of the concepts are common between them, so you can get some benefits that way, but they can also be used separately.

So we actually want to discuss MCAPI and its components individually as a technology. What is it, and what does it aim to accomplish?

All right. So again, it's a Multicore Communications API, and it's meant to allow lightweight implementations, primarily focused on embedded systems. If we look at the API itself, it has a number of functional groups. It deals with node and endpoint management, and a node is defined as a thread of execution, which can take multiple forms; the intent is that an implementation can run in a full-fledged system with a process model, or down to a bare-metal system that doesn't have an operating system. There are also three types of communication functional groups: connectionless messages, connected packet channels, and scalar channels. Messages and packet channels come in blocking and non-blocking versions, and we also have functions to deal with non-blocking operations. It's also important that the standard was defined to be agnostic to the number of cores, the type of cores, the type of interconnects, and the type of operating systems, or, as I said, no operating system, so it should be possible to implement it in a very broad range of systems.

So then does it require some sort of hardware support? Like you said, it's focused toward embedded systems. If I was using this between a couple of FPGAs or something, what would be the requirement to make this work?

The standard does not imply any specific implementation, but certainly implementations need to deal with the hardware. I can give an example from our side: we have something called Poly-Platform, which is a programming platform for multicore, and we have support for this in Poly-Platform on the runtime side. You have to deal with, of course, if you have different operating systems, how you actually manage the communication across those different operating environments, and the same when you look at the interconnect. Many systems have shared memory, and in that case you have to deal with shared memory and find the most optimal way to communicate through that medium; or if you have something like RapidIO, where you have to move data across a wire, you have to deal with that. So that is left to the implementations of MCAPI; we did not specify in the standard how to deal with it.
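For readers who have not seen MCAPI code, a rough sketch of the connectionless-message group Sven describes might look like the following. The function names follow the MCAPI specification, but the exact signatures and constants vary between the 1.x and 2.x versions and between implementations, and the domain, node, and port numbers here are arbitrary, so treat this as illustrative rather than as a verbatim example.

```c
/* Illustrative MCAPI connectionless messaging between two nodes.
 * Function names follow the MCAPI spec, but signatures and constants
 * differ across versions/implementations; numbers here are made up. */
#include <mcapi.h>
#include <string.h>

#define MY_DOMAIN   0
#define MY_NODE     1
#define MY_PORT     10
#define PEER_NODE   2
#define PEER_PORT   20
/* Stand-in; the real "wait forever" timeout constant varies by implementation. */
#define TIMEOUT_FOREVER 0xFFFFFFFFu

void send_hello(void)
{
    mcapi_status_t   status;
    mcapi_info_t     info;
    mcapi_endpoint_t local, remote;
    const char       msg[] = "hello";

    /* Join the MCAPI domain as node MY_NODE.
     * Argument list abbreviated; see your implementation's mcapi.h. */
    mcapi_initialize(MY_DOMAIN, MY_NODE, NULL, NULL, &info, &status);

    /* Create a local endpoint and look up the peer's endpoint. */
    local  = mcapi_endpoint_create(MY_PORT, &status);
    remote = mcapi_endpoint_get(MY_DOMAIN, PEER_NODE, PEER_PORT,
                                TIMEOUT_FOREVER, &status);

    /* Connectionless, datagram-style message send (blocking form). */
    mcapi_msg_send(local, remote, (void *)msg, sizeof(msg),
                   /* priority */ 1, &status);

    mcapi_finalize(&status);
}
```

The packet-channel and scalar-channel groups Sven mentions follow the same endpoint model but require an explicit connect step before data flows.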
So that's interesting. By the name, one would assume multicore is within a single, shall we say, server, but you said across the wire as well. Does MCAPI also allow for communication outside of a server, across even something like Ethernet?

Yes, but with some qualification. The target space we are addressing is what we refer to as closely distributed multiprocessing, and that generally would be multiple cores on a chip, or multiple chips on a board, or any combination thereof. We can go across a wire, but in our target area we assume reliable transports. If you look at MPI, you can certainly have a much more dynamic type of environment; we assume a more static type of environment. But you can go across a wire, so we have drivers for running across TCP/IP, but then you have to deal with the transport maybe being there or maybe not, and that adds a little bit beyond what we have defined in MCAPI.

Just to add one thing: we also run across the wire where you may think of SRIO as being across the wire too, which would be more of a closely coupled system than TCP/IP.

So we're talking about communicating between machines, because we instantly go there; Jeff and I work most of the time in high-performance computing, and we're generally concerned with communication between large numbers of cores. At what level does MCAPI on a normal system work, though, when you're just between cores? Does it look like threads, does it look like OpenMP, or does it sit higher than that?

Again, it can come in multiple shapes. We defined a node as a thread of execution, which could even be a hardware accelerator if it can execute a thread of code. So it depends a little bit on the implementation, and in some cases you will find that it would be the equivalent of a process; in a general-purpose OS like Windows or Linux it would just be a process, whereas in a real-time system it may be a core, and in some implementations it could be a thread. So there is no one answer to that question, but I think one thing that is common to all those types of implementations is that they will touch the hardware at the interconnect, and you also need some kind of inter-core or inter-chip signaling, depending on whether you're on-chip or off-chip.

We might say it looks closer to MPI in the fact that it's message passing, versus some of the other standards that you mentioned, and that's why we look at MCAPI as being an extension of MPI, shall we say the last mile of MPI: it runs down in the very end nodes of a high-performance computing application. At least that's how we envision it. We designed it for closely distributed computing, so when you're in a multicore or, as Sven mentioned earlier, accelerator environment, you can think of accelerators as an array of DSPs or an FPGA. You're in a heterogeneous environment at that point, and that's where OpenMP, as we understand it, doesn't work as well. So MCAPI can run down on the heterogeneous side, where you might be running Linux on the main core or general-purpose core, and then run MCAPI on the Linux piece as well as on the arrays of DSPs or FPGAs.
Let me throw in one extra thing here too: MPI actually defines things similarly, in that all MPI communication is defined in terms of an MPI process, but the name "process" there is a little bit of a misnomer, because it's just whatever the implementation defines as a process. It could very well be a thread. Back in the early days of MPI there were implementations that did MPI processes as threads, but those have kind of faded out, and almost everybody has standardized on MPI processes as actual operating-system processes. But that seems to be a similarity, at least between our two standards, if not necessarily implementation-wise.

Okay, that's interesting. So it sounds like you're almost closer to the hardware, like it really is this embedded space, or things that look like an embedded space, where you've got different types of equipment you're going between, while MPI traditionally operates in an environment that's more homogeneous. Jeff will probably interject with other examples, but practically that seems like what happens a lot. So could this help me avoid the cudaMalloc/cudaMemcpy kind of thing? Would this be the standard way of getting data onto and doing work on a GPU or other types of accelerators?

Well, we are looking at GPUs, to be honest with you, but so far our implementations have been focused more on, I won't say general purpose, but DSPs, FPGAs, and main CPUs. I think maybe it would be, in a sense, like you said, because we're all familiar with the message-passing paradigm. MPI uses message passing to move data, and so does MCAPI, so programmers who are knowledgeable in the message-passing environment can use the same compilers and tools they're familiar with; it just adds an environment where you can move from homogeneous environments to heterogeneous environments. And as I mentioned, MCAPI does work well in an SMP environment too, a homogeneous environment; it's just that the MCAPI model allows you to extend past the homogeneous SMP environment into a heterogeneous AMP-type environment, and that's where we see the extension to MPI.

And what do you define as AMP?

Asymmetric multiprocessing, which is typically where you might have, let's say, Linux on one side and maybe a real-time operating system or no operating system on the receiving side, or what you might call the hardware accelerator side.

Okay, so, oh, go ahead.

No, I was just going to say, broadly we look at AMP as something you shouldn't equate with homogeneous or heterogeneous, even though there are connections; we look at it as whether you run an application symmetrically or asymmetrically, whether any core can do the job or whether you assign certain cores to do certain things.

So your use of the word core is not just a traditional x86 type of core; you're talking about any kind of core, like the FPGAs and DSPs you mentioned, right?

That is correct, yeah.
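To make Jeff's earlier point about the MPI process model concrete: every MPI program is written in terms of ranks within a communicator, and each rank is, in practice, usually an operating-system process. A standard minimal example:

```c
/* The classic MPI "process model" in miniature: each MPI process
 * (typically an OS process, as Jeff notes) gets a rank within
 * MPI_COMM_WORLD and communicates only by passing messages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us?  */

    if (rank == 0 && size > 1) {
        int value = 42;
        /* Rank 0 sends one integer to rank 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Structurally this is the same endpoint-and-message picture MCAPI uses, which is why the two conversations converge so easily.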
Okay, all right, well, we've been batting this around a bit; we've been talking a little about MPI and MCAPI. Let's jump right into the heart of the matter: this white paper that you guys published, what was its message? No pun intended.

So the intent was, we've had some conversations with various entities who were asking us about MCAPI in systems where they were using MPI, and I would say the driving force for those conversations was systems where you may have had, before, a few cores on the same chip, and the next generations would go to many cores, where you didn't need the richness and robustness of MPI on every core but you may want to use some of them as accelerators and have something more lightweight. That's what started the conversations, and then the article, or white paper, was published basically to propose a few different ways that we could see this being done. We wanted to, I guess, start a debate between the MPI community and the MCAPI community to see if there are opportunities. So it was kind of stirring the pot a little bit, and I think that's why we're here talking.

So when you're looking at using these technologies together, or when you're evaluating an application, do you see MCAPI's interaction with MPI being more as a competitor, or under the hood of MPI, or would an application developer use MPI and MCAPI in the same code base?

Well, at first glance it may look like they're overlapping, but I don't really see it that way. I think they use the same programming model but have some different characteristics, as we discussed before, that I think would allow you to use that same type of programming model in a broader scope of systems. So I see them as being complementary, and I can see MCAPI, as I think Ted described it, as the last mile. Another way to describe it would be to say that MPI is in the trunk and the branches, and MCAPI would be in the fingers, where you just use MCAPI on, or with, an MPI process to farm out workloads to other cores. That's how I see it, but I'm sure it could be done in other ways.

So then do you see the strengths of the MCAPI API being more its smallness and elegance, so that it can fit into an embedded environment, versus MPI, which is heavier weight with more bells and whistles?

Yes, I think that's a good characterization. I think MCAPI is targeted at systems where you may have severe resource constraints, and where you may again have heterogeneity at multiple levels: operating systems, types of cores, and different transports. So I think it could sit at the edges and provide acceleration for MPI applications.

Just one note: when the Multicore Association started MCAPI, they were targeting embedded systems developers, and that's the reason for the definitions that Sven gave earlier. But as Sven and I have been working through this, we've had conversations with embedded systems developers, of course, because that's where we're targeted, but the reason for this conversation is we've had discussions with developers in the high-performance computing world, and interestingly enough we're also having discussions with people in the desktop/server world. So I think what's interesting in our marketplace is that we're all struggling with, or looking for, the same solution to this problem of how to deal with multicore at the final edges, and how to deal with cores getting denser and denser, more cores on a chip. And we believe, like you said, message passing is the best way to go here,
because it scales indefinitely. So that's all I have to say on that part right now.

Okay, well, actually, let me ask you a question about message passing then, because I'm an MPI guy obviously, and that's my bailiwick as well. Common detractors say that it is ultra-complicated; they liken it to the assembly language of parallel programming, and ask why we aren't using higher-level concepts rather than plain vanilla message passing. What do you guys say to that?

Well, I look at it this way. Why are we having difficulties with multicore? I think there are multiple reasons. A fundamental one is that the software environments we're using were, for the most part, built with a single execution unit in mind, so we have sequential languages and tools. And I think also, with some exceptions, our vision is very parallel, but our thinking processes, human thinking processes, are often sequential, so we have a hard time grasping what's going on when a multitude of things are happening at the same time. So I think ultimately we probably would like to have higher-level concepts, but the reality is that we have the things we have now, and I think we will continue to have to use the tools and environments that we have today for quite some time. MPI has been around for quite some time, and MCAPI is starting to not be so young anymore, and it takes time for standards or new programming paradigms to be defined and to proliferate in the market. So I think we need to find ways to deal with things now and for the next several years.

If I could take your question and just spin it a little bit: you talked about new programming models or paradigms, and I think they'll always be evolving, but one way to approach the problem is to add tools to the environment to simplify the programming piece, and we see some tools in the modeling area that help us do this. So it's not that we have to go back and relearn new programming languages; if we can provide tools that make it easier for programmers to move applications onto multicore, that should solve the problem at this stage, and also give them a way to move forward quickly. That's where PolyCore Software focuses: we have MCAPI support in a tool called Poly-Platform, which is the tools and runtimes piece, and the tools actually simplify the development. Developers can enable their program with MCAPI through graphical interfaces, without really having to write any code; they can lay out the topology of their MCAPI system using graphical tools, and all the code, or the majority of the code, is generated for them. So you can move into a multicore environment without really becoming a multicore expert learning all about MCAPI.
And I think that's where your point about MPI comes back: it's seen as hard to learn, so if we had tools for MPI that made it easier to use and learn, I think that would solve the problem.

Interestingly enough, your answers are actually quite similar to the ones we give in the HPC and MPI realms. MPI, in the introduction to the standard, says: we fully admit that we're middleware, and we highly encourage upper-level tools to be built on top of us. That's our typical response: yes, a lot of applications write directly to MPI, but there's actually quite a lot of middleware built on top of it, or even applications built on top of MPI where the user has no clue; they just say "solve this FFT," a miracle occurs, and MPI happens to be used underneath. From their perspective, all they did was issue a serial call and magic occurred to make it happen in parallel.

Yep. One other thing to add: Ted mentioned that message passing scales, and I think MPI has proved that, because there are systems with many, many cores. The other part that I find appealing is that when you pass a message, you're handing off to another entity, which makes it somewhat easier to deal with synchronization issues. Plus, passing messages to communicate is something that's been done for a long time in programming.

So looking at a traditional system right now, where we've got 12 cores in a box, but looking forward, where we may have, let's say, 48 cores in a typical node, you see MPI kind of running on everything and then MCAPI inside the box to get good performance. Who do you see actually producing the MCAPI implementation? Would it come with your MPI stack, or would it be provided by your hardware manufacturer, so that MPI would plug into it?

That's a good question. As we said, we provide the tools and the runtime, and we are working with silicon vendors right now for enablement on their parts. Ideally it would run under the covers, just the way MPI does. I can't say I can think of only one way, but I think it would make sense that a semiconductor vendor either provides their own implementation or works with a vendor like us, and also provides an integration between MPI and MCAPI. As we discussed before, Jeff, that could be through a bridge, where you look at how you go from MPI to MCAPI and in the other direction, kind of routing between the two message-passing layers, so to speak. So we look at semiconductor vendors as our partners, because it's one thing to provide middleware that can do something, but if you want to do it really well, you typically need to work with the hardware vendor to find an optimal implementation.

Yeah, Jeff, I think right now we're typically working with the application developer, largely because it is an embedded world here.

So in this case, I would guess the application developer would have to know something about what's down underneath, or at least the systems architect putting together this high-performance computing model would have to know the architecture, know what's down at the end node, as we'll call it, so it can be configured to run, as you said, the high-performance algorithm, and it would be completely hidden from the developer, who just calls some routine that says, go do this.
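Sven's "trunk, branches, and fingers" picture and the MPI-to-MCAPI bridge described here are not defined by either standard today; purely as a thought experiment, a bridge inside a single MPI rank might look roughly like the sketch below. The MCAPI call names follow the specification loosely, and everything else (tags, buffer sizes, the endpoints handed in) is invented for illustration.

```c
/* Hypothetical "last mile" bridge: one MPI rank receives work over MPI,
 * farms it out to a local accelerator core over MCAPI, and returns the
 * result. Nothing like this is specified by either standard; MCAPI call
 * names are approximate and all parameters are placeholders. */
#include <mpi.h>
#include <mcapi.h>

#define WORK_TAG   1
#define RESULT_TAG 2

void bridge_loop(mcapi_endpoint_t my_ep, mcapi_endpoint_t accel_ep)
{
    double buf[1024];
    MPI_Status st;
    mcapi_status_t ms;
    size_t got;

    for (;;) {
        /* 1. Receive a chunk of work from any MPI rank ("the branches"). */
        MPI_Recv(buf, 1024, MPI_DOUBLE, MPI_ANY_SOURCE, WORK_TAG,
                 MPI_COMM_WORLD, &st);

        /* 2. Forward it to a local DSP/FPGA core over MCAPI ("the fingers"). */
        mcapi_msg_send(my_ep, accel_ep, buf, sizeof(buf), 1, &ms);

        /* 3. Wait for the accelerator's result and hand it back over MPI. */
        mcapi_msg_recv(my_ep, buf, sizeof(buf), &got, &ms);
        MPI_Send(buf, (int)(got / sizeof(double)), MPI_DOUBLE,
                 st.MPI_SOURCE, RESULT_TAG, MPI_COMM_WORLD);
    }
}
```

In this arrangement the MPI application never sees MCAPI at all, which matches the "hidden from the developer" scenario the hosts describe.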
All right, let's go off in a slightly different direction. After I posted my blog post, there were a number of comments put on by, admittedly, people in the MPI community, probably with very little knowledge of MCAPI or how it works. One of the things they were harping on from your original article was what they interpreted as an insinuation that MPI is inefficient inside of a server. Could you comment on that? Were you trying to imply that MPI is slower than MCAPI when we're talking between x86 cores, or what was your intent there? Because I think that was misinterpreted.

Yeah, no, I don't think we meant to imply that MCAPI was better than MPI in that respect. Let me first say this: moving data in shared memory, and I think that was the commentary about shared memory, really comes down to the fact that if you have to move the data, you have to do it either with a soft copy or with a DMA, and you may have to do some signaling between the cores, and you can probably do that equally well in implementations of either standard. I do think that, given that MPI is so rich, you may not need MPI on every core in an implementation, and as I said before, MCAPI could be in the fingers. I think that's really more what we were trying to say. And you may be able to use different types of transports; some systems have shared memory and other network-on-chip-type interconnects, and we can support that.

So one of my initial inklings was, is there something that MCAPI implementations do, or does its simplified API lend itself to being faster, in that there's less abstraction, perhaps, and less overhead mandated by the API, and therefore it can operate that much faster? But in the call we had a couple of days ago, before this podcast recording, we kind of came to the conclusion that the performance between x86 cores of MPI versus MCAPI is going to be roughly the same; both of us have taken a lot of pains to optimize those types of things, and really the opportunities for working together are more interesting than layering one on top of the other for the cases where we already overlap. Is that a correct characterization of what we talked about?

Yeah, I would say so, right.

Okay. And so, for example, one of the things that might very well be interesting is, like we touched on briefly before, acting as a router between MCAPI processes on back-end FPGAs or something like that, or vice versa, MPI running on back-end things but using MCAPI as a transport to get to these dedicated compute platforms. Part of the problem, though, is there's no idea whether there's a lot of demand for that, so it's kind of a chicken-and-egg problem; until people start asking for it, we probably wouldn't have the resources to go experiment with it.

Yeah, I think that's always the problem, Jeff. We're always in the chicken and egg, and the question is, if you build it, will they come? The reason we are even talking, and the reason for the article, is some of the discussions we've had with MPI people, who actually contacted us because they felt that as more cores are added, the MPI overheads come into play. I can't tell you if that's true or not, to be honest with you.
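Picking up Sven's point earlier in this exchange that a shared-memory transfer boils down to a soft copy (or a DMA) plus some inter-core signaling: here is a minimal, generic sketch of that pattern in portable C11. It is not MCAPI or MPI code; a real transport would add cache management, use DMA engines or hardware doorbells instead of busy-waiting, and handle more than one outstanding message.

```c
/* Bare-bones "soft copy plus signaling" over a shared-memory mailbox:
 * the sender copies a payload into shared memory and raises a flag;
 * the receiver polls the flag, copies the data out, and clears it. */
#include <stdatomic.h>
#include <string.h>

#define MAILBOX_SIZE 256

struct mailbox {
    _Atomic int full;            /* 0 = empty, 1 = payload ready */
    size_t      len;
    char        data[MAILBOX_SIZE];
};

/* Sender side: soft copy into shared memory, then signal. */
void mailbox_send(struct mailbox *mb, const void *payload, size_t len)
{
    while (atomic_load_explicit(&mb->full, memory_order_acquire))
        ;                               /* wait for consumer to drain  */
    memcpy(mb->data, payload, len);     /* the "soft copy"             */
    mb->len = len;
    atomic_store_explicit(&mb->full, 1, memory_order_release); /* signal */
}

/* Receiver side: wait for the signal, copy out, clear the flag. */
size_t mailbox_recv(struct mailbox *mb, void *out)
{
    while (!atomic_load_explicit(&mb->full, memory_order_acquire))
        ;
    size_t len = mb->len;
    memcpy(out, mb->data, len);
    atomic_store_explicit(&mb->full, 0, memory_order_release);
    return len;
}
```

Both MPI and MCAPI implementations end up doing some variant of this on shared-memory interconnects, which is why the two perform comparably between x86 cores.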
But I'll tell you that a lot of people, sorry, I have to speak up here, a lot of people in the MPI world are very concerned about that, and on the implementer side there is a lot of effort to make sure that does not happen, so that as we go up to 96 cores and beyond, MPI can be just as light and nimble as it is today on shared memory, if not more so.

Well, I won't say I agree, but that's why we got brought into this conversation. And the other part was, is it always going to be a homogeneous environment? Can we, should we, be adding processors, or I'll call them compute engines, at the very edges of the network that could improve performance for certain types of algorithms? We know we get better performance from DSPs and FPGAs for very specific arithmetic operations.

Yeah, that is a fascinating question. In the HPC community there are some very strong proponents of accelerators of different flavors, FPGAs, GPUs, and so on, and there are others saying, nope, too much trouble, or at least too much trouble for my application set, because homogeneous is so much easier, particularly given the complexity we already have; I just don't want to go heterogeneous. But others say, look, you build the right abstractions and heterogeneity is not that big of a deal, and it's a major win for my application set. So I agree with you that there definitely is opportunity to be had there. It needs to be quantified, perhaps, and broken down by market, but I believe that also would be an excellent opportunity point for both of us.

Yeah, I agree. Actually, that sounds like the speech we give, but I think there's some value to it. It's true, right, that as we move forward we just don't get that much more performance from density alone, and what are we going to do in the next leg to improve performance while keeping down all the costs that go into running computers, the heat, the energy, all that stuff? So is it just going to be 96 homogeneous cores, or 256, whatever it is, or will people be adding accelerators?

Yeah, I think what you said about too much complexity is very true, and what we see is that the barrier to multicore has been very high, at least for some types of applications; people have been sitting on the fence waiting for something to become better. And the utilization of what I would call acceleration, and I say that broadly, is only good if you have access to it. So we think that providing the programming models and tools platforms that lower the barrier to entry for those acceleration engines, whatever they are, is going to drive the adoption, because it's really hard to do some of these things, and so people choose the easy path until there is another path where the barrier is lower.

So let's talk a little bit about some of the implementations of MCAPI. You guys have a software stack right now; what about some of the embedded vendors with actual embedded hardware support? Who's actually got some of this stuff up and running?

Well, we're early in the game here, but the markets that are most interested, of course, are the telecom and networking markets. They've been using multiple processors, and some low-density multicores, all along, and they understand the problem of moving forward as well as anybody else. So we have the most traction right now in the
networking world, but we also see people calling us from all ranges, as we talked about before. Staying in the embedded world, we see areas like high-speed storage devices, or anywhere you see large amounts of data flowing through and having to be processed, so we see it in radar and military applications. We even see some areas in process control where they're not really using multicore just yet, but they have multiple boards and they're anticipating the move to multicore as they do more and more in their process-control area. So they're looking at MCAPI as a way they can change their programming paradigm now and then have multiple boards, multiple processors on a board, as well as multicores, in the future. So we're seeing quite a range. And as I said, suitable applications are basically those that have data streaming, and we see video and audio applications as being those types of applications also.

So what about cases like the large distributed sensor networks that are out there constantly collecting information? In the HPC world we've typically talked about ingesting the data from those systems, but a lot of those systems spend a huge amount of their overhead just collecting and transporting that data, and they do it using traditional TCP/IP right now. Do you see MCAPI going into these small sensor networks?

It's not an area where we've had too much action so far, but maybe it's something we should look at. Perhaps at the edge, where there might be some aggregation of the data prior to moving it to whoever is going to process it; maybe on the edge, that would be a good place for it.

Let's run back real quick to the standards side of things. What's coming up in the Multicore Association? You mentioned some of the other working groups and so on. Is MCAPI itself going through any evolution? You said it's on version two; is there going to be a version three? What kind of features are coming?

Yeah, so we are working on the next version of MCAPI. I would like it to be a version 2.x rather than a version three, because there's a lot of work going into it and I would like to see more evolutionary steps. We're looking at areas of improving the capabilities for buffer management and zero copy, and we're also looking at interoperability. We have a list of things, but those are the ones on the plate right now. Then, looking at the other standards, MTAPI is in the process of defining its first version, and MRAPI has already been released in its first version; there's no work going on in MRAPI at the moment, but we will have new versions coming out over time. And of course we also want to get as much input as possible from the community using the standard, so that's another reason to let version two sink in a little bit first before we move to the next version; it was released this spring.

So now that we've got a bunch of information about what this is, where can people go to get more information, or possibly get the spec?

Information about the Multicore Association can be found at their website, which is multicore-association.org, with a dash between "multicore" and "association," and they can also find information about the PolyCore Software products at
polycoresoftware.com, all one word, polycoresoftware. I think we also have some links on the MCAPI website, or the Multicore Association website, for products supporting MCAPI and some of the new things that are coming out that we've been supporting.

Okay, well, Ted, Sven, thanks a lot for your time. We sure appreciate it, gentlemen.

We appreciate your time. Thank you, Jeff; thank you, Brock.