Welcome to another edition of RCE. It's been a little while, but we're back for the new year. Yes, we're back, and again, you can always find old shows, nominate new ones, and see what we've got listed at rce-cast.com. You can also follow me personally on Twitter at @brockpalen, all one word, and you can find links to that on the website. You can also find a link to Jeff's blog off of the RCE website. So that is Jeff on the other end there; Jeff has been my co-host now for quite some time and has been a great help with this. So Jeff, thanks again for your time.

Welcome to 2012. We're, you know, comfortably in the second month. We were taking our January siesta; that's my story and I'm sticking to it. Yeah, recovering from all the holidays, post-SC, all that stuff. We're a little bit caught up now, so we can get back into the RCE.

So let me give a quick intro here to our guest today. Today we're talking to Shai Fultheim. He's the founder and CEO of ScaleMP. So Shai, could you introduce yourself?

Sure. Hi guys. So I'm Shai, founder and CEO of ScaleMP. I've been doing that for the last nine years. Prior to that I spent a couple of years in the VC industry, and prior to that I ran an IT operations and advanced technology research center for Israel Defense Force intelligence.

All right. Well, I understand that you've just come off about a 17-hour flight, so I think it's the perfect time for an interview, right? We'll get you at your absolute best.

Absolute best, with a little bit less oxygen in my brain.

Okay, well, why don't we roll right into this. Specifically, we're interested in the product that you guys offer at ScaleMP. I've actually personally used this a little bit. Jeff, I don't know if you have, but it's an interesting thing, and I get a lot of questions about it. So why don't you give us an overview: what is ScaleMP?

So in fact, ScaleMP is the name of the company.
Our product is vSMP Foundation. vSMP Foundation is virtualization software that virtualizes multiple systems into one virtual machine, providing you with access to all the resources within a single context. Not to be confused with virtualization such as that offered by VMware and KVM and others, which takes a single system and slices it into multiple VMs; our VMs provide you the aggregate resources of multiple independent servers.

So I sometimes refer to this as, like, inverse virtualization. How does this compare to other things like, you know, Scyld, bproc, or something like MOSIX?

Excellent question. In fact, those are attempts to address the fact that a collection of servers is hard to manage and requires a special programming model. Our approach is to hide the fact that your infrastructure has multiple servers: we virtualize them as a single one. Now that you have a single virtual machine running across all of them, you can run a single copy of the operating system, off the shelf I should say. You can run your process there, and your process can allocate four terabytes of memory if that's the aggregate RAM that you have in the underlying systems. If you use MOSIX or Scyld or bproc or you name it, you are basically getting a single address space, a namespace for processes or for files, but you're not getting a single copy of the operating system. And that's what you get when you have a single VM running across multiple systems.

Now, what's the difference between doing this and, say, buying a big iron machine? Many-core machines are getting more popular these days; I think a lot of vendors offer 64 cores these days, and you can even get quite a bit of RAM. What's the difference here? What's the value proposition?
I think there are three. First is price. Large machines are priced in what I'm calling a reverse dog food formula: when you go and buy dog food, if you buy one pound of dog food you'll pay X; if you buy five pounds of dog food, you'll pay less than five X. But when you buy big machines, you'll find out that a dual-socket server costs you $5K, a four-socket server, twice the resources, will cost you $25K, and an eight-socket server will cost you about $100K. So you double the compute and memory, but you quadruple the cost. With our technology, you just use off-the-shelf building blocks and build the system you want for your workload. So that's one, price.

Second is performance. You're not limited to the type of processor sitting in these high-end machines. For example, if you go to buy a four-socket or eight-socket machine today, it would be based on the Intel Xeon Westmere processor, Westmere-EX. If you do that with ScaleMP technology these days, you get a Sandy Bridge-based platform, so you get the latest and greatest CPUs. Not only that: even if you go for Westmere, with dual-socket machines you can get Westmere processors that go all the way to 3.46 GHz. So you get a real advantage in speed compared to what you get in four-socket and eight-socket machines.

Last but not least is flexibility. Let's assume that you already have your cluster; you're running your 200-, 400-, 600-, maybe thousand-node cluster, and you just need a large-memory resource for temporary use. Now you can take the software, allocate 20 machines, create a VM on an on-demand basis, run your workload, and when you are done, kill it and get the nodes back to the pool.

So obviously, if you're tying all these things together, you need a lot of communication between the systems so that they're kind of hidden. What kind of hardware requirements are required for this?
Any x86 server is what you need; the systems have to be connected with InfiniBand. There is a detailed list of systems that we have certified on our website; you can see there are dozens of servers from Dell, HP, IBM, Supermicro, Intel, Fujitsu, and others. In fact, if you have a server that's not on the website, there is a place to post the server information and get us to certify that machine. All in all, if you have an x86 system, whether that's a dual-socket, quad-socket, or eight-socket machine, and it's connected with InfiniBand, you have the basics and can go from there.

So do I have to patch Linux or anything like that, get a special kernel? Or is it like a hypervisor layer? What's actually involved in booting this thing up?

So it depends what you're trying to do. You can take any operating system and run it on the aggregate VM, as long as you're only using the I/O and the memory of all the nodes you have. If you're trying to boot an off-the-shelf operating system, take for example SUSE or Red Hat, on a machine that has a thousand cores, you'd better customize your kernel a bit. Not a lot of customization, a bit of it, and we provide it; you can download that off our website. So if you set up the VM to just aggregate the memory of all the nodes you have and use the CPUs of one machine, no problem: take it, you don't need to customize anything, the operating system can go as is. But if you try to run a high-core-count VM, think hundreds of cores, the best thing is to customize the kernel, and we can provide you that. Of course, the more advanced the kernel that you use, if you start moving into 3.2 and others, the less customization you need to do, because the kernel itself is becoming smarter.

That's great. So I'm going to jump back to something from a minute ago.
You said that you have to use InfiniBand. Let me ask you: why InfiniBand? And I'm asking from a little bit of a bias. I work for a company where we used to do InfiniBand, and we decided to not do InfiniBand, for a variety of reasons that are not too interesting. So why did you guys choose to use InfiniBand?

So, history of ScaleMP: our first product was running on Ethernet. In fact, there's nothing specific in InfiniBand that is a must for our product to work. Years ago we figured out that InfiniBand is the interconnect that gives us the highest bandwidth, and our product is basically using algorithms to translate latency into a bandwidth equation. Okay, so we take latency-sensitive operations and we translate them to be less sensitive, using a high degree of bandwidth, doing prefetch and caching of memory between boards. We needed an interconnect that would give us the highest degree of bandwidth, and today the interconnect that gives you that is InfiniBand. You know, InfiniBand on a Sandy Bridge machine is crossing 50 gigabits per second. That's really great.

Now, would there be any benefit in doing some kind of dedicated networking yourself, like a HyperTransport-based solution, or something that would extend QPI or the PCI bus, as opposed to, say, a commodity network solution?

There are discussions with some partners in the market on using PCI Express as a native interconnect, and discussions with others on 10-gigabit Ethernet as the interconnect. It's all possible. Again, there's nothing unique about InfiniBand compared to other interconnects that makes it a must to use, and I do believe that if you look 12 to 18 months into the future, you'll see customers, not in a lab but real customers, running vSMP Foundation across multiple systems that are connected with an interconnect that is not InfiniBand.

So let's talk about actually taking some of those big nodes that you said were expensive, and using ScaleMP on those.

Yeah, it became popular recently. So far we spoke quite a bit about technology, right? But customers use vSMP to do one of three things. Customers use it to simplify a small cluster. If you have distributed commercial software that needs to run on a small cluster, an eight-node or 16-node cluster, with our virtualization you take the cluster away. You eliminate that; now you have, like, a super workstation to manage. Typical customers in that domain use blades, dual-socket blades.

The second segment is customers that need the highest-end machines, and you're looking at customers that look into VMs with 500 cores, a thousand cores; customers looking to have machines with five terabytes and ten terabytes of memory. It's very typical that customers in that domain will use four-socket nodes as the building block, because if your target is to get to 10 terabytes of memory, there's nothing easier than taking ten 1-terabyte systems, or twenty half-terabyte systems, and using those as the building blocks. It's cheaper than any other technology, and in fact in these four-socket building blocks you have enough PCI slots to connect the machines with multiple independent InfiniBand links, and that improves the performance.
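To put rough numbers on the building-block arithmetic above, here is an illustrative Python sketch. The node sizes are the examples from the conversation, and the roughly-10% memory overhead figure is the one Shai quotes later in the episode; nothing here is vendor data.

```python
import math

# Illustrative sketch only: aggregate RAM from commodity building blocks,
# minus the ~10% vSMP overhead quoted later in the episode (3% software
# bookkeeping + 7% distributed cache). Sizes are round example numbers.

OVERHEAD = 0.10  # fraction of aggregate RAM consumed by vSMP itself

def usable_tb(n_nodes: int, node_tb: float) -> float:
    """Usable guest memory after the ~10% overhead."""
    return n_nodes * node_tb * (1.0 - OVERHEAD)

def nodes_for(target_tb: float, node_tb: float) -> int:
    """Whole nodes needed so the *usable* memory reaches target_tb."""
    return math.ceil(target_tb / (node_tb * (1.0 - OVERHEAD)))

# The 10 TB example from the discussion, built two ways:
for node_tb in (1.0, 0.5):
    n = nodes_for(10, node_tb)
    print(f"{n} x {node_tb} TB nodes -> {usable_tb(n, node_tb):.2f} TB usable")
```

Note that once overhead is counted, ten 1-terabyte nodes no longer quite reach 10 TB usable, which is why sizing in whole extra nodes matters.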
So we see quite a bit of those in that segment. Last but not least, we target the customers that have very large clusters, clusters that have hundreds of nodes, and we allow on-demand creation of SMPs. Coming from ROCKS or HP CMU, IBM xCAT, Bright Cluster Manager, any kind of provisioning system can go and provision vSMP on demand on an existing cluster, and typically in that environment you will see customers using high-density dual-socket systems.

Now, you've said an interesting thing a couple of times here that I want to get a little more detail on. You said that you can use this product to aggregate your memory across multiple nodes. When you do that, is the memory of the remote nodes essentially acting like swap for the master node? Is that what your algorithms do?

You can think about it that way. When you cache and prefetch data, there are three things you always need to think about. First, prefetch: you need to predict what's going to happen next and bring it in ahead of time. That will save time on the next cache miss. So that's prefetch. Then, when you prefetch, that means you have a copy of data on your node that may reside somewhere else. That's called caching, right? You cache data. So you prefetch the data if it's needed, and then you cache it, you hold a copy of the data on your node. And last but not least, you need to evict data. You need to evacuate data, and evicting data is also a pretty complicated task, because which data do you kick out of the node, what you're calling swapping, right, in order to free space for the next piece of data that you want to cache or prefetch?

So most of the algorithms in vSMP Foundation are in that area. What should I prefetch, i.e., what memory will be needed next that is not needed now? Next, how do I hold a copy of that and make sure the copy is coherent? And last but not least, when I have memory pressure, what is the right memory to kick out of my node, call it swap to use your term, and move to the other node, so that my operating system and the application that runs will have the easiest, best-performing experience when running on my node?

So what you're describing here is really just a lot of NUMA stuff, like some of the large iron machines out there where they've got cabling between IRUs and they bolt them all together. They have little caches on there where they can look up where certain things in memory are, and invalidate them in cache lines, and everything else like that. Is this some of the stuff that you guys are doing also?

Yes and no. Of course we expose NUMA topology and NUMA hierarchy to the operating system, and everyone knows how to look into the NUMA topology in the operating system, whether that's using operating system APIs or tools like hwloc and others. You can see the entire system architecture, and that's great. But there is one critical difference, and the critical difference is that in all these big iron NUMA systems, the NUMA topology is fixed. Imagine you build a function that describes the distance between a fixed CPU and a fixed memory location: for every memory location in the system, you will have a fixed distance. We can agree on that, right? In vSMP, the distance between a CPU and memory is not fixed. It changes over time. We move data, we migrate data, we change its locality.

So while we expose a NUMA architecture to the operating system, underneath what happens is that memory you may think is far will become close, and the other way around. Now, vSMP is smart enough to do that in a way that is beneficial for your application.

Okay, and that does smack very similarly of caching; even in regular NUMA, things that are far will temporarily become close, and vice versa. But you did mention one thing in there that is near and dear to my heart, which is hwloc. So does hwloc accurately report all the aggregate memory and CPUs of one of these machines?

Yes. I recall that we ran hwloc on a system that has 768 cores and six terabytes of memory, and we got a huge output from hwloc that described the hierarchy to the finest detail.

Excellent, that's very nice. Very good job, guys.

Good, I'm glad to hear it. Now, I want to go back to the question you asked earlier about the caches. You need to remember the difference in the size of the caches. When you look at a big iron machine, the size of a cache on a CPU, or even a cache on a board, is measured in megabytes, double-digit megabytes. The size of the cache in vSMP is measured in gigabytes. So it's a thousand times bigger, and that's what buys you the performance.

Okay, so a very coarse-grained description of this, skipping a lot of details and benefits, would be a gigabyte-sized cache?

Exactly. And a set of algorithms that you can run on a CPU. Don't forget, it's a virtual machine that runs on the CPU, so you have the ability to run very sophisticated algorithms that are very hard to do in silicon, in dedicated silicon. The kind of cache controllers on the big iron machines just handle LRU.
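As an editorial aside, here is a toy Python contrast between a plain LRU cache and a "pin the reused working set" policy, in the spirit of the scan-with-a-small-output-buffer scenario discussed here. The classes and numbers are invented for illustration; this is not ScaleMP's algorithm.

```python
from collections import OrderedDict, Counter

class LRUCache:
    """Plain LRU: always evict the least-recently-used page."""
    def __init__(self, capacity):
        self.capacity, self.pages, self.hits = capacity, OrderedDict(), 0
    def access(self, page):
        if page in self.pages:
            self.pages.move_to_end(page)
            self.hits += 1
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)   # evict the LRU victim
            self.pages[page] = True

class PinningLRU(LRUCache):
    """Toy working-set detector: a page that keeps missing (i.e. keeps
    coming back after eviction) gets pinned, so one-touch scan pages
    can never push it out."""
    def __init__(self, capacity, pin_after=2):
        super().__init__(capacity)
        self.reuse, self.pin_after, self.pinned = Counter(), pin_after, set()
    def access(self, page):
        if page in self.pages:
            self.hits += 1
            self.pages.move_to_end(page)
            return
        self.reuse[page] += 1
        if self.reuse[page] >= self.pin_after:
            self.pinned.add(page)
        if len(self.pages) >= self.capacity:
            for victim in self.pages:            # oldest first
                if victim not in self.pinned:
                    del self.pages[victim]
                    break
        self.pages[page] = True

# Workload from the discussion: a big sequential scan over fresh pages,
# with a small, constantly reused output buffer ("hot") mixed in.
def workload():
    scan = iter(range(100, 500))
    for _ in range(100):
        yield "hot"
        for _ in range(4):
            yield next(scan)

for cache in (LRUCache(4), PinningLRU(4)):
    for page in workload():
        cache.access(page)
    print(type(cache).__name__, "hits:", cache.hits)
```

With a capacity of 4 and four fresh scan pages between touches of the hot buffer, plain LRU evicts the hot page every round and never hits, while the pinning policy keeps it resident after detecting the reuse.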
That's what they do, and they do LRU very, very fast. But it's LRU, and LRU is not good enough for complicated workloads. The typical case to think about is a workload that scans a large amount of memory and has a small hot piece, like an output buffer, where you save temporary data. If you run something like that with LRU, the temporary data will always get kicked out of the cache once in a while. In vSMP, that is identified and locked to the board, and only the big data goes in and out of the board. So data-set detection, and other algorithms that help you maintain the cache at the best accuracy, are critical, definitely when you're handling a cache of that size.

Okay. Well, it strikes me that a critical difference here, though, is that a cache is normally outside of main memory, and so it doesn't count against your usable memory, for example. It sounds like what you're describing is sort of analogous, well, not entirely, but kind of analogous to RAID, RAID 5 or RAID 6 or something like that, where you're actually using some of the usable memory to temporarily, you know, make copies of other memory, so to speak. And so therefore your total usable memory of the aggregate is going to be less than the sum. Is that correct?

That's exactly the case. vSMP costs you, quote-unquote, ten percent of the memory, where the ten percent is split as follows. Thirty percent of that is the software itself: you know, you have a piece of software, it needs to run, maintain some tables and arrays and data structures, and that costs you about 3% of the total memory, which is 30% of the 10% I mentioned earlier. Seven percent is used as the basic cache, the distributed cache across the system, and that 7% is a very high number. It's a small fraction of your memory, but a very high number in gigabytes. Think about the following: if you take 32 machines, each of which has 64 gigabytes of memory, you have two terabytes of RAM, right? The 10% will be 200 gig, and 70% of the 200 gig will be 140 gig. 140 gig is a huge cache for your workload, and that's what gives us this excellent performance across the InfiniBand link, because the InfiniBand link is only used to prefetch data into the cache, which is pre-populated based on behavioral analysis of the application that happens at runtime, in the background.

So that cache is divided up across all the machines, though, and InfiniBand, as fast as it is, is still pretty slow compared to the inside of the box. What kind of consideration is done for things like process migration if you're running a threaded application? They're all designed around the assumption of "I have main-memory kinds of speeds."

So in fact, when you think about it, a single core today can access memory, for memory-only operations, at about 9 gigabytes per second. That's what you get when you run, let's say, the STREAM bandwidth benchmark on a single core: 9 gigabytes per second. An InfiniBand link today gets you about 3 gigabytes per second. So it's not orders of magnitude slower; it's a factor of three. The nice thing that we do at ScaleMP is predict what's going to miss next. When the CPU waits for memory, it's not waiting for memory bandwidth; it's waiting for memory, stalled because of the latency of reading from memory. So with vSMP, what we do is predict, based on the way the application behaves, what the next cache line needed will be, and we prefetch that over the InfiniBand link to the system. Once you have a cache miss, we'll try to predict what the next ten cache misses will be and bring those in. So next time that you need the memory,
it's already on your local motherboard. And therefore, while the bandwidth is important, what's more important is your ability to predict when the next time is that you're going to eat a latency, and to use the high degree of bandwidth you have with an interconnect such as InfiniBand to bring the data onto the board ahead of time.

And so a major part of your value proposition there, the secret sauce, is having better cache-guessing algorithms than what's in your Intel or AMD or whatever hardware, right?

Correct. And again, with all due respect to the amazing work done by Intel and AMD and others, there are things that cannot be done without understanding the application context and doing code scanning and stuff like that, which the typical cache controllers cannot do, and which hypervisors or virtual machine monitors can easily do, because we always have control of the processor and we know what else is running.

So a typical way to influence these kinds of things, like in the Open MPI project, is that we actually put cache compiler directives in our code itself, which give the compiler hints to help with the pipelining and the pre-caching and all the things we were talking about before. Do you have such APIs for your product as well, for applications to use?

Excellent question. So yes, we do. We recommend using them if, and only if, you know what you're doing; very similar to the example you gave before, if you give cache hints to the processor without knowing exactly what you're doing, you lose performance. So there are directives to tell the machine not to cache stuff, or to lock memory to a specific node, and there are customers using that. In fact, we use it ourselves: the optimized MPI library that we provide makes use of that.

The other interesting aspect is our real-time application profiler. Since we have a virtual machine monitor that tracks the way the application accesses memory and I/O, and the way it receives interrupts across all devices, we have a way to pull that out of the virtual machine with 3% overhead and then visualize it, and that lets you take an application and look into why it doesn't scale. One of the difficulties today is that it's very hard to get a complete system view of the way your application performs on a typical system, whether that's a dual-socket or four-socket or eight-socket system. You can look at the way things behave from a cache perspective with an application such as VTune, but trying to look at the overall load of an application on the memory bus or the I/O bus, or how many interrupts your application is getting, is very, very difficult. So we give customers the ability to run their application and see, from the profiler, how many interrupts there were across all CPUs, which CPU waited for which CPU, and which areas of the code, in a specific procedure, created the highest load on the memory bus. That gives you lots of insight into the way you wrote your applications. And in some cases, if you're really writing applications, because most people run applications that someone else wrote, but if you are writing your own application, it gives you the ability to optimize your application with closed-loop feedback from the machine itself.

That's very interesting. So now, you mentioned in there something else that's very near and dear to my heart: MPI. How are people with MPI applications using machines with your software?

So first, we have quite a few of those. The reason for people that use MPI to use vSMP is that we eliminate all the operational aspects of running on a cluster. You have a single operating system; you have single storage. Scratch storage is extremely fast, because you can see all the I/O, all the drives of the different nodes, in the operating system, and, let's say, do a RAID 0 to have a fast scratch. So from an operational perspective, think about it like a very large workstation. From an application perspective, the application sees one operating system, the application sees all the cores, and using our MPI stack, MPI performance is equal to a cluster of the same size and same hardware. So you don't have any kind of measurable difference in performance; it's very, very close. Of course it depends on the application: I can show you cases where an MPI application runs on the shared-memory machine up to 15% faster than on a cluster, and I can show you some cases where an MPI application runs on the shared-memory machine 5 to 10 percent slower compared to a cluster, just due to the nature of the application. So in that case, vSMP performs as close as possible to a cluster.

So from that perspective, then, does the MPI just view this as one big shared-memory machine, and it uses its shared-memory transport, not any network transport?

Exactly.
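As an aside, here is a minimal illustration of the shared-memory idea: when every rank sees one machine, passing a message is just a write into memory that the receiver reads back, with no network stack in the path. This toy uses Python threads as stand-in ranks; it is not MPICH's or ScaleMP's actual transport.

```python
# Toy sketch: two "ranks" exchange a message through shared memory
# instead of a network. Threads stand in for MPI ranks; the bytearray
# stands in for a shared-memory segment. Illustrative only.
import threading

shared_buf = bytearray(16)          # stands in for a shared-memory segment
ready = threading.Event()

def rank1_send():
    payload = b"hello"
    shared_buf[:len(payload)] = payload  # write payload directly in place
    ready.set()                          # flag that the message has landed

def rank0_recv() -> bytes:
    ready.wait()                         # block until the sender flags it
    return bytes(shared_buf[:5])

t = threading.Thread(target=rank1_send)
t.start()
msg = rank0_recv()
t.join()
print(msg.decode())
```

The point is simply that "send" and "receive" collapse into a memory copy plus a synchronization flag once both sides share one address space.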
So we recommend our customers use MPICH. We have a highly tuned version of MPICH2, if one wants to use that, and it's using Nemesis and, you know, shared-memory communication.

So when I tested this, we actually booted off of thumb drives. You gave us these mini thumb drives, and we stuck one in the internal USB port on every node, and it booted up, and then it went through normal PXE and put our load on it, and we modified our load to use your custom kernel. So that tells me this is all way below the operating system, and therefore it looks like you could support a lot of different operating systems. What do you support officially, though?

We officially support the Linux operating systems: Red Hat, SUSE, CentOS, Scientific Linux, you name it. Anything that runs a recent enough version of Linux, call it a 2.6.11 kernel and on, will be fine. It depends, of course, on the size of the virtual machine. If you want to run a virtual machine with a thousand cores, you'd better have a kernel that is, like, 2.6.30-plus. If you're just looking to run a virtual machine across four dual-socket nodes, you can use a pretty old kernel. And the community is making great progress in supporting larger and larger x86 shared-memory systems.

A typical question is: why not Windows? And the answer is that we haven't seen demand for a large-scale Windows virtual machine, and when such demand comes up, we'll be more than happy to go and do that. In fact, I'll just raise an interesting perspective here. You can take, think about it, 20 dual-socket machines with, let's say, 64 gigabytes each, and create from that a virtual machine: a virtual machine with 40 sockets, multiplied by six cores, eight cores, whatever you have, and the 64 gigabytes per node will give you 1.3 terabytes of memory. On that you run Red Hat; on Red Hat you run KVM; and in KVM you run, let's say, 200 VMs running Windows. It works very, very nicely, with excellent performance, and that is a good way for you to have a single instance running many, many VMs, rather than having 20 servers, each one of them running a few VMs, where you need to do live manual migration between nodes and take care of the shared storage. So having one big virtual machine, and on top of that running a partitioning hypervisor such as KVM, is one of the use cases for vSMP Foundation, and there are customers doing that.

So let me go a slightly different way here. Let me ask you about how locality is managed. When you're aggregating not just memory, but memory and CPUs, is there secret sauce in there that tries to keep everything local to itself? Can threads from one process migrate to a different server if their memory is over on a different server, for example?

Let me try to answer it like that. First, to the operating system we expose a NUMA machine. That's first. Then we track data locality: whether the execution of the application is exploiting the data locality to its maximum effect. Now, that's not always possible for the operating system and the application, because it may be that when you tried to allocate memory, it had to be allocated from a different NUMA node, just because your NUMA node was full. In that case, vSMP will move the memory that you're using and make it local again; the NUMA topology is not fixed, it's always changing. So we will move memory that is attached to CPU one and switch it with memory that is attached to CPU two, in order to make sure that CPU one has the greatest locality. And of course, if the thread is then moved by the operating system to another CPU, the memory will follow the thread.

Okay. That's one additional reason why, when you have a very, very large machine, you want to tune the scheduler a bit.
There are scheduler controls that you can affect via sysctl to make sure that the operating system moves your threads only when it's a must, and you use KMP_AFFINITY if you use OpenMP, in order to make sure that you know the affinity of your application. But that's, you know, a common technique for everyone that runs on a NUMA machine: try to make sure that you know what the execution affinity is. The unique thing about vSMP is that if you keep the execution affinity, we will optimize the data affinity to the CPUs, rather than relying on the operating system to allocate the memory close to the CPUs. You just make sure that the execution is fixed; we will then move the data to be as close as possible to your threads. So call it lightweight NUMA, in that way; you need to be less concerned about it.

So what's coming in the future for ScaleMP? What do you have on the roadmap?

So, support for larger and larger VMs; we always do that. We are looking into having support for GPUs somewhere towards the end of the year. We're speaking with different processor vendors, so you may see vSMP Foundation running on processors that are not x86 somewhere in the future. And of course support for other interconnects, not just InfiniBand, and integration into more and more provisioning systems to allow more dynamic virtual machines with control from the job scheduler, which we see customers doing today on a manual basis.

I would like to believe, based on work that we are doing with some vendors, that you will see, somewhere in the future, the ability to put your job in the job scheduler; the job is analyzed by the job scheduler; the job scheduler calls the provisioning system and creates a VM on the fly, at the size needed by the job that is in the job scheduler; the job is then submitted to the virtual machine, runs there, and then the VM is killed and the nodes are recycled for other purposes. There are customers doing that today using pre- and post-scripts that they made themselves, but there isn't a baked product from the job scheduler and provisioning systems. I see a couple of initiatives these days, and I want to believe that in the not-too-distant future you will see that kind of baseline offering from companies in that domain.

So what are the most unusual uses of vSMP that you've ever seen? You know, things that you didn't really envision people would do with it?

So, one of them I mentioned earlier is the use of vSMP to simplify a cluster, or for driving VMs; I mentioned that earlier. The other one, and it's unusual but starting to become usual, is analytics. We see more and more customers use vSMP to create large-memory systems for analytics, and in-memory databases is another hot word these days, big data problems. We never thought of ourselves as a company that solves those types of problems, but we're seeing it now, with the linear I/O performance vSMP provides: if you take ten systems, each with its own I/O subsystem, the aggregate storage performance you get is the performance of all of them.
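A toy sketch of the striping idea behind that linear I/O scaling, with invented numbers rather than ScaleMP benchmarks:

```python
# Toy sketch of why striping scratch across every node's drives (the
# software RAID 0 idea mentioned earlier) adds bandwidth linearly:
# block i of the volume lives on disk i % N, so a large sequential read
# keeps all N drives busy at once. Numbers are hypothetical examples.

def disk_for_block(block: int, n_disks: int) -> int:
    """RAID 0 placement: blocks go round-robin across the disks."""
    return block % n_disks

def aggregate_bw(n_disks: int, per_disk_mb_s: float) -> float:
    """Ideal linear scaling: N independent drives streaming together."""
    return n_disks * per_disk_mb_s

layout = [disk_for_block(b, 8) for b in range(16)]  # 8 nodes, 1 drive each
print("block -> disk:", layout)
print("ideal aggregate:", aggregate_bw(8, 150.0), "MB/s")
```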
In fact, vSMP is proven these days to provide, on eight dual-socket systems, 20% better performance than Fusion-io at a 40% reduced price. So we see customers building these; call it mid-size. You know, a mid-size ScaleMP machine these days is like a 4-terabyte machine that has a good 30 to 40 sockets, to run an in-memory database with a very, very fast I/O subsystem that is built by aggregating the drives on all the different nodes. That was not our intention when we started the company, but people are doing that.

Cool. So, you've been throwing around some big numbers here. What's the biggest vSMP installation you've seen, maybe in terms of RAM or number of processors or nodes or what have you?

So, unfortunately, not all of them are public.

Okay, sure. But the biggest one you can cite, then?

The largest public one that is running, and I think they issued the PR, it's not a governmental agency so they're allowed to, has seven terabytes and 768 cores; that's at C4-9 in Poland. At the San Diego Supercomputer Center, they have 32 virtual machines, each one with 512 cores and modest memory, about two to four terabytes per VM. And, just recently and not yet announced, we won a significant virtualization project in the UK, where a customer will have VMs with 512 to a thousand cores, and where the amount of memory per VM is four to eight terabytes. So those are on the larger scale side, but the typical customers are somewhere in the two-to-four-terabyte size and anywhere in the 200 to 400 cores. So that's kind of the mid-size customer.

Well, thanks a lot for your time. This has been informative. Where can we find more information about ScaleMP?

www.scalemp.com. You can find lots of information there.
There is a form on the website that allows you to submit your information, email and name, along with the specific areas you'd like to get more information on, and we'll send you that. You know, two months ago we just celebrated customer number 300, and we are growing pretty fast. So we are looking forward to having more customers joining the next generation of virtualization for large-scale machines.

All right, well, thank you very much.

Okay. Thank you. Cheers. Goodbye.