Okay, I think we'll get started here — it's 9:40. I want to welcome you all to this discussion. My name is Bernie Wu, I'm with a company called MemVerge, and I'm going to talk about software-defined memory and how we've implemented it with Loki to run big-memory workloads.

Just a quick background on the company: we're based in San Jose, California, we've been around since 2017, and we're a software developer. So even though I'll call our solution a big-memory machine, it's a piece of software that runs in user space. Those are some of our investors.

The problem we're trying to solve is breaking down what people in the industry call the memory wall. The memory wall can be a bandwidth constraint — applications not having enough bandwidth to get data in fast enough — or it can be a capacity constraint. Both of those kinds of walls are what we're trying to address. Our goal is to free up these memory-bound applications, and since we're here in Berlin, I thought I'd use this graphic to emphasize that. We find that HPC and AI/ML applications in particular run into this memory wall.

Currently, most memory is accessed through the DDR bus in a general computer architecture, and that bandwidth has not kept up with the core counts of newer machines — that's been one issue. The other thing that's going on is much more diversification of compute. People are running GPUs, TPUs, DPUs, CPUs, FPGAs, all sorts of different accelerators, and they all need to access the same memory pool. Over time, memory itself is also becoming more diversified — you'll see in the next slide that memory is expanding far beyond just DDR-based DRAM.

Then, very significantly, later this year you'll start seeing what we call CXL memory. I think a couple of other talks here cover CXL, or at least mention it. It's a new industry standard — I think there are about 210 member companies — so basically the whole industry has finally agreed on a standard for running coherent memory across a PCIe fabric. The initial launch will be later this year with something called CXL 1.1. Over the next few years, I think we're going to see the disaggregation of memory from CPUs: new non-von-Neumann architectures where processors of various types all access a memory fabric and a memory pool. But the first step will be expansion memory running on a local PCIe Gen 5 bus.

So over time we expect memory to become a first-class citizen in the data center. Like software-defined storage and software-defined networking, memory will finally become software-defined.
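As a rough way to see the bandwidth side of that memory wall on an ordinary machine, here is a minimal sketch — my illustration, not MemVerge tooling — that times a large streaming copy with NumPy and reports effective memory bandwidth. The array size and iteration count are arbitrary choices:

```python
# Rough single-process memory-bandwidth probe (illustrative only).
import time
import numpy as np

N = 256 * 1024 * 1024             # 256M float32 elements ~= 1 GiB per array
a = np.ones(N, dtype=np.float32)
b = np.empty_like(a)

best = float("inf")
for _ in range(5):
    t0 = time.perf_counter()
    b[:] = a                      # streaming copy: one read + one write per element
    best = min(best, time.perf_counter() - t0)

bytes_moved = 2 * a.nbytes        # read a, write b
print(f"effective bandwidth: {bytes_moved / best / 1e9:.1f} GB/s")
```

Running several copies of this at once is a crude way to watch per-core bandwidth collapse as core counts scale past what the DDR bus can feed.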
When I say memory is getting more complicated, this is what things will look like by the end of the year. You may have heard of HBM — that's basically high-speed memory sitting literally right on top of the processor chip. Then there will continue to be DDR DRAM. There's already persistent memory from Intel, sometimes called Optane, that also runs on the DDR bus. And at the end of this year the first CXL-connected memory expansion cards will be available. So the memory hierarchy itself is going to get much more complicated, with latencies at the top maybe at the 10-nanosecond level, and at the bottom a CXL-connected memory expansion card expected to be the equivalent of one NUMA hop away from wherever your process is running — latency on the order of 100 to 200 nanoseconds.

To deal with this new hierarchy of memory, which is going to address both capacity and bandwidth, we need software. That's what MemVerge is working on: how to virtualize this memory, create different pools and different tiers based on latency distance, and operate underneath applications to profile their behavior and do what we call auto-tiering, where we promote and demote memory pages based on heat maps — I'll show you some examples later on.

We've also added some memory data services. One of the most important ones we see is snapshotting: we can now take a high-speed memory snapshot of an application's entire memory state plus the associated machine state. It can also be coordinated with data storage, so that we can move applications more quickly from one instance to another. This is something we're doing right now on the public cloud with things like Amazon Spot Fleet and spot instances.

This is the Memory Machine itself, our software stack. Let me start by saying that it runs on top of the Linux kernel, in user space, right below the applications. The applications we've focused on so far are the most memory-intensive ones, and a lot of those are in the HPC and AI/ML area. Again, we create this pooling and auto-tiering, and it's all designed to work transparently with your application — our goal is to make your application run better or faster without modification, and I'll show you examples in just a minute. Then there's the snapshotting capability I mentioned.

Besides running under specific applications, we can also run underneath VMs, or even the KVM hypervisor. So we see potential use cases — and this is where we want to explore partnerships with you folks — for doing quick rollbacks, perhaps during system maintenance operations, or when taking machines out of service, that kind of thing. I believe this memory snapshot tool can be valuable in shortening those outage windows.

We've also done work — you can see the red area on the diagram — to integrate with various orchestration platforms. We support Ansible; we've integrated with Slurm, LSF, and UGE, and directly with various storage systems. We haven't done Cinder yet, but that's probably next on the list. Then Amazon and Azure. And obviously we want to do more work coordinated with Loki — we do have things like Kubernetes operators and OpenShift certification, but there's a lot more work we need to do in that area.
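MemVerge hasn't published the internals of its auto-tiering, so purely as a sketch of the idea described above — promote hot pages to the fast tier, demote cold pages to the slow tier, driven by a decaying access heat map — here is a hypothetical toy model. The tier names, the DRAM budget, and the decay policy are all my assumptions:

```python
# Hypothetical model of heat-map auto-tiering (not MemVerge's implementation).
from collections import defaultdict

DRAM_BUDGET_PAGES = 2                    # toy fast-tier capacity

class AutoTierer:
    """Toy heat-map tierer: hot pages go to DRAM, cold pages to PMem."""

    def __init__(self):
        self.heat = defaultdict(float)   # page id -> decayed access count
        self.tier = {}                   # page id -> "dram" | "pmem"

    def touch(self, page):
        self.heat[page] += 1.0
        self.tier.setdefault(page, "pmem")   # new pages start in the slow tier

    def rebalance(self):
        for p in self.heat:                  # decay so the map tracks *recent* heat
            self.heat[p] *= 0.5
        ranked = sorted(self.heat, key=self.heat.get, reverse=True)
        for i, p in enumerate(ranked):       # hottest pages claim the DRAM budget
            self.tier[p] = "dram" if i < DRAM_BUDGET_PAGES else "pmem"

t = AutoTierer()
for p in [1, 1, 1, 2, 3, 5, 5, 5, 5, 6]:
    t.touch(p)
t.rebalance()
print(t.tier)    # pages 5 and 1 are promoted to DRAM; the rest stay in PMem
```

The real system presumably works on physical page ranges with hardware-assisted access sampling, but the promote/demote-against-a-budget shape is the same.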
These are examples of the use cases we've already uncovered with big memory, and again, most of them I would characterize as data-intensive, in the simulation, HPC, and AI/ML area.

In genomics we're doing a lot of work — I'll show you a couple of examples there. In EDA, electronic design automation, some verification jobs take a long time to run and use a lot of memory; some of them run for weeks. So we're not only adding big memory, we're also adding transparent memory snapshotting, so that if the system crashes, or you want to take advantage of lower-cost off-peak instances, we can help move the applications between those environments.

In AI/ML, a lot of GPUs are memory-starved in the sense that they can process faster than they can ingest data, so we act as a cache in front of them, especially for some of these larger image-based deep learning workloads. We've found use cases in media and entertainment — simulation tools like Houdini, Blender, things like that — some of which require large amounts of memory as well, or need to be rolled back when an artist wants to iterate on a design. In HPC there's physical modeling, quantum chemistry, that kind of thing. In FSI we've done some work — it's on our website — with people like Hazelcast, to speed up recovery time and recovery point objectives for in-memory databases, which is that next point. And we've just started working with some of the public clouds.

I'm hoping we can engage this community and add value here too. The value we see is that we can help densify both containers and virtual machines, lower the total cost of ownership, and lower power consumption — persistent memory takes a lot less power, maybe 10% of the power of DRAM, because it doesn't leak the way a DRAM capacitor does. And again, we can help bring HPC workloads into this community.

I'm just going to run through some examples of these use cases. Here we're running MySQL in a KVM environment. What we're showing on the y-axis is queries per second, across different mixes of PMem and DRAM — again, we're doing tiering. The first column, in all black, is 100% DRAM; the second column is 100% PMem; and the last four columns are different blends that fall in between. There may be a slight performance degradation with the blends, and in the last column we actually topped the DRAM-only result through a little bit of tuning. So that gives you an idea of how PMem and DRAM combine.
We actually topped it Through a little bit of tuning so that's just giving an idea of of of how PMM and DRAM are are associated now PMM is I think on the On the right side as I think is 300 nanoseconds 300 nanoseconds versus 100 nanoseconds so in some cases PMM is slower on the Right and then fast equivalent of speed on the reads I think When it comes to Snapshotting the other thing that's going on is the memory as memory gets bigger and bigger The outage that the recovery time to rewarm this memory replay back the logs or whatever you have to do to recover a system That's crash takes longer and longer and so we can be used in conjunction We can use a memory snapshot which uses a copy-on-write technology to Take a very very quick snapshot of the application and the memory and machine state And then we can either store that in persistent memory if it's available If it's not like in the case of something like an Amazon cloud We currently dump it into an S3 object, but we can do that asynchronously So we are not tying up the holding up production while we're draining it off to a the memory into an object So so we can really compress the recovery time objective make snapshotting more viable And this is important too because in the HPC world a lot of applications never were designed for checkpoint and recovery and And so they they just run in a crash you start all over again, and you may have lost 30 days Here we can prevent that and also from a sys admin standpoint We we found that there's a lot of people using systems where they need to clear off a bunch of Capacity so they can run a live experiment on this on this computing architecture and And so they so the sys admin wants to go in there and transparently shut down all a whole bunch of applications and Hibernate them somewhere and then bring them back after they've done their experiments so there's there's a variety of use cases like that for snapshotting and And again, this just illustrates how fast we we can restore this is a reddest database at 300 million keys It took 15 would take 15 minutes normally to drain the memory just to preserve this the state of this reddest database But if there's persistent memory available, it'd be done in a half a second This is another example we've done a lot of work in the in the genomics area in particular and so there's a lot of things like Single-cell RNA sequencing or population studies Where a huge amount of memory is needed a lot of times there's a matrix of the gene types on one Column and then on the rows or whatever are all the all those samples for different Population so in cancer research is a statistical game and so you have to mind this huge matrix and a Lot of times there's also issues with these instruments They have to be calibrated or you have to iterate a lot because the calibrations or parameters have to be tweaked so the the pipelines get very complicated and We're we've been able to and compress a lot of these pipelines down This is an example. 
This is another example. We've done a lot of work in the genomics area in particular — things like single-cell RNA sequencing and population studies, where a huge amount of memory is needed. A lot of times there's a matrix with the genotypes in the columns and all the samples from different populations in the rows. Cancer research is a statistical game, so you have to mine this huge matrix. A lot of times there are also issues with the instruments — they have to be calibrated, or you have to iterate a lot because the calibration parameters have to be tweaked — so the pipelines get very complicated. We've been able to compress a lot of these pipelines down.

In this example, what you're seeing on the y-axis is runtime in seconds — the higher the bar, the worse; lower is better. The black columns are the result after we helped adjust this pipeline. A lot of the trick to this is actually getting rid of I/O and leaving everything in memory. With a bigger pool of memory we don't need so many temp files and so much swapping, which consumes a huge amount of time. If you look at a lot of HPC workloads, sometimes 25 to 30 percent of the entire runtime is spent on I/O — I'll show you some profiles like that. So even though these are technically not big-memory applications, by giving them a big memory pool we can get rid of a lot of that I/O to disk or SSD and drastically compress the execution time. In this case we reduced execution time by about 60 percent.

And then there are certain use cases where you need to roll back, for tuning or debugging. I think there are also opportunities in the security sector to do forensics: you have a snapshot of the thing, so you can try to figure out what happened. So instant rollbacks, I think, are valuable. And because we're doing a true memory snapshot — not just freezing a system and draining it slowly — I think we can come up with new use cases, and I'd be interested if you have ideas as well.

Another example here is metagenomics. We'll be publishing a paper at ISMB next month — the premier computational biology conference, in Wisconsin. Metagenomics is the field of studying complex biomes: your stomach has hundreds of thousands of different unknown bacteria and critters in it, and the same goes for a sample of soil. We've mapped yeast DNA and human DNA, but there are a lot of unknowns out there that affect things like climate change and our digestive systems, so the area is very interesting to research. We've been able to run on a commodity Intel server what normally would have taken a much larger complex — maybe 50 to 100 nodes on a supercomputing cluster.

This particular workload is called metaSPAdes, and it creates a large graph. When you sequence a sample of DNA you only get segments of two or three hundred bases, all mixed up with segments from thousands of other organisms. It's like going into a library, taking all the books off the shelves, throwing them into a paper shredder, and then being handed the pile of shreds and told to put all the books back together. That's what this kind of application does, and it takes a long time — in this case 11 days. It likes a lot of memory because it's building a giant graph, and graphs are hard to partition, especially when you don't know anything about the graph in advance.

In this application we end up running at roughly a 20-to-1 ratio of gigabytes per core, and we're able to execute it on a standard — well, a high-end — 96-core Intel system using two terabytes of memory.
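To make that 20-to-1 figure concrete, here's the back-of-the-envelope sizing as a tiny script, using only the numbers quoted above:

```python
# Back-of-the-envelope sizing from the figures quoted in the talk.
cores = 96
gb_per_core = 20                  # ~20:1 GB-per-core ratio for metaSPAdes
needed_gb = cores * gb_per_core   # 1,920 GB
print(f"memory needed: {needed_gb} GB (~{needed_gb / 1024:.2f} TiB)")
# -> ~1.88 TiB, which lines up with the 2 TB system described above.
```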
These profiles just show how much of a beast this application is. The top one is CPU utilization — you can see it tops out all the cores. The next one shows the memory topping out at two terabytes, over time — you can see it cycle as it goes through its workload. The bottom two are disk reads and disk writes. We didn't even get around to optimizing away the disk reads and writes; we just ran the application as-is, and we were able to execute it, in some cases compressing the wall-clock time. More importantly, we're able to checkpoint it, because a lot of times something will crash out of some random situation, and because we have these checkpoints, we can recover the application.

Another example is in the automobile design area. There's an application called NASTRAN that's used for noise, vibration, and harshness (NVH) testing. This is an example where a good portion of the duty cycle — the first part of the graph, where that black line is — is all disk activity; roughly the first third of the run. Our profiler is aware of that, so it will actually allocate more memory to disk caching, and later on, when we find the workload isn't using the cache so much, we take that memory back and use it as DRAM tiered with PMem. You can see the last part of the run is basically flat usage of about 1.2 terabytes of memory — DRAM and PMem combined — just maxed out doing computations.

When you dissect that pipeline, there are about nine stages to it — these are all the stages — and what I'm comparing is the wall-clock time of DRAM-only versus a combination of DRAM and PMem that is cost-equivalent. When I say cost-equivalent: PMem, on a cost-per-gigabyte basis, is half the cost of DRAM. So we can actually expand the memory at the same price, with the same number of slots, and drive down the wall-clock time by 25 percent — about 8,000 versus 6,000 seconds. Now, in a real production environment the data sets would be ten times larger than this one, so you can add another zero: roughly 83,000 versus 62,000 seconds, which is quite a lot of time — about 20,000 seconds saved.

To help our partners deploy this tool — because not every workload is a good tiering candidate; some applications have highly random access patterns, like that first application — we offer a real-time profiler that can be piped into Prometheus, Grafana, that kind of thing. The green envelope here — sorry, it's hard to read — is the total memory being consumed by the application, and the red area is what we consider the hot, active memory pages: the percentage of the memory that's active. In this second example you can see a lot of underutilized, basically quiet memory — just eyeballing it, probably 80 to 90 percent of this memory is largely inactive. That makes it an ideal tiering candidate, to be used in conjunction with our automatic tiering and something like persistent memory to cut costs.
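MemVerge's profiler is proprietary, but as a hint of what "piped into Prometheus" might look like in practice, here is a hypothetical exporter publishing total versus active memory gauges with the standard prometheus_client library. The metric names and the sampling function are my inventions:

```python
# Hypothetical exporter for total vs. active memory, in the spirit of the
# profiler described above (metric names and sampler are illustrative).
import random
import time

from prometheus_client import Gauge, start_http_server

total_gb = Gauge("app_memory_total_gb", "Total memory consumed by the app")
active_gb = Gauge("app_memory_active_gb", "Memory in actively touched pages")

def sample_heatmap():
    # Stand-in for a real page-heat scan; here only ~10-20% of pages are
    # hot, like the mostly-quiet second application in the example above.
    total = 1200.0
    return total, total * random.uniform(0.10, 0.20)

if __name__ == "__main__":
    start_http_server(9101)        # Prometheus scrapes http://host:9101/metrics
    while True:
        t, a = sample_heatmap()
        total_gb.set(t)
        active_gb.set(a)
        time.sleep(15)
```

Plotting active over total in Grafana gives exactly the green-envelope-versus-red-area view used to judge tiering candidates.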
Now, in the last few minutes, I'd like to talk a little bit about how we're starting to work with the Loki community and integrate this software-defined memory into it. Let me first start with the value proposition.

For an operator, I think one of the benefits is that we can increase the density of virtual machines and containers. We also provide what we call noisy-memory isolation. In storage you sometimes worry about noisy-neighbor isolation; there's an analogous problem if you don't design your memory architecture correctly, and our noisy-memory isolation helps keep quality of service consistent. And then we lower the total cost of ownership.

You may or may not be familiar with this, but in the VMware community they announced, in November, something called Project Capitola, where they're building memory tiering into their hypervisor to tier between DRAM and persistent memory — basically the exact same approach I'm describing here. That will let them cut costs, so they have the same value proposition: they're going to push to densify the number of VMs or containers per node, sometimes by up to 25 percent, on your existing systems.

Another big benefit is lowering the carbon footprint. I'll show you a diagram in a second, but persistent memory can use as little as one-tenth the power of DRAM. And I believe this, along with our checkpointing technology and addressing these memory-bound issues, will help bring more HPC and AI/ML applications to Loki-based clouds. Also, if you're an operator with off-peak instances, this will let customers that previously couldn't take advantage of those off-peak instances use them, because we can do transparent checkpointing even for applications that were never designed to be fault-tolerant.

For users, you can see similar benefits. They can run HPC applications — a lot of times people are out of local capacity and need to burst into a cloud, and we can help with bursting because we can increase the agility of the application with these memory snapshots, especially for big-memory applications.

This gives you a quick idea of the power savings. With the latest generation — the 200 series of Intel persistent memory, at 512 GB per module — you can see about one-tenth the power of regular DRAM. So I think people with sustainability concerns should take a look at this. If you're starting with a greenfield, you could actually deploy fewer servers and use a blend — in this example just a conservative two-to-one mix of DRAM and PMem, contrasted with DRAM-only systems — and take a lot of cost out. We're talking about a 25 percent reduction in hardware procurement costs.

Then, regarding the noisy-neighbor issue: persistent memory has a native memory mode, and what we're showing is a comparison of memory operations versus threads, where each application is grouped as four threads apiece. You can see a lot of variation in the native, kernel-based memory mode, whereas with our user-space memory-tiering architecture you see much more consistent behavior across all your applications and process threads. That's what I mean by getting rid of noisy-neighbor problems with memory, and that applies to both containers and VMs.
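As a sanity check on those blend figures, here is the arithmetic using only the ratios quoted above — PMem at roughly half the cost per gigabyte and one-tenth the power of DRAM, in the conservative 2:1 DRAM:PMem mix. Note the quoted 25% procurement saving also comes from deploying fewer servers, which this per-gigabyte view doesn't capture:

```python
# Ratios from the talk: PMem ~ 0.5x DRAM cost/GB, ~0.1x DRAM power.
# The 2:1 DRAM:PMem blend is the conservative mix mentioned above.
dram_frac, pmem_frac = 2 / 3, 1 / 3

rel_cost = dram_frac * 1.0 + pmem_frac * 0.5    # per-GB cost vs. DRAM-only
rel_power = dram_frac * 1.0 + pmem_frac * 0.1   # power draw vs. DRAM-only

print(f"memory cost per GB: {rel_cost:.0%} of DRAM-only")   # ~83%
print(f"memory power draw:  {rel_power:.0%} of DRAM-only")  # ~70%
```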
These are some deployment scenarios for running us in this Loki environment. We can run on bare metal — that's the first column here. We can run underneath the KVM hypervisor, and because we're a Linux-only application, KVM abstracts things sufficiently that there are Windows CAD/CAM applications, like those from Autodesk, that are always memory-bound and can now run and take advantage of big memory. In this other use case we're just dropped into a particular VM, providing memory services to just that application. So we can be very granular: we can pick which applications, VMs, or containers get the memory services. What I'm saying is we don't have to be deployed across the board in your cloud. You can stick your toe in the water — create a machine class that has the augmented memory capabilities and offer that out — or, like I said, deploy on a case-by-case basis inside machines. The last one here is a container implementation; again, we have the Kubernetes/OpenShift operator, so that's all automated. Down below we're showing PMem, but again, by the end of this year you're going to start seeing complete CXL solutions ship, starting with local memory expansion cards.

To get this to work on the OpenStack side, we currently have to make a few modifications to Nova to add a flavor, and once that flavor is instantiated you're basically up and running — we can talk more about that after this talk; we're interested. And again, we've been working with Red Hat: we recently got our certification for an operator on Kubernetes and OpenShift.

So my call to action is: help us, partner with us, to break down this memory wall and free these memory-bound applications. For further reading, there are a couple of links here. One is a white paper we published with Intel — I think it's on the Intel site and also on the Red Hat Marketplace site — about what we've done with Kubernetes and containers. And this is the metagenomics paper that's going to be presented next month at ISMB in the US.

I think I'll stop there unless anybody has any questions — I think we have one minute. Does this make sense to anybody? Good. Yeah, I think it's exciting. This is the last frontier of virtualization, as far as I'm concerned. Memory has always been tethered to CPUs, and now, because of the diversity of compute, it needs to have its own first-class status — and there are going to be a lot more choices for people with memory-bound applications.

Okay, if there are no questions, I want to thank you for your time. I'll hang around here for a few minutes if you want to talk to me one-on-one. Also, our COO, Jonathan Jiang, is sitting in the back if you want to talk to him as well. Thanks.