Okay, hi. First of all, I want to introduce myself. My name is Bernie Wu; I'm with MemVerge, and I want to thank you for sticking around for the last presentation of the evening. What I want to do is talk quickly about software-defined memory and how we've started to integrate it with LOKI to run big-memory workloads. I'm going to repeat this presentation tomorrow at 9:40, and since I'll have a little more time then, I'll actually have some more material to cover as well.

Very quickly, our company was started in 2017 and is based in San Jose, California. The three founders came from Caltech, we're currently a Series B company, and we have the strategic investors you can see here.

What we're about is breaking down the memory wall; since we're here in Berlin, I figured I'd use this symbol. We're here to help break down the memory wall and free memory-bound applications. What I mean by that is that memory capacity and bandwidth haven't kept up, especially with HPC and AI/ML applications. The DDR bandwidth on a typical server also hasn't kept up with the core count, so if you look at that, it has been going in the wrong direction too. And the other trend you see is that compute is becoming more diversified.
There are GPUs, TPUs, DPUs, CPUs, FPGAs, a lot of accelerators. Memory is becoming more diversified as well, as you'll see on my next slide, and in the second half of this year there's a new class of memory coming out called CXL memory. That memory will actually be part of the CXL fabric; the initial instantiations will run on the local PCIe Gen 5 bus of the newest Intel Sapphire Rapids and AMD Genoa servers. Over time, starting with CXL 1.1 this year and then CXL 2.0 late next year and the year after, we will see memory becoming disaggregated from the CPU, becoming an independent, what we call first-class citizen, like storage or networking or anything else in the data center.

Memory itself is also getting more complicated. There's HBM now; there's DDR DRAM, which has been around for a long time; there's persistent memory, produced by Intel and sometimes called Optane memory, which also runs on the DDR bus; and then there's this CXL memory that will be introduced later this year. Managing all those different tiers, which all have different latencies and properties, requires a new layer of software, and that's what MemVerge has been working on. What we do is virtualization and auto-tiering to help applications transparently consume this stack of different types of memory, and we also provide data services to make your applications more mobile even with this memory stack.

Just to give you an idea of what our software platform looks like: we run in Linux user space, directly below the applications shown in orange, and I'll talk more about those applications. First, we virtualize the memory below us and create memory pools. In many ways this is similar to software-defined storage, except you should think of it in a memory context. Then we do automatic, intelligent auto-tiering: we profile the applications assigned to us and automatically promote and demote hot and cold memory pages. The key thing is that we deliver all of this transparently to the application. A lot of the HPC applications we work with were written in the 1970s; we don't change a bit of them, and we can still run them, and actually run them better.

We've also implemented something called a memory snapshot, which does a copy-on-write snapshot and also captures the machine state. This allows us to roll back an application, or clone it and even move it to another system. That can be useful for operators that need to do some sort of maintenance, and it also gives us higher mobility and the ability to do what we call transparent checkpoint and restart of applications. We've also spent a lot of time working with outside orchestration tools: we have Ansible, Slurm, LSF, UGE. And we've integrated with the various storage APIs for snapshot coordination on AWS, Azure, and now LOKI.

These are some examples of data-intensive use cases; I'll go through them in more detail.
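The promote/demote idea described above can be sketched in a few lines. This is purely an illustrative toy model of hot/cold tiering, not MemVerge's actual algorithm: it tracks access counts per page and keeps the hottest pages in a limited fast tier, leaving the coldest in the slow tier.

```python
# Toy model of hot/cold page tiering (illustrative only; not the real
# Memory Machine algorithm). Pages with the highest access counts are
# promoted into a limited "fast tier" (think DRAM); the rest live in
# the "slow tier" (think PMem or CXL memory).

def tier_pages(access_counts, fast_capacity):
    """Return (fast_tier, slow_tier) page-id sets from per-page access counts."""
    # Rank page ids by access count, hottest first.
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    fast = set(ranked[:fast_capacity])
    slow = set(ranked[fast_capacity:])
    return fast, slow

counts = {"p0": 900, "p1": 3, "p2": 450, "p3": 0, "p4": 120}
fast, slow = tier_pages(counts, fast_capacity=2)
# The two hottest pages (p0, p2) land in the fast tier.
```

A real tiering engine would do this continuously and transparently, with aging of access counts, but the placement decision has the same shape.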
These are the ones we've already been working on. Our forte, quite honestly the one we've done the most work on, is genomics; we just won a Best of Show award a month ago at the Bio-IT World conference. But we're also doing things in EDA, AI/ML, media and entertainment when it comes to simulation and rendering with tools like Blender and Houdini, HPC modeling, financial services, and in-memory databases. And now I think there's an opportunity to work with you folks here in the cloud to do things like increase the density of virtual machines and containers, lower costs, reduce energy costs, and also help HPC workloads run better on these kinds of OpenStack clouds.

So I'll go through some quick examples of these use cases. This is an example comparing MySQL running in a KVM virtual machine, with sysbench measuring queries per second. The first column here is 128 gigabytes of DRAM, so it's one hundred percent DRAM, and the other columns are mixtures of DRAM and persistent memory. By varying the ratios, you can see we get nearly the same performance. The second column here is one hundred percent PMM, so that's the worst-case situation for persistent memory versus DRAM; on reads, persistent memory is typically maybe three times slower than DRAM.
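To see why blends can approach all-DRAM performance, here's a back-of-the-envelope model (my own illustration, with normalized numbers based only on the roughly 3x read-latency gap mentioned above): if tiering keeps the hot pages in DRAM, the effective latency is a weighted average dominated by the DRAM term.

```python
# Back-of-the-envelope effective-latency model (illustrative numbers only).
# If auto-tiering serves most accesses from DRAM, the average access cost
# stays close to DRAM even when much of the capacity is PMem.

DRAM_LATENCY = 1.0   # normalized read latency
PMEM_LATENCY = 3.0   # roughly 3x DRAM on reads, per the talk

def effective_latency(dram_hit_fraction):
    """Average latency when a given fraction of accesses hit the DRAM tier."""
    return (dram_hit_fraction * DRAM_LATENCY
            + (1.0 - dram_hit_fraction) * PMEM_LATENCY)

# With 90% of accesses landing on hot pages held in DRAM, the blend is
# only 20% slower than pure DRAM, versus 3x slower for pure PMem.
blended = effective_latency(0.9)
worst = effective_latency(0.0)
```

The 90% hit rate is a made-up input; the point is simply that the worst case (all-PMM) bounds one end and a good hot/cold split lands near the DRAM end.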
You're seeing that in this benchmark. But if you use blends, with our auto-tiering algorithms we can pretty much mask the performance difference.

Another use case we've outlined with our Memory Machine architecture is this problem of systems crashing. The bigger the memory, the longer it takes to rewarm it if something crashes, and our in-memory snapshot drastically improves what we call the recovery time and recovery point objectives. We can take the memory snapshot and then asynchronously either store it in persistent memory or dump it to something like object storage for safekeeping. That is much faster than conventional ways of checkpointing applications, where you have to serialize and deserialize the memory, which takes a long time. Here's an example with a Redis database of 300 million keys: the normal serialization process to drain it out of memory takes 15 minutes, but if you just dump it to persistent memory, we can do it in half a second.

Another use case we have is in genomics, as I mentioned. A lot of times genomics involves a highly iterative process, with calibration going on; we're showing that iteration here as stages 3a, 3b, 3c. A lot of these applications were written as single-threaded pipeline architectures, and in some of these cases we've been able to parallelize them by cloning memory snapshots of the application and running the clones in parallel with different parameter or calibration settings. So we can drastically reduce the wall-clock time. We also let people do forensics: if they had a problem with their application, they can look at intermediate checkpoints. These checkpoints, or snapshots, can be inserted programmatically.
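The cloning pattern described above, running the same in-memory state forward under several calibration settings at once, can be loosely approximated on Linux with process forking, which gives each child a copy-on-write view of the parent's memory. This is only a sketch of the concept using the Python standard library, not the MemVerge snapshot mechanism.

```python
# Sketch of "clone the in-memory state and run stage 3 in parallel with
# different calibration settings". On Linux, the "fork" start method gives
# each worker a copy-on-write clone of the parent's memory, loosely
# analogous to cloning a memory snapshot. Not the MemVerge API.

import multiprocessing as mp

# Pretend this is a large dataset loaded into memory by an earlier stage.
STATE = list(range(1000))

def run_stage3(calibration):
    """Run the hypothetical 'stage 3' computation with one setting."""
    return calibration, sum(x * calibration for x in STATE)

ctx = mp.get_context("fork")       # requires a Unix-like OS
settings = [1, 2, 3]
with ctx.Pool(len(settings)) as pool:
    results = dict(pool.map(run_stage3, settings))
```

Real memory snapshots also capture machine state and work across machines, which plain fork() cannot do, but the fan-out structure is the same.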
So with something like our studio tool, we can give you a programmatic interface where you can insert checkpoints wherever appropriate in your pipeline.

Overall, what this does is reduce runtime. This was an 11-stage genomics workflow for single-cell RNA sequencing. The higher these lines, the worse the performance; those are in seconds, so lower is better. The bottom line is that you can see, for each stage, how much we improved the execution time, and overall we reduced the runtime by 60% using our memory-tiered operations with Intel Optane. The same goes for rollback: as I mentioned earlier, rollback is extremely fast when we use persistent memory to store the snapshots. So we think we've discovered what we consider to be a killer application for persistent memory, which is storing memory snapshots, and you can see this drastic reduction: everything is down to less than a second, basically.

Another example is metagenomics research. This application, called metaSPAdes, creates a giant graph; it's a de novo genome assembly exercise, and the graph can be enormous, up to 200 times larger than the original input data. This job also typically takes 11 days to run. For the first time, we've been able to run it on a commodity Intel Ice Lake system with six terabytes of DRAM and PMM; in this case we actually only needed two terabytes. What we did was basically create a 20-to-1 ratio of gigabytes to cores on this system, and we were able, for the first time, to execute this whole program successfully on a single machine. We feel that helps to democratize metagenomics research, which, if you don't know, is used to study the human biome, soil, agriculture; it has a lot of broad-reaching applications, if you want to know more about it.
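Going back to the programmatic checkpoint interface mentioned above: the control flow can be illustrated with ordinary serialization (the slow, conventional approach the talk contrasts against, but the structure is the same). All names here are hypothetical; this is not the MemVerge API.

```python
# Conceptual pipeline with a checkpoint inserted between stages.
# Real memory snapshots avoid the serialize/deserialize cost shown here;
# this sketch only illustrates where checkpoints slot in and how an
# intermediate state can be inspected or rolled back.

import os
import pickle
import tempfile

def stage1(state):
    state["reads"] = 100
    return state

def stage2(state):
    state["aligned"] = state["reads"] * 2
    return state

def checkpoint(state, path):
    with open(path, "wb") as f:
        pickle.dump(state, f)

def restore(path):
    with open(path, "rb") as f:
        return pickle.load(f)

ckpt = os.path.join(tempfile.mkdtemp(), "after_stage1.ckpt")

state = stage1({})
checkpoint(state, ckpt)      # checkpoint inserted after stage 1
state = stage2(state)

# Forensics / rollback: reload the intermediate state from after stage 1.
rolled_back = restore(ckpt)
```

With a snapshot-based implementation, `checkpoint` would be a sub-second copy-on-write operation instead of a full serialization pass.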
There's a link I've included down here below, and this is a quote from the paper we're publishing that basically says this is a good use case. We're giving a talk on this at ISMB next month.

This is another example, with vehicle noise analysis in Nastran, which was developed in the space shuttle era. We've reduced wall-clock time by 25% while holding cost equivalent with a mixture of DRAM and PMM. And this is a nice tool we have where, before you even deploy our software, you can profile your applications by process ID or whatever. The green envelope shows you the total memory consumed, and the red area shows you the hot memory. So when you look at these two graphs: this first graph is obviously not suitable for memory tiering, because random accesses are going all over the place. On the other hand, look at this application: you can see a lot of this memory is idle, it's not hot, and that's an ideal use case for memory tiering. So this tool can help you identify which applications to move to a memory-tiered architecture.

Now I'd like to quickly talk about LOKI and our software-defined memory integration. The value we see for the LOKI community, for the operators, is increasing VM density and container density, plus memory noise isolation and lower total cost of ownership. PMM is only about half the cost of DRAM on a per-gigabyte basis, and I think PMM also uses only one tenth the power of DRAM, because DRAM is based on leaky capacitors. So there's also a sustainability benefit here, a carbon-footprint benefit. And then HPC applications can now be attracted to these clouds, and we can also do off-peak checkpointing. We do this right now in the public cloud, where we take snapshots so that applications that weren't fault tolerant can be run on spot instances; we can do the same in the OpenStack community.

And for the users, they get to put more of their HPC applications on the cloud, get this checkpointing benefit, and we can also do things like bursting and increased mobility for applications that are memory-intensive. This slide just gives you a quick idea: again, PMM at 512-gigabyte sizes is one tenth the power of DRAM, and densification is possible, so you can cut the total number of servers and the total amount of DRAM consumed. And here's a noisy-neighbor example: these are multiple processes, each with four threads, and you can see that with Memory Machine we have a very flat profile. These are measuring I/Os, or memory accesses; we don't have a noisy-neighbor problem, because our architecture isolates all the memory objects for each process.

From a deployment standpoint, since we're in user space, there are multiple options: we can deploy on bare metal; we can deploy underneath the KVM hypervisor, which is our current OpenStack integration, and provide memory services to the guest VMs; we can run inside a Kubernetes architecture as a container, using a Kubernetes operator to deploy; or we can just run in a guest OS inside a virtual machine.
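The density and TCO argument above is simple arithmetic. Here's an illustrative calculation using the talk's rough relative figures (PMM at about half the per-gigabyte cost and one tenth the power of DRAM); the absolute prices are invented for the example.

```python
# Illustrative cost/power comparison for an all-DRAM server versus a
# 1:3 DRAM:PMM blend. Relative figures follow the talk (PMM ~half the
# cost per GB, ~1/10th the power of DRAM); absolute numbers are made up.

DRAM_COST_PER_GB = 10.0     # hypothetical $/GB
PMM_COST_PER_GB = 5.0       # ~half of DRAM, per the talk
DRAM_POWER_PER_GB = 1.0     # normalized W/GB
PMM_POWER_PER_GB = 0.1      # ~1/10th of DRAM, per the talk

def config_cost_power(dram_gb, pmm_gb):
    """Total memory cost and power for a given DRAM/PMM capacity mix."""
    cost = dram_gb * DRAM_COST_PER_GB + pmm_gb * PMM_COST_PER_GB
    power = dram_gb * DRAM_POWER_PER_GB + pmm_gb * PMM_POWER_PER_GB
    return cost, power

all_dram = config_cost_power(512, 0)     # 512 GB, all DRAM
blended = config_cost_power(128, 384)    # same capacity, 1:3 blend
cost_savings = 1 - blended[0] / all_dram[0]
```

Under these assumed numbers, the blended configuration delivers the same capacity at 37.5% lower memory cost and roughly a third of the power; the real savings depend on actual street prices and the workload's tolerance for the blend.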
So there's a variety of ways to control the deployment granularity, which means this only has to be implemented on an incremental basis in your data center. We did a quick modification of Nova, which we could talk about later, to get it to work in OpenStack, but we want to work with somebody else to bring that further along and make it more automated. We're also certified with Red Hat, and the Kubernetes operator also works with the open-source versions.

So my last ask for you is to partner with us to help break down the memory wall and help free memory-bound applications, especially for this community. For further reading, this is a white paper we published with Intel that talks about us, Kubernetes, and containers, and this is the metagenomics paper we'll be presenting next month at the ISMB conference on computational biology.

With that I'll stop; I think I'm out of time. But if you come back tomorrow, you can hear this again at a slower baud rate, and I'll also talk about another project we're working on called DMTCP, which is for MPI-based HPC workloads and allows us to checkpoint across an entire synchronous cluster. Okay, thank you. I don't know if I can take any questions, but I'll hang around here for a few minutes. Thank you.