If you'd like, I can take notes below a line, and we can have a section above for questions so they don't get lost. Sure, that way I can capture things. Okay, so you should be good to go. If you all want a separate recording of this, I will also record the chat; I can set that up while I'm in here. That'd be great. Actually, I've made you the host again, so you'll have to hit the record button. Okay, so I just go to Record and then say "in the cloud"? Okay. And that's a separate one; we'll actually transcode the stream after it's done and put it back on the site, but this will give you access to the chat and everything else, and I can share that with you all after the fact. Awesome, thanks.

Oh, Jimmy, we are making you co-host. Okay. Jimmy, quick question: are we expected to watch to see if there are comments made on the web page as well, or can we safely ignore that? There could be comments made on the web page as well. But there will also be an opportunity for people to join the Zoom, and there's the Etherpad. Okay, so are we supposed to watch the website? You might just keep an eye on the chat to see if anything pops in; if a new message comes in, it should say so. But I'll also put a message in there to direct people to the Zoom if they want to participate. Does that work? Yeah, that'd be great. Or they can put questions in the Etherpad; that's another option. Okay, cool. I will put both up. All right, have a good session.

All right. So is it true that people can join either through the Zoom or just on the website, if they just want to watch without participating? Is that how that works? I believe so. Okay, so then we don't really know the size of the audience, and the discussion won't do much good if they're just watching; I hope they all join through Zoom. Right. I'm going to go to that page also, just to see. Yeah, let's go to the page and put in "please participate directly through Zoom." That's in one minute. So let me share my screen now so that people know what this forum is about. Yes, Jimmy put it in already. The version that we're seeing has the slides on the left as well. Right, I'm trying to fix that. How about now? Perfect. So while I'm sharing my screen, I won't be able to check anything else, like the pad, so I'm going to leave that to you guys. Okay, sure. Orran, if you didn't already, can you click Record? Oh, you did already, never mind. I think we've started; we'll give it another minute before we start. Yeah, sure. So, in case anyone is watching this from the website, we encourage you to join through the Zoom meeting so that you can participate more actively with us. Yeah, we definitely want this to be a conversation. We have a fair number of people now; should we start? Yeah, we can start. That's fine. Okay. Do you want to pick it up?

Sure. So, everyone, welcome to the Project Caerus forum. My name is Hui Lei, and Professor Orran Krieger and I will be moderating this forum together. This forum and project are associated with Open Infrastructure Labs, so let's briefly introduce that lab first. And while Orran and I talk, we want this to be an interactive conversation.
So feel free to jump in, ask questions, and make comments. The presentation is meant to seed the conversation, so feel free to jump in anytime you want. So, Orran?

So, the general idea behind Open Infrastructure Labs (Michael described it in the keynote yesterday) is to bridge the gap between operators and open source developers, to have a real cloud that the open source community has access to. We're creating prescriptive cloud-in-a-box offerings that we operate for real, initially in the MOC and then replicated to multiple clouds, starting with academia and federated together, so that the open source community has access to a real cloud, can see the problems that operators are facing, with operators directly involved, eventually going all the way to integrating upstream code very rapidly into a cloud that people are actually using. And what we're doing is not about one open source project; it's about all the open source projects that are needed to create a real cloud that users are actually using. Today in the MOC we have over 10,000 users of services on top of it, and we expect this to grow substantially as we kick off the next phase, the new research cloud. This involves AI layers, big data, dataset repositories, stores, compute, function-as-a-service, everything. In the way Michael described in the keynote, we want to be able to stand all those layers up in a consistent way, with the automation around doing that, so that multiple clouds are offering the exact same services in the exact same way, federated together. That's the focus of Open Infrastructure Labs: the first focus, in some sense, is about enabling that standardization, prescriptiveness, and federation. But in doing that, it also allows for research across these different layers. There's a set of specific projects; we're going to focus on one of those, highlighted in blue, the Caerus one. Do you want to describe it and get into that?

Um, okay. I'm having problems here. All right. So, as Orran mentioned, Caerus is part of Open Infrastructure Labs. So why do we want to have this project? Today, many companies have to support emerging new workload patterns. One of them is what we call continuous data streaming, where you have data streams generated continuously from all kinds of sensors and events. We want to be able to cleanse and curate the data, manage the quality of the data, and protect the privacy of data subjects and owners. We also want to be able to integrate the streaming data with historical data, and then generate insights and make predictions. Much of this curation and prediction has to happen in real time, and that places a huge demand on the underlying infrastructure. Another workload pattern is what we call massively parallel data processing. If you look at machine learning jobs or MapReduce jobs, they all involve a large number of concurrent threads processing data in parallel against a huge data set, which imposes requirements on the underlying data bandwidth. So, on the one hand, we have these new workload patterns that have new demands. On the other hand, our computing infrastructure is becoming more and more complicated.
So we're going toward what we call the hybrid multicloud, where all the computing resources are heterogeneous, whether it's CPU, GPU, or FPGA, along with all kinds of storage systems: HDD, SSD, SCM, and things like that. So systems are becoming more heterogeneous, and they're also becoming more disaggregated. On the one hand, we have very demanding workloads; on the other hand, we have a more and more complicated computing infrastructure. So how can we tackle those challenges?

We feel that a promising direction is to enhance the coordination between compute and storage. Here's what we mean by that. Today, computer systems are highly stratified: there are different layers in the system, a storage layer, a compute layer, an application layer, and so forth. People are used to doing design, development, and innovation within the boundary of each layer, focusing only on that layer itself, for good reasons of modularity, maintainability, and things like that. And those different layers have very well-defined but restrictive interfaces, which limits the kind of information available to each layer. We feel that if we can somehow bridge the semantic gap between those layers, so that a lower layer has more information about higher-level semantics, the lower layer will be able to do more intelligent things; it will become smarter and therefore deliver better performance. Just think about compute and storage. Today, the interface between compute and storage is very restrictive: I request an object or a block, and you give me the content. It's a very simple API. We conjecture that if the storage layer could somehow have more information about what kind of data processing we're trying to do and what kind of application we're trying to support, then it could make more intelligent decisions about caching and all kinds of other storage management, to improve overall performance (a toy sketch of what such a hint-carrying interface might look like appears a little further below). That's what Project Caerus is about: facilitating better, enhanced coordination between compute and storage by leveraging higher-level semantics.

Project Caerus will not start from scratch; it will build upon multiple existing open source projects. Let's take a quick look at those projects. Orran? Sorry, I always forget to unmute. So there's a series of projects that has been going on in the Mass Open Cloud, supervised by Peter Desnoyers and myself, with a large team of students working on them. The first one is a caching layer: we've been developing a multi-layer cache on the access side of the network bottlenecks within the data center. The idea here is to share information across these layers: we're putting storage where compute is actually residing, so we're trying to cache the data where the compute is, but also managing a global cache. The idea here is a cooperative cache, or collective cache. It may well be that if a new data set is being accessed in one cluster, that's a really good predictor that the data set will be accessed in another cluster.
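To make the hint-carrying interface idea concrete, here is a minimal illustrative sketch in Python. All of the names here (BlobStore, SemanticStore, AccessHint) are hypothetical; this is a sketch of the general idea under stated assumptions, not Caerus's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Today's restrictive interface: ask for an object, get bytes back.
class BlobStore:
    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

# Hypothetical hint-carrying interface: the caller also describes how
# the data will be used, so the storage layer can make smarter
# caching and prefetching decisions.
@dataclass
class AccessHint:
    job_id: str                          # which analytics job is asking
    access_pattern: str = "scan"         # e.g. "scan", "random", "once"
    columns: Optional[List[str]] = None  # columns of interest, Parquet-style
    will_reuse: bool = False             # will later stages read this again?

class SemanticStore(BlobStore):
    def get_with_hint(self, key: str, hint: AccessHint) -> bytes:
        # A real implementation might prefetch neighboring partitions for
        # scans, pin reused data in a fast tier, or return only the
        # requested columns. Here we only act on the reuse hint.
        if hint.will_reuse:
            self._pin_in_cache(key)
        return self.get(key)

    def _pin_in_cache(self, key: str) -> None:
        pass  # placeholder: keep this object in a fast cache tier

store = SemanticStore()
store.put("sales/part-0001", b"...")
data = store.get_with_hint(
    "sales/part-0001",
    AccessHint(job_id="q42", access_pattern="scan", will_reuse=True),
)
```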
Layered on top of that cache, we've been building a service called Kariz, which is a controller for that cache. And this violates the boundary between compute and storage even more: what we've been doing is extracting the directed acyclic graph (DAG) of the compute jobs that are going to happen out of Spark and Flink environments, and using that, since it tells us the future accesses from these different environments, to inform what data is prefetched into, cached in, and evicted from the cache. So again, these are clear demonstrations of where we're trying, in some sense, to violate those traditional interfaces to get much higher performance. We've been able to extract those DAGs from multiple environments, and we've been able to further use historical information about how long jobs take with different amounts of caching, to achieve much higher performance for applications by caching portions of their data sets. We want to go even a step further than that: we want to start understanding that this is a Parquet file or an Arrow file, or even start integrating with the Red Hat team's S3 Select in these interfaces, so that we're caching only the data people are actually accessing; that brings semantic information into caching.

The last project we've been doing that violates the historical boundary is implementing mutable storage over immutable storage. What we've been doing is building a block store, essentially the equivalent of a virtual block device or elastic block storage. We're using local NVMe or RAM as a write-back cache, but the write-backs are actually happening out of place, over immutable storage in the back end. What that means is that the client is actually controlling where the data is stored, and only at the client side do you actually see that combination as a real volume. And that results in massive acceleration; we can go into more detail on that later if people are interested, I don't want to go too deep. But these are three example projects that show how breaking the boundary and the interface between compute and storage really can result in dramatic performance improvements. Back to you.

So, another project in this area is what we call semantics-aware NDP and caching. There are three aspects to this project. The first one is near-data processing (NDP), which means we want to push data processing tasks from the compute cluster to the storage cluster. And this is different from existing work on NDP. If you look at existing work in Spark or S3, they're basically only pushing down very basic filtering and selection operations. Here we want to push down a much broader array of operations: not just filtering, but also functions like join, group-by, and search. We also want to push down AI functions like data shuffling, classification, and so forth. And in addition, we want to allow users to define their own functions and push those UDFs down as well (a toy sketch of what such a pushdown might look like appears below). So this will be one key differentiator for the kind of NDP we're trying to support.
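As an illustration of the pushdown idea, here is a minimal sketch in Python of shipping an operation to the storage side so that only results, not raw partitions, cross the network. The names (StorageNode, the request fields) are hypothetical; a real system would push a serialized query plan or UDF rather than a Python callable.

```python
from collections import defaultdict

# Hypothetical storage-side executor: the compute cluster ships a small
# description of the work; only the results travel back over the network.
class StorageNode:
    def __init__(self, partitions):
        self.partitions = partitions  # {partition name: list of row dicts}

    def execute(self, request):
        rows = (row for part in request["partitions"]
                    for row in self.partitions[part])
        if "filter" in request:            # basic pushdown, S3 Select-style
            rows = filter(request["filter"], rows)
        if "group_by" in request:          # richer pushdown: aggregation
            key, val = request["group_by"], request["sum"]
            groups = defaultdict(float)
            for row in rows:
                groups[row[key]] += row[val]
            return dict(groups)
        return list(rows)

node = StorageNode({
    "sales/p0": [{"region": "EU", "amt": 10.0}, {"region": "US", "amt": 5.0}],
    "sales/p1": [{"region": "EU", "amt": 7.5}],
})

# Only the per-region sums come back, not the raw rows.
result = node.execute({
    "partitions": ["sales/p0", "sales/p1"],
    "filter": lambda row: row["amt"] > 1.0,
    "group_by": "region",
    "sum": "amt",
})
print(result)  # {'EU': 17.5, 'US': 5.0}
```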
Another aspect of the work is multimodal caching. If you look at a traditional cache, it just caches source data from the storage system, but here we want to cache additional modalities of data as well. One is intermediate data, meaning the intermediate computation results, so that we can access those results directly without recomputing anything. Another kind of data we want to cache is metadata: for example, information on index values, or on what kind of data is stored within each partition. We feel that kind of metadata can help us make better decisions about which partitions to access and which partitions we can safely skip (a small sketch of metadata-based partition skipping appears below). And another thing we want to cache is repartitioned data: if we can somehow repartition the data based on the workload, that can facilitate access as well. So basically, that's the second aspect, multimodal caching: not just source data caching, but other forms of data as well.
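Here is a small sketch, again with hypothetical names, of how cached per-partition metadata (min/max statistics per column, in the style of Parquet footers) lets a query skip partitions without reading them from storage.

```python
# Hypothetical cached metadata: per-partition min/max statistics for a
# timestamp column, kept close to the compute side so pruning is cheap.
partition_stats = {
    "events/p0": {"ts_min": 100, "ts_max": 199},
    "events/p1": {"ts_min": 200, "ts_max": 299},
    "events/p2": {"ts_min": 300, "ts_max": 399},
}

def partitions_to_read(lo, hi, stats):
    """Return only partitions whose [min, max] range overlaps [lo, hi];
    all the others can be safely skipped without touching storage."""
    return [name for name, s in stats.items()
            if s["ts_max"] >= lo and s["ts_min"] <= hi]

# A query over timestamps 250..320 touches only p1 and p2.
print(partitions_to_read(250, 320, partition_stats))  # ['events/p1', 'events/p2']
```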
And then the third aspect is the semantics-aware aspect. When we talk about NDP and multimodal caching, we're talking about multiple individual techniques. How do we coordinate those techniques, and how do we make intelligent decisions within each technique? This is where we want to break the system boundary: we want to leverage information from the compute side and the application side to make those decisions more intelligent and more automatic. So that's the gist of this project.

I hope this gives you an idea of the kinds of technologies we think can be useful for addressing today's performance problems in a hybrid cloud environment, and this is where we want to have an open discussion with everyone here. Specifically, these are the questions we want to ask you. Do you feel this is the right direction to explore, and if so, what are the questions you can think of that you want us to discuss as a group today? Also, in addition to the things we just described, what other innovation opportunities do you see in this area, and what are the hard technical problems we have to solve? And we saw the multiple projects we're already building on, and there are also related initiatives in other organizations and other foundations; how can all these different efforts come together to form one big, coordinated effort? Those are some of the things we have in mind, and we would like to have that discussion with everyone here.

Yeah. And the kind of last point here is that this opportunity to violate the boundaries between projects is something that's going on all the time, in a proprietary fashion, inside the big public clouds. What we're trying to do is give the open source community a way to do that, and then expose it, say through the MOC, to real end users. We have all these people doing AI on top of the cloud who are using these different services, and that gives us the opportunity to actually violate those boundaries. And we wanted to ask this community: what do you want to see happen? And further, some of the projects we just outlined are, in some sense, research projects that break those boundaries. What are the existing open source projects from this community, and how do we see this working together between the broader open source community and the research community, which I guess I'm more representative of? At this point, people should start jumping in with questions or comments, either on the Etherpad or by just unmuting yourself and joining the conversation. And we're also hoping that we have answers for your questions. Or that we don't, but they're interesting.

Okay, this is Tony. I have a couple of questions, may I ask? Okay. Um, NDP has been around for quite some time. As far as I know, a lot of companies and different vendors have their own implementations to support various compute use cases. For example, with SSDs, they use the SSD controller to offload storage-side compute tasks into the SSD itself. Data accelerators have been built based on certain FPGAs or dedicated hardware to offload specific workloads. So people have realized that there is a need to standardize those efforts across the industry and open source, and SNIA has several groups working toward that goal. So, does the Caerus project intend to collaborate with the SNIA-related groups and standardize NDP, at least in the areas you mentioned, like big data and AI? This is the first question. Go ahead.

So, I guess I tend to think that standardization efforts take a number of generations. What I'm hoping is that we're going to be pushing really aggressively for POCs in common environments with real users, where we demonstrate things at some level. So I think we're complementary to that, and I think we'd love to collaborate with those standardization efforts, but at the same time, in some of these cases, we're really pre-standardization: let's actually figure out what kind of gains we're going to get. But Tony, you know a lot more about that. Does that make sense, or is their work evolved enough that we should be adopting things they have already standardized?

I think that's a great answer. As far as I know, SNIA has certain standards trying to standardize several efforts, like in-memory compute and storage-side compute, and is also trying to develop different technologies. So I just want to make sure that this great project and its technology can contribute in the earliest days, so that SNIA and all those parties have at least some consensus that it is in the picture. Otherwise, if the paths diverge, the work is hard to combine later. So, earlier involvement: I do understand your concern that it's a really early stage, but at least maybe a loose collaboration is an approach, so that Caerus is at least in their minds when they're trying to make a standard.

So, Tony, this is my take. First, I feel that the term near-data processing, or NDP, is an abused term: when you talk to people who are working on NDP, they might tell you very different things. My take is that if you look at the memory hierarchy, main memory, storage class memory, SSD, and things like that, there is ongoing work to push computation to each layer of that hierarchy. So not all NDP people are doing the same thing; they're really focusing on different parts of the memory hierarchy. That's one thing. And secondly, if you look at what we're doing, at least what we're trying to focus on in Caerus is a different aspect from the ones you mentioned. We're focusing on how to push computation from the analytics cluster to the storage cluster, over the network. That's the major hop we're focusing on initially, because we feel that's the path that will make the biggest difference. Obviously, ultimately we will have to leverage all that work and be able to push computation all the way down the memory hierarchy, but right now we won't be able to cover all the hops on that path.
We're really focusing on the most important hop, which is from the analytics cluster to the storage cluster. So that's what I want to clarify: at least in the beginning, we're not trying to conquer the entire memory hierarchy. That's one thing. And in essence, our work is complementary to what SNIA is focusing on. Secondly, I agree with you and Orran in the sense that standardization will have to come after the technology is more mature. Right now, we're still at the phase of developing the innovation, developing the technology.

This is Peter Desnoyers from Northeastern; I wanted to follow on with what was said here. I've actually had an industry career, and I've been burned multiple times in my career by various standards-group interactions. I think what we're seeing here is that, because of the mix of open source and academic research, we have something where the initial trials of ideas, which at most companies would be done internally as proofs of concept, are instead happening in the open. This is really the work that needs to come out and possibly fail, but hopefully work, before you can coalesce around a standard; my early career was marked by standards groups getting way out ahead of the technology and leading in directions that turned out not to work.

And I really like that point. One of the things we'd really like to invite is people who are working on other aspects of this to experiment in the open: demonstrate it in, say, the MOC, and expose it to real users, so we can do comparisons on the same hardware and in the same kind of environment, with real usage. I mean, that's part of how the public cloud is so successful at innovating: they develop things, they have experimental services, people start using them, and then they can evolve and wrap