Do you want to start with this then? Sure. OK, everybody. For the first session after lunch, we have Orran Krieger, a professor at Boston University and a principal investigator at the Massachusetts Open Cloud, along with Steven Huels, a senior manager at Red Hat's AI Center of Excellence, who will be talking about using the Massachusetts Open Cloud to perform data science experiments.

Thank you, Ashley. All right, hi, everyone. So I'm Steven. I'm not sure how many of you were in here earlier for Daniel's session, where he talked about Red Hat's strategy toward AI. One of the topics he covered was the Open Data Hub and the Data Hub implementation we have in-house at Red Hat, and how we've now deployed that up to the Massachusetts Open Cloud. So today we're going to talk a little bit about the Massachusetts Open Cloud, then we're going to talk about which components of the Data Hub we've put in there, and then we'll do a demo. For those of you that are in academia or run an open source community, this will be directly applicable: how you can put data into that environment, analyze that data, and feed that back into your applications. To start, we have Orran with us here today, and for those of you that aren't familiar, he's going to talk about what the MOC is and what it's been designed to do.

All right, so I stole this chart from Dan. I love it. What's today's cloud about? It's basically a reinvention of the mainframe. So it's got all these benefits, right? But what you've got is you're leasing compute power in an environment you're locked into, right? It's incredibly expensive. You can put your data up into the cloud for free, but it's really expensive to pull your data out of the cloud. And it's using open source, but it's not really contributing back to the open source community. So we know that cloud gives you this enormous advantage: for users, elasticity; for data centers, better, more efficient operations.
We can locate these data centers where power is cheap. So a lot of people feel that the future of computing is in the cloud, but is it really going to be in these proprietary clouds that I talked about a minute ago? We don't think so. We think an alternative model of a cloud is possible, what we call an open cloud exchange, where multiple different entities can stand up infrastructure and compete with each other and collaborate with each other, where multiple different entities can stand up layered cloud services on top of that, and where we can each stand up research offerings alongside these production offerings.

When we first started having this idea of creating an open cloud — and I'm not going to talk about the technologies we've built to do this — we started talking to economists. There are already 48 types of VMs on Amazon; what's going to happen if we have something where there are thousands of types of VMs? That seems crazy and complicated. But the thing they talked to us about was: you're going to get intermediaries — platforms like big data platforms, web platforms, HPC platforms, and in fact we're going to be talking about one of those intermediaries today — that can select between these thousands of different options in an open cloud where there's lots of competition and variety. So this is our vision of an open cloud exchange. And the idea is that once we've created this in one data center, we can actually replicate it out to other data centers.

This isn't crazy. Current clouds are incredibly expensive. Since this is actually being filmed and put on YouTube, I can't tell you the numbers, but let's say it's well over 20 times as expensive to use one of today's clouds than if you took a modern data center — at least our data center at the MGHPCC — and amortized the cost of the data center over 20 years, the cost of computers over three years, and the cost of operations staff. It's incredibly expensive.
The on-demand pricing, and even the lease pricing, is a lot more expensive than our costs to operate these facilities. Much of industry is locked out of today's clouds. There's obviously lots of great software to develop these with, and there's lots of people that don't want to be locked into the cloud.

So we had an incredible opportunity in this region. The universities in the surrounding area had actually built a new data center. It's really — up there. There we go. The MGHPCC data center — this is an incredible facility. This was built by a combination of MIT, Harvard, BU, Northeastern, and the UMass system. It's 15 megawatts. I don't know if that means anything to you, but that's like the power requirements of a large town. This thing has two acres — this is the only place I've seen where you measure computer space in acres. So two acres of space for computers. An incredible facility, located right next to the hydroelectric dam in Holyoke. 70% of the power is green, and because of the nature of this, power prices are incredibly cheap compared to Boston.

So here's the opportunity to do this. This is a picture of us with the governor announcing the project to create an open cloud. We renamed it the Mass Open Cloud, from Massachusetts Open Cloud. It's got all these universities participating in the effort, the Air Force, the state. Our core partners — Red Hat's been a really key partner for us from the beginning — and a lot of other partners that have contributed in various ways to the project.

It's real. It's an operating cloud today. These numbers were as of last week — this is this week. So it's a functioning cloud at a relatively modest scale. We have about 400 users directly, and last time we figured it out, over 10,000 users indirectly using the service in various ways. And this is a different chart there.
And it's resulted in tens of millions of dollars of grants, because this is one of the few clouds out there where researchers can get involved, do innovation, change things. And it's moving from a project that a small team is developing as its own kind of isolated project to something the IT departments are actually working on as a production service, because it's being used by real users that want to get their work done. So increasingly, what's happened is people aren't just using the data center for high-performance computing — they want to do data analytics, they want to do machine learning, the data science initiatives from all of our different universities.

Two of the projects I'd like to mention that are just coming up now: the New England Storage Exchange, which got funded for 10 petabytes of storage to start off with, but doubled in size before it even started — so it's now 20 petabytes of storage that will be available for people to host their datasets from all over the region. And Harvard Dataverse, which is the largest dataset repository in the world and today runs on AWS, is being moved to the MOC. So it's a going concern, and we're really excited about this project, because this actually is our realization of an intermediary. And it's also something which is really in high demand by the users of the MOC.

All right. So like we talked about earlier, the Data Hub is based on a set of common services and common applications that are out there in the data science and data management world. And we see that basically breaking into three themes. First, we have our platform and workflow theme. These are what the operators are concerned with: how do I operate the data platform here at the bottom? How do I make sure I have my identity policies defined? How are we doing the management, operations, monitoring, and alerting of all of those systems?
And then how are we making this a self-service system, so that our users can come in and do this on their own without constantly having to interact with the administrators themselves? And from a workflow life cycle standpoint — this is something DevOps has down pretty well — this is based on Kubernetes, Jenkins, things like that, for managing the code pushes, the application life cycles, that layer itself.

Moving further up the stack, that's where we start to put in our reusable models and modules. So this is where you have a team of folks putting in things like common libraries and common services that have reusability, whether they're in-house applications, customer applications, or someone else's bespoke application. This is where we look to draw on a lot of the community investment and intellect in how we deploy what we call our AI library: a series of common analytics that folks can then use in their applications. So think of things like anomaly detection or flake analysis — anything a CI developer or an application developer can work into their CI workflow to help make them more productive, help them get to the root cause of failures quicker, make their code quality better, and help them iterate over releases of that software faster.

And then the third layer, at the top, is basically where the custom development comes in. This is where communities, businesses, and data scientists come in, take advantage of those shared services, write them into applications or chain them together in such a way that they're getting added value out of them. In that environment, that's also where they may need to do their own data science experiments.
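To give a flavor of what one of those commodity analytics might look like, here is a toy z-score anomaly detector in plain Python. This is illustrative only — it is not the AI library's actual API, just the kind of reusable building block being described.

```python
# Toy z-score anomaly detector: flags points far from the mean of a series.
# Illustrative only; not the actual AI library interface.

def detect_anomalies(values, threshold=3.0):
    """Return indices of points more than `threshold` std-devs from the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    if std == 0:
        return []  # constant series: nothing can be anomalous
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# A CI-style series of build durations with one obvious outlier:
durations = [12.1, 11.9, 12.3, 12.0, 95.0, 12.2, 11.8]
print(detect_anomalies(durations, threshold=2.0))
```

An application developer would call something like this from their CI workflow, feeding in build times or test durations and alerting on whatever indices come back.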
So we can give you something common, like a commodity — a flake analysis, a correlation analysis, things like that that are pretty well-defined — but a custom-trained model, whether for fraud detection or for natural language processing or for image detection, is not something we can actually give you at the commodity level. That's something you may want to train specific to your data. And that's where what we want to do is enable that tool chain and that data science workflow: for users to be able to come in, easily get access to the environment, put data into the environment, analyze their data, iterate on that, and then publish that information out — whether it's a stored model that some application is then surfacing further downstream, or whether it's actually getting bundled up into another service with an endpoint that other users can call into. That entire model life cycle management is something we're looking to enable here.

And now it updates. Yeah, it should be coming. The bits take a little while, I guess, to get from back there to up here. There we go. So this is the concept behind the Open Data Hub in the Massachusetts Open Cloud, and the Open Data Hub project in general. What it's designed to do is basically give you capabilities when it comes to data ingestion, normalization, and storage; data exploration, around reporting and analysis; and then analytics and life cycle management, around data science experimentation, publishing that into services, and managing the workflow therein. And so the Open Data Hub is basically a meta-project around bringing together the technologies that would comprise this platform. When you're out there in the industry, or when you're talking with folks, a lot of the technologies used to enable these types of capabilities are pretty common. Everywhere you go, you hear things like Kafka. You hear things like Jenkins.
Those things have become commodities, and what we're trying to do is take away the pain of individual users or companies who want to stand this up having to host that infrastructure themselves. It's all going to be there in a self-service model: you come in and you take advantage of it. And so what we're bringing together is a set of communities, vendors, users, operators, and academics to do that in a fully open source way, where we're getting the benefit of everyone's intellect and everyone's experience. Users know what they want to do to actually use the system. Operators know what it takes to really operate the system. Red Hat traditionally has been really, really great in the open source world at: we know how to write code, we know how to package code, and we know how to ship that code. This is an evolution for Red Hat: how are we going to open source operations, and how are we going to open source the data management life cycle entirely? And I'll hit next on that.

And the focus here is on reproducibility, right? That's why we're using open source projects feeding into this. The Open Data Hub is not writing a new Kafka. We're not writing a new service broker. We're taking those and putting them together in such a way that it leads to a more usable platform. Things are going to be preconfigured, so a user coming in doesn't have to worry about where their Spark instance is or where their data storage is. That's going to be in there. All you need to worry about is coming in and actually doing your analysis. The other thing we want to be able to do is let projects pick and choose which services they actually want to be encumbered with. Not everybody needs the full stack. You may have data that's already sitting out there, hosted somewhere, and you just need to bring that data in and analyze it — because maybe you don't buy the expensive GPUs, right? But the MOC has them. So that's great.
Take your data, run it in the MOC environment, take advantage of that GPU horsepower, and then move on to whatever the next thing in your value chain is. So you can come in at any layer of the stack you wish. And of all of those things we just talked about, the first use case we have targeted here — assuming it shows up; it's a brilliant use case — is basically around data science experimentation. This is an early-adopter environment right now, but it's geared toward the data scientist who has data, wants to come in and analyze that data, needs access to Spark or TensorFlow, and is comfortable with Jupyter notebooks. So what we've enabled is Ceph for our storage with S3, Apache Spark for how we do data management, TensorFlow, and then a series of Jupyter notebooks out there to take advantage of these capabilities.

And so, enough talking, right? You actually want to see the thing work and how easy this really is. So that's what we'll do next. Let me put this up here. So — the same setup that we showed there. Yeah, I'm kind of worried; we'll see how well this does. All right. So in the MOC environment — I'm going to assume that where this picks up is at the point where you have already gotten a login from the MOC team and you can actually get into the system. So there's a request process for that: you get a login and password and you can log in. I'm going to hit refresh on this because it's probably going to take a minute. All right, so I'm going to log in to the MOC environment. The first thing we're going to want to do is get some data up here, right? That's — oh, invalid. Oh, wrong one. We're just trying to prove to you that it's real. There we go. Now we have valid credentials. So the first thing we'll do is upload data, just so you can see how easy it is if you're bringing your own data to the table.
If you have data that's already hosted out there somewhere, we can always point to it. But for the purposes of this, we're going to add a new container — we'll call it test-0817 — and we will submit it. Great. So now we've got an S3 bucket. Everyone who's used Amazon or anything like that is familiar with what an S3 bucket is. All right, so the next thing we'll do is go verify I'm not lying to everyone here.

So, some things have already been done here. Again, I've already logged into the system. From that system, you'll obviously need access to your individual API keys in order to get access to those buckets; that's all taken care of. And again, because this is live-streaming, I'm not actually going to click and show you my API access keys, but understand that you click that button, it gives you your keys, and you can copy them and use them how you need to. The next thing you'll want to do is download the AWS S3 command-line client so you can actually upload your data from the command line. From there, you would run aws configure. Here, again, I've already pasted my keys in, but if you wanted to, you'd paste in your access key and hit enter, paste in your secret key and hit enter, and we'll just leave the region and output format at the defaults for now. That tells the command-line client on your machine, here's how I get access to AWS.

Now we're going to go in and do something with that bucket. All right, so first — and just so you guys don't have to watch me type over and over again, we're going to cut and paste these commands — we're going to go look and see that our bucket actually did get created out there. And sure enough, there we see our test bucket sitting up there. All right, so the next thing I want to do is upload my data. So in this directory here, I have this data file.
It's a JSON-formatted file — I think this is actually weather data. So we will upload that. And again, for those that aren't familiar with AWS commands, this is pretty basic stuff: you're uploading it to S3, you're telling it what bucket to put it in and, if you want, a subdirectory under that bucket, and then you're giving it the endpoint that you're uploading to. This kaizen endpoint — this is the MOC environment. This is S3 backed by Ceph. By Ceph, yep. All right, so there, we have uploaded it, and just to prove to you that it actually is uploaded, we will view it. So here we are, looking again at that test bucket, and we can see the data's been uploaded. If you have more data, it would take a little bit more time, but it's as simple as that to get the data into the environment itself.

So from here, this is where we'd go into JupyterHub and start to analyze some data. And I'm going to bring up a fresh one, just so you can see it from the get-go. Let me log out so you can see it from the start. Okay, so this is what it would look like when you first come to the Open Data Hub JupyterHub — oh, yep, question? No, it's the S3 protocol, using object storage in the MOC environment. So the storage is all in the MOC environment; it's just using the S3 protocol for the storage.

All right, so when you first come in, we'll click sign in. Again, it assumes we actually have access, and you get that credential at the same time you sign up saying I want access to the MOC — that's the same set of credentials that gets you into this project, all right? So by default, when we come in — let's run the Spark one first. Let me stop my server and start a new one, and we'll start with the Spark example that we have out here.
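Putting the upload flow just demonstrated into one place, here is a sketch of the CLI session. The endpoint URL, bucket name, and file name below are placeholders for illustration — substitute the values from your own MOC account and dashboard.

```shell
# Configure the CLI with the access key and secret key from the MOC dashboard;
# leave region and output format at the defaults.
aws configure

# Confirm the bucket created in the dashboard is visible. Every command points
# at the MOC's Ceph object gateway via --endpoint-url (URL is a placeholder).
aws s3 ls --endpoint-url https://s3.example.massopen.cloud

# Upload the JSON data file into the bucket, under a subdirectory.
aws s3 cp ./weather.json s3://test-0817/data/weather.json \
    --endpoint-url https://s3.example.massopen.cloud

# Verify the object landed.
aws s3 ls s3://test-0817/data/ --endpoint-url https://s3.example.massopen.cloud
```

Because the gateway speaks the standard S3 protocol, these are exactly the same commands you would run against Amazon S3, minus the `--endpoint-url` flag.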
So when you first come in, you should be presented with: hey, which notebook image do you want to actually run? And how many people are familiar with Jupyter? Everyone, I guess — fantastic. I'm not going to bore you with what these things are. So all of the notebook images you see listed here are things we're hosting out in the community. Obviously, as we need to grow and support more things, we'll add those types of notebooks and make them available. So we've started, again, with some simple ones around using Spark and TensorFlow, and there's a couple of more generic ones in there, I think, just using scikit-learn. And so this should spawn. It's pending. Let's go back to it. There we go. So now we're up and running.

All right. So we will open the Spark MOC notebook. Can you guys read this? Let me make this a little bigger. Is that better? Okay. So I'm not going to go into the details of what's in this notebook. This was written by another individual — he might be in the room — who's giving a talk later about analyzing time series data. So this is data monitoring Kubernetes operations on a running cluster, which then reports various statistics on those operations. I think he's giving a talk later today. But the important thing in here is basically that, for your Spark configuration, Spark is already up and running inside the MOC environment. You don't have to worry about hosting Spark yourself, standing it up yourself, and configuring it yourself. You can override configurations if you want more memory or you need something special, but by default, all you have to do is come in and take advantage of an environment variable that's been set up — and that is your Spark server. That's been done for you. Oh, there we do have keys — someone's going to have to edit that out of the recording. Sorry. And then basically you just point it off to the data, and that, again, is pre-configured, so we know where the MOC data is and we can execute against it.
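In code, the pre-wired part of that notebook amounts to something like the sketch below: the hub injects the Spark master URL and object-store endpoint into the pod's environment, and the notebook just reads them. The variable names (`SPARK_CLUSTER`, `S3_ENDPOINT_URL`) and service URLs here are assumptions for illustration — check the actual notebook image for the real ones.

```python
import os

# Build the Spark config from environment variables the hub pre-sets,
# so the notebook never hard-codes cluster or storage locations.
def build_spark_conf(env=None):
    env = os.environ if env is None else env
    return {
        # where the pre-provisioned Spark cluster lives
        "spark.master": env.get("SPARK_CLUSTER", "local[*]"),
        # point the S3A connector at the Ceph object gateway
        "spark.hadoop.fs.s3a.endpoint": env.get("S3_ENDPOINT_URL", ""),
        # Ceph RGW buckets are typically addressed path-style
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }

conf = build_spark_conf({
    "SPARK_CLUSTER": "spark://spark-cluster.odh.svc:7077",
    "S3_ENDPOINT_URL": "https://s3.example.massopen.cloud",
})
# Each key/value pair would then be passed to SparkSession.builder.config(...)
# before reading the uploaded data via an s3a:// path.
```

The point of the exercise is exactly what the demo shows: overriding is possible, but the default path is "read the environment, start analyzing."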
And so now we'll just run all of it. This takes about a minute or so to get all the way through. We'll give it a minute. So again, all this stuff is built in, because we're trying to ease the operation, right? I don't know how many of you have ever tried to stand up Spark, stand up all of your storage, get all of that configured, and then actually get to the value-add part, which is writing the notebook and running that notebook. This makes that process a lot faster. And again, if you had data already out there in Amazon S3, you can point to it — same set of keys, same set of access. And this is where we just wait and see how fast the machine runs. Do, do, do. It should take less than a minute. So I guess while that's going, any questions on anything you've seen thus far? It's a good segue. Not even one question to help me with the time this is taking? Yeah — I'm sorry, I can't hear you. Sorry, a little bit louder. Yes, yep — the RADOS Gateway service that you can stand up with Ceph. We have that. It's a pretty standard thing, yeah.

It worked. All right, thank you for the questions — it helped. Now everything has run. Here are the results from all of our analysis. Again, Subajit's going to talk about this later in the week. But basically, point is proved: Spark's out there running, data was processed, everything worked just as designed.

The next one we're going to talk about here is TensorFlow. Yeah, leave that page, that's fine. Stop my server. All right, now we'll restart it, and we'll do TensorFlow this go-round. And I believe — so the TensorFlow we're going to show right now is non-GPU, but we have GPUs in the environment and we can take advantage of those as well. But again, it's just as simple: if you're coming in doing a TensorFlow project, once this list shows up, we select the TensorFlow image and execute against that. It's normally a lot faster. Must be this room. So the experiments were going on at night.
There weren't that many students around — was that what it was? Yeah. Oh, yeah, as I say, we have this sort of relatively modest-scale environment right now; it's usually pretty quick to start. Do you want to talk about the scale we're rolling out to in the next while? That might not be a bad idea. Yeah, go for it.

So right now this is running on a limited number of nodes. A lot of the effort has been to just get this up and running and make it available. We've got a new environment coming up which will be a couple hundred nodes by the end, and this will be rolled out there. We'll make this available to a broader community when we have that larger scale. So this is just kind of a proof of concept in some sense — do you want to...? Yeah, exactly. So, and I'll put the slide up at the end — basically, right now, like that first slide said, we're kind of in the early-adopter phase, so it's not been opened up for the masses to just sign up and move forward with. What we'd like to do, if you have an interest, if your community has an interest — academia, whatever — is talk to you. Let's make sure it's a good fit from your perspective, from understanding what you're going to get out of the environment to start, and make sure we understand exactly what we're going to look for from you as an early adopter; then we can get you plugged into the environment and start working on it. It's out there and running, but right now we're gating access to it until, one, we make sure we have all of the right requirements for what most people are asking for. We made some assumptions based on the common use cases we've been presented with to date around data science and the type of work people want to see supported, and we want to make sure that's valid before we just say, hey, here it is.
The second part is we do want to put it on a larger-scale environment, knowing that once you open it up, you're going to get a lot of things coming in and hammering away on it, and the early-adopter environment is not designed for that level of interactivity — so that upgrade is going to be going on.

So again, the MOC is not intended to compete with Amazon or anything. What we want to do is be an environment where, first of all, we can support all these research uses and things like that, and secondly, where the open source community and the research community can actually work together on things. This is going to be the one platform like this where the information about how it runs can come back to all the different open source communities. So yes, we are building a charging model — right now we don't actually have that integrated — because in the end you've got to charge for things if you're going to open it up to a large population. And we will be opening it up. But we intend for it to be more for the open source community, the research community, and startups and such in the region. Our goal is not to be the competitor to Amazon or anything like that. Does that make sense?

All right, and so this is now loaded. So this is the MNIST sample you can download from TensorFlow. Again, no magic in the code — if you want to know what's in the code, I can give you the code and you can read through it. But here we will just do a run-all. This is going to go download some data, and it runs through pretty quick. Here at the end, we'll see some totals. Which one are we doing? Are we still running? Oh, downloading. Doot, doot — all right. So we're training models right now. And then it goes through pretty quick, and at the bottom — so here are the various runs through the neural network and the training accuracy. And then, voila, it's finished, right? So, access to TensorFlow.
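For reference, the shape of what that notebook runs through — train for a few passes, then report training accuracy — can be sketched without TensorFlow itself. The real demo uses TensorFlow's MNIST sample; as a stand-in, here is the same train-then-report cycle as a tiny logistic regression on synthetic 2-D data, in plain Python.

```python
import math
import random

# Stand-in for the MNIST notebook's train/report cycle (not the actual demo
# code): logistic regression by gradient descent on two synthetic blobs.
random.seed(0)
data = [((random.gauss(-2, 0.5), random.gauss(-2, 0.5)), 0) for _ in range(50)] + \
       [((random.gauss(2, 0.5), random.gauss(2, 0.5)), 1) for _ in range(50)]

w, b, lr = [0.0, 0.0], 0.0, 0.1
for epoch in range(20):                 # a few passes over the training data
    for (x1, x2), y in data:
        p = 1.0 / (1.0 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
        g = p - y                       # gradient of the log-loss
        w[0] -= lr * g * x1
        w[1] -= lr * g * x2
        b -= lr * g

# Report training accuracy, like the final cell of the notebook does.
correct = sum(1 for (x1, x2), y in data
              if (w[0] * x1 + w[1] * x2 + b > 0) == (y == 1))
print(f"training accuracy: {correct / len(data):.2f}")
```

Swap the toy data for MNIST and the loop for a TensorFlow graph, and the notebook's flow is the same — which is why the only environment-specific part of the demo is picking the TensorFlow image at spawn time.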
So ultimately — let's say somebody in, say, Alabama wanted to do this: is everything going to be open sourced and clear enough that somebody can stand this up?

Absolutely, yeah. Fantastic question. So we've seen that. You guys all believe me it works and it's real, right? So I will go back to this real quick — maybe present — there we go. Okay, so to your point, if you want to stand the same thing up: opendatahub.io. This is the upstream community where all of the APBs and operators to actually deploy this are being pushed. The use case that's up there today is the one we just demonstrated; you can take that and deploy it on OpenShift. That's going to continue to grow, and we're looking for people to help contribute to that and collaborate with us. There's a lot that goes into what it takes to actually run this at scale. So there's a lot of trial — we have it running internally at Red Hat; I think you saw Daniel talk about that. We have a specific scale we're operating at, and that's continuing to grow and change, and we're adapting to that. The MOC is a completely different scale compared to what we're running internally, and we're learning from that. And I'm sure there's lots and lots of experience elsewhere in the audience and out there that we want to take advantage of, to make sure that what we have is a truly hardened environment that can stand up to everyone poking at it. But then, yes, you can take those APBs, hit the go button, and it deploys.

And maybe a good thing to add: we've been working really closely with Red Hat on everything, and we've been finding a lot of problems as we stand up OpenStack, and OpenShift on OpenStack, and have users using this — because a lot of the time, the development community, the open source community, aren't in a position to operate things themselves at this kind of scale.
So I think the experience of doing that at those layers has led, now that the Data Hub is going forward — and that's a really important initiative — to, instead of having a decoupled effort, Red Hat and the MOC working together to make it one deployment and get a much tighter loop on that feedback of what has to change.

And if you're interested in being an early adopter, here's my information. Just contact me, and we'll start the conversation about what's going to be involved in the early-adopter cycle and get you access. One of the other things — and I didn't mean to gloss over it — is that, there again, the Data Hub is bringing together all the other open source communities we're actually taking these bits from. One of the Spark components actually comes from radanalytics.io. I think you guys are probably familiar with that, or have at least heard it mentioned a number of times, so for anyone sitting in those workshops this week, that's the exact same set of bits we're deploying up here. So we're consuming all the same stuff we're talking about this week.

All right, and that is all I have for slides and demo, so we finished perfectly in time for questions — if there are more questions. Do you have a mic? Sorry, I couldn't quite hear the question. Well, I mean, OpenStack is a virtualized environment with KVM — you run VMs on top of it. OpenShift is a Kubernetes environment, which runs containers, and we run OpenShift on top of OpenStack, and the Data Hub on top of OpenShift on top of OpenStack. On top of hardware. Any other questions? No? Great. Oh, yes. So the question is about the idea of actually building all of this inside my company.
By having this here, instead of Red Hat only gaining the experience of their internal users, they've gotten experience from a whole bunch of outside users using this in ways they didn't envision, which will be a lot better for companies later on deploying this, because the project will be that much more hardened. And I don't want to understate that point. I don't know how many folks in here work in operations or IT, where you've tried to stand one of these platforms-as-a-service up for your users and have gone through the growing pains that come along with that. The Data Hub that you just saw demoed is not at all what the Data Hub looked like a year and a half ago when we first started standing this up internally. There is a lot involved, a lot of learning that goes into it. So you certainly could just take it and do it, but it really does benefit from the masses of people contributing to it, to help harden it, make it more scalable, make it reliable, and start to work toward even that self-healing model. We're going to be deploying the AI library that Daniel referenced in here as well, so there's going to be a set of preconfigured models that you can just download, and they'll be available for you to start calling into. So all of that's going to be in here, and that all benefits from the open source community approach.

So, just to go back to what I said at the very beginning: open source — we've all been a part of that community, at least many of us have, for many, many years, right? But open source isn't enough anymore. The clouds are deploying a lot of open source software, but the learnings of how to deploy these things and how to operate these things, and the diversity of services, is something which is actually locked into these clouds.
And what we're trying to do is get to a model where we can actually start offering these things at scale — the open source community offering them at scale, hardened with real users — so that we can both replicate them to other regional data centers and even bring them back to the enterprise. Without that, we're going to lock ourselves into the big proprietary clouds. All right, well, thank you. Is this where you wanted me to end?