All right. Welcome, everybody. Can you hear me? No problem with that, right? Fantastic. So, I'm Steven Huels. I'm responsible for the AI Center of Excellence at Red Hat, and presenting with me today will be Michael Daitzman from the Massachusetts Open Cloud and Landon LaSmith, also from the AI CoE. We're going to be talking about data science in the open cloud.

So, to kick things off: the proprietary public clouds, AWS, GCE, Azure, all the things we've come to know and love, come with a lot of benefits. There's a lot of elasticity there. They're highly integrated. There's a lot of security built in, so you don't have to deal with it yourself. They change how you recognize and spend money, which helps a lot of companies with budget allocation. There's a lot of service abstraction, so you can plug things in wherever you want. There's a lot of operational excellence built into those systems; they make it easy for you. And a lot of it's built on open source.

But along with all that goodness, there are also negatives. First off, there's a lot of vertical lock-in. If you build against a lot of the native services in those environments, it's really, really tough to break your application away if you want to switch environments or move to something developed in-house. There's also a data gravity issue. Does everyone know what data gravity is? No? Whenever you're doing a lot of data analysis, it requires a lot of data, terabytes or petabytes that need to be loaded. Once you've loaded that data into a particular cloud environment, chances are you're not pulling it back out to run something locally or in another environment. They've got you locked in, because of the cost and time it would take to move it back out; it's not worth the investment, so you're going to stay. In fact, there was a saying, and I don't know if they still say it in the public cloud world: give us your data and we'll take all your business. Because once they have your data, they know you're going to build everything else around it.

And then from an open source perspective, those providers are users of open source, but a lot of them are not giving back. It's a one-way street, and that doesn't help those of us who are open source advocates. You're also dependent on their life cycle: at best, they'll roll out new services on a regular basis, but they also have the ability to kill services willy-nilly. If you decided to use a particular service, it might disappear in a month or two; there's no guarantee of consistency. And then there are a lot of black-box services. If you're taking advantage of various pre-deployed things out there, you don't know what's going on behind the scenes, and if you're accountable for any kind of audit or traceability in your processes, you don't have visibility into what's happening. So in essence, the public cloud has basically reinvented the mainframe: black-box services running in a vendor-controlled ecosystem, right down to the hardware.

So, Steven, you've convinced us: we're not going to do anything in public cloud anymore.
So we're going to do everything in private cloud. Fantastic. That's not always easy either. There's a lot of operational complexity in running a private cloud environment, and not all companies are large enough to have the in-house expertise to run and manage those environments. A lot of the private cloud components, especially the open source ones, don't necessarily give the best user experience. From an IT perspective it might be great: you can stand things up and give people the capability. But as far as having a nice interface for selecting components from a catalog, deploying them, and configuring integration points, a lot of that is done behind the scenes, and if you have developers who aren't familiar with what that configuration takes, it can be a real barrier to success. There's also a lack of diversity. If you host your own private cloud, you're probably not going to have the breadth that Amazon offers; you'll have whatever select set of components you can support and your developers need, so you won't get all the latest and greatest that's out there. And it's costly to support over time. That first iteration, we get it up and running, we're all excited, we get stuff out there. Then the next upgrade comes out, you have a thousand services running on your system, all interdependent, and if you have an outage your business can't run. So how do we manage that upgrade? Are you going to buy a second environment? There are costs and operational considerations to take into account.

So: public cloud, Steve, you've got me convinced I'm not going there. Private cloud, I'm nervous; I'm not rich enough to handle this. And there are all these OpenStack clouds out there that are generally available. Why don't we take advantage of those? From a distribution perspective, the OpenStack footprint far exceeds the regional footprint of any single public cloud provider; there are instances everywhere. That's fantastic. The downside: none of them is individually large enough to match the diversity of services you'll find in the public clouds. They're niche environments run for specific, targeted audiences and purposes. With divergent services and different base images between these systems, it's not easy to move things from environment to environment or get them to work together. And at the end of the day, not a single one of these providers has a mandate to actually solve this problem. They were stood up to solve individual problems, so they're not really incented to work together at scale.

So what if there were a different model we could go after? We'll call it the open cloud exchange model. What is an open cloud exchange? It's an alternative cloud model with many stakeholders: rather than a single provider, you have lots of different vendors, hardware vendors, software vendors, open source communities, all implementing and operating the cloud together. In doing so, you basically create a multi-sided marketplace. There are different hardware profiles you can select from, and different software components or services being made available.
And users can freely choose from that set of components, infrastructure, and compute resources to provide the services their users need.

So what's a common use case for something like this? Probably the most common is responding to seasonal or peak demands for hardware. If I'm a contributor to an open cloud exchange, I will have bought credits in and contributed hardware to handle my normal workloads, and then maybe a little bit extra. When I'm not using that extra capacity, I'm contributing it back to the public side of the exchange so other users can take advantage of my hardware, and in return I earn credits for the time I'm not using it. Then, when I hit my peak usage, maybe I'm seasonal, maybe I'm retail, I can over-consume beyond what I've actually contributed, spend those credits, and get access to that extra infrastructure. But when I don't need it, I'm not paying for all that added infrastructure: no idle time sitting there, no responsibility for maintenance and upkeep, no support contracts to pay for.

Another common example is wanting access to current technology without running private operations. Because hardware and software vendors contribute into this environment, they'll use it as a proving ground to get early feedback on the latest hardware accelerators and software components they're releasing. The community benefits because they get access to software they might not otherwise have had, maintained by the developers who are actually building it, and the developers get regular feedback from users of the environment and can continuously update it. It accelerates the overall feedback loop and development lifecycle for those components. You also get a rich and constantly evolving set of services: some long-term players will always remain in the environment, while others may plug in at various points in time, so there's an ever-evolving availability of services. And because the operational load is distributed across multiple providers, hardware vendors and software vendors alike, you're not bearing the maintenance burden as a single organization; those costs are distributed.

So at the end of the day, it's probably not all that crazy. The current clouds are extremely expensive; they're going to take all your stuff and make it really hard for you to leave. A lot of the industry can't even use public clouds due to regulatory requirements. There are lots of great options in open source software that can provide the same capabilities as many of those black-box services, but in an open manner. There are niche markets that need to be serviced but can't be, due to the price point, availability, and similar constraints of the other clouds. You avoid vendor lock-in, and you don't have to operate at the scale of an AWS for this to be a good idea; there's a point beyond which scale stops mattering. And so Michael's going to come up now and talk to you about an implementation of this exact model.
Thanks. In comedy, they say there are certain comedians it's never good to follow; in presentations, there are certain people it's never good to follow, because that was really good. So I want to talk a little bit about the Mass Open Cloud before I go much further. It's five partner universities, and if you take the entire student body, staff, and faculty of those five universities, it's roughly the size of the population of Iceland, so it's actually a pretty good size. And then we've got a bunch of other partners. The Air Force has been involved. Some of these are fairly familiar to you, I think: Intel, Lenovo, NetApp. All of these folks have at various times contributed to making the MOC real. Red Hat's been an amazing partner as well. Here are a couple more. All of these people are interested in it for the very reasons Steven just talked about: we don't really want there to be just three or four incumbent cloud providers out there. We want there to be diversity. We want there to be a bazaar, not just a cathedral. I say that looking out at the people in this room.

So what's the Mass Open Cloud? It's housed at the Mass Green High Performance Computing Center in Holyoke, Massachusetts, where the electricity and the land are inexpensive. There are big cables full of dark fiber running up the Mass Pike out to Holyoke, so there's a very high-speed connection into the Boston area, into the New York area, basically up and down the coast. We have an OpenStack environment we call Kaizen; of those roughly 2,500 cores, 1,088 are elastic, which means that if we need to, for example, share resources with a high-performance computing group, because their peak demand comes at a different time than our general cloud users', we can do that. And in fact, we do. It's also used for research. We're running OpenStack; we're running OpenShift. Here are some of the projects running on there: the Open Data Hub is one of them; ESI, the Elastic Secure Infrastructure; Keystone; and Ceph. Those are important because one of the things that happens when you stand up large data centers where people can do real, meaningful research and actually look at the underlying data is that you start getting benefits you wouldn't otherwise get. For example, ESI grew out of earlier projects of ours called HIL and BMI. It's moving upstream into Ironic and Ansible, so that everybody using Linux and the OpenStack components of it in the future will be able to do the same kind of bursting we're talking about. The way we've been trying to get this concept of the open cloud exchange out there is by standing it up and running it in our own data center while moving its components upstream, so they come back down to everybody.

There are some upcoming projects that are kind of cool. The New England Storage Exchange, NESE, is a 20-petabyte data lake, expected to grow to over 100 petabytes, that's also housed at the Mass Green High Performance Computing Center. What happens when your data is in the same computing center as your compute is that you don't have to pay to move the data around.
But the thing about the open cloud exchange that's kind of cool is this: say your data is in a data center in Holyoke, and you're a researcher in Toronto using the Dataverse Project, which started out of Harvard but has a bunch of smaller installations around; Harvard's is the largest. I'm looking straight at you; you should stand up and wave. The Dataverse Project. Then you could, for example, kick off an OCX job from Toronto that runs in the data center in Holyoke against data from, say, Boston Children's Hospital, and just pull back the data you care about. That's the kind of flexibility we're starting to experiment with.

So there's the connection to NESE. These are things I can talk about. We're working on a reference platform for Boston Children's Hospital, where they're going to stand up a HIPAA-compliant enclave using the same reference architecture we're using for the Mass Open Cloud. If you look on the right side, you see the 400 physical POWER9 cores. IBM donated a whole bunch of POWER9 systems to us; all but one of them, so nine, have half a terabyte of RAM, and the tenth has a full terabyte. Each has four GPUs connected at high speed through NVLink. That's becoming available to the data researchers within the Mass Open Cloud so that, again, they can work on the data sitting in NESE. MIT is getting, I think, 40 or 50 of them, and the plan is for a chunk of those machines to use our reference architecture for OpenShift on the POWER9s, with a plan of federating them over time, so that if the folks at MIT need a little more power they can borrow it, and if we need a little more power we can borrow it. It probably works out better for us than for them, but whatever.

And the Mass Open Cloud is available to researchers and students at any of the five partner universities. As long as it's being used for open source work or small startups, in other words, non-corporate at the moment, though corporate work on open source is fine, you can get an account. You can sign up using a university account, or a Google login, which if you're at Red Hat means Red Hat, and then you'll probably get a note from me if you're signing up from Red Hat or a non-university address asking what you're doing, and then we'll turn it on for you. The way you get that is onboarding.massopen.cloud/signup. We're currently decommissioning our old Kaizen cluster; I talked about the 2,500 cores, and we're going to be increasing that over the next four weeks as we decommission the old cloud and move the hardware over to the new one. So that's the Mass Open Cloud in a nutshell: we're trying to turn that OCX model into the real thing, and we hope you'll use it and enjoy it.

So, I'm Landon, and I'm here to talk about the Open Data Hub. The Open Data Hub is essentially a community for gathering best practices on deploying different AI and ML data tools on OpenShift. We're actively working to get it running in the Mass Open Cloud; I think we recently did a cluster upgrade, so we need to rebuild that. What we want to do is make it easy for you to run your data science and machine learning projects on OpenShift. I'm going to demo the different parts of it, but we want to make it as configurable as possible.
So whatever your preferred method for managing your data, analyzing your data, and running models on your data, we want to make that a reality. It's currently running in the MOC, or will be, but it's also available for public consumption: if you have your own cluster, or you just want to play around with it in whatever environment you're running, we make that available.

Some of the key things running on the Open Data Hub: we're using Ceph object storage to store the data. Internally, we're actually using Elasticsearch, Kafka streams, and Kibana to explore and analyze that data. And on the data science end, we have JupyterHub with Jupyter notebooks to interact with that data and run models on it. We want to make it as simple as possible for all the users of the Open Data Hub. If you're DevOps, it's really easy to deploy; I'll demo that. If you're a data engineer, you can easily get access to the data. And if you're a data scientist and don't care about any of that stuff, it's really easy to just deploy the tool and get started in your notebook. These are some of the components: right now in the Open Data Hub we have Jupyter, support for object storage, Seldon available to be deployed, Grafana and Prometheus for monitoring, and the Spark operator for deploying on-demand Spark clusters. We're working to integrate MLflow, Hue, Hive for storage, and more that I can't remember from the roadmap.

So I'll kick over to a demo. We have a vanilla cluster with the Open Data Hub operator. Can we make this bigger for everybody? The Open Data Hub operator is a meta-operator, which means it makes it easy to deploy other operators that actually deploy the tools you're comfortable with. We don't want to control the entirety of the stack; we leave it completely open to you what you want to deploy, and we stay as close to upstream as possible so that we're not running highly customized versions. You can mix and match different tools as you see fit: if you have a custom Spark cluster you want to use, you can disable the Spark deployment we provide with the Open Data Hub; if you have your own Jupyter server but want the monitoring and message passing from Strimzi Kafka, we can make that available. It's completely up to you how you want to use it.

I'm going to go through the normal workflow of deploying the Open Data Hub, and then point out at which points in the process a data scientist would actually be interested. We have a devconf demo project; it's a blank project, and the only things in it are the Strimzi Kafka operator and the Open Data Hub operator. Nothing else is running. The user I'm logged in as is a regular user, just a project admin, not cluster admin. To deploy the operator itself you'd have to be cluster admin, but that's one time per project. At this point, DevOps would enable this operator in a namespace, or they'd go through the next few steps to actually deploy it. We're going to do it the simple way, via the catalog. Since we've already deployed the operator into this namespace, we can go to installed operators and click on the Open Data Hub operator, and you'll get some quick information about it.
This is also available in any vanilla OpenShift cluster via OperatorHub. To deploy it, you click on create new, and, as you can hopefully see, this is just a simple YAML file that defines what we want to deploy as far as the Open Data Hub goes. At this point, this would probably be the responsibility of DevOps, but if you're an adventurous data scientist and want to configure this to your needs, this simple YAML file that the operator wants is what lets you customize it. The defaults are set up so that, out of the box, you can run a simple notebook; I think the default notebook memory is two or four gigabytes, which should let you play around with small data sets and get a feel for what you can do in the Data Hub. If we want to configure any of this: we have JupyterHub deploying here, the Spark operator deploying here, monitoring, which is Prometheus and Grafana, and Kafka enabled. We've disabled the deployment of Seldon, AI Library, and BeakerX, but if we want to deploy those too, we can just flip the flag from false to true, and if we want to disable anything, we set its deploy flag to false. So it's highly customizable, and disabling any component will not affect the other components. We make it completely modular so that you can pick and choose what you want and have different flavors of the Open Data Hub running in your environments. A rough sketch of what this YAML looks like follows below.
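To make that concrete, here's a minimal sketch of the kind of custom resource the operator consumes, reconstructed from what's shown in the demo. The apiVersion, component keys, and flag names are illustrative; check the Open Data Hub documentation for the exact schema of your operator version.

```yaml
# Illustrative Open Data Hub custom resource; names modeled on the demo,
# not guaranteed to match the exact schema of any given operator release.
apiVersion: opendatahub.io/v1alpha1
kind: OpenDataHub
metadata:
  name: example-opendatahub
spec:
  aicoe-jupyterhub:
    odh_deploy: true        # JupyterHub plus per-user notebook pods
    notebook_memory: 2Gi    # small default footprint for getting started
  spark-operator:
    odh_deploy: true        # on-demand, per-user Spark clusters
  monitoring:
    odh_deploy: true        # Prometheus and Grafana
  kafka:
    odh_deploy: true        # Strimzi-managed Kafka
  seldon:
    odh_deploy: false       # flip to true to deploy model serving
  ai-library:
    odh_deploy: false
  beakerx:
    odh_deploy: false
```

Because each component carries its own deploy flag, turning one off never affects the others; that per-component toggle is what makes the mix-and-match approach work.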
So we'll just go ahead and hit create. What's going to happen is that we've specified the kind of Open Data Hub we want to deploy, so the operator sees that and starts deploying based on the configuration we selected. In a few seconds you should start to see things deploying. Here we go: first we have JupyterHub and the database connected to JupyterHub deploying; pretty soon we should see the Spark operator, Prometheus, and Grafana. I'll show you the pods that are deployed. At this point the data scientist doesn't care about this; it's more for DevOps, who want to make sure the pods are deploying correctly. We've set up Prometheus and Grafana so you can actually view the metrics for those pods. Ceph is already deployed; we did that prior. The user has complete access to it: we've created the S3 credentials to access that object storage. But we are not directly tied to the local Ceph object storage; if you have an external Ceph cluster, you can connect to that, and if you have data stored on AWS, you can connect to that.

While other things are spinning up, primarily Prometheus, let's switch over to the data science view. As a data scientist, you don't care about any of this; you just want to pull up your environment and get at your data, so all you're concerned about is how to get to JupyterHub. What would normally happen is that DevOps does the onboarding of your environment, sets it up how you want, and sends you an email saying: here's the URL to access Jupyter. Based on your OpenShift credentials, you just log in and allow access to read your OpenShift OAuth information. We've customized this JupyterHub deployment to let a user, or a group of users, select the type of notebook image they want and the size of their notebook. If your cluster were GPU-enabled, you could change the number of GPUs you want access to via the web UI, plus any environment variables you want made available to your notebook pod. The key environment variables here are your S3 credentials for accessing the object storage; we also have extra environment variables, specifically for this workshop, to pull down the tutorial notebooks I'm about to show off, and if you want to add any additional environment variables, you can do that here.

So we're going to spawn this notebook pod. Part of this custom JupyterHub deployment for the Open Data Hub is that we also spawn a per-user Spark cluster, so if you have multiple users running Jupyter notebooks in the namespace, they'll each get their own Spark cluster. If your environment were set up with a mega, heavy-duty cluster that people want access to, they can use that too; we don't tie you to our cluster, meaning there's no hard requirement that you use it, and if you want to substitute your own Spark cluster, you can. The same goes for Prometheus and Grafana: you can disable ours and monitor with a cluster-wide Prometheus instead.

While that's spawning, let me show the pods; this is more of interest to DevOps. You can see there's a user Spark cluster here, one master and two workers, and the user's notebook pod here. We're still waiting for the JupyterHub notebook pod to come up, and once it does, we've already preloaded some data in there that we're going to use for the workshop. I'll go through a quick workshop notebook that demonstrates how you'd access the data in object storage, how you can run some Spark operations on it, and how you'd also, at the same time, pull information from a public Amazon S3 bucket.

Here's the standard JupyterHub interface; let me make this a little bigger. We used this notebook in a workshop on Wednesday to demonstrate the different parts. Some key things to point out: we're using the standard Boto3 library for accessing the S3 server, connecting to our Ceph cluster, and demonstrating how you'd create buckets and put data into them. I'll just hit run. At this point, we connect to our user-specific Spark cluster and run some operations on it. This whole workflow is the only thing a data scientist needs to be concerned with: DevOps deploys the Open Data Hub and then sends some notification saying, hey, this environment is up and running, here's the URL to JupyterHub, here's your username and password, have fun. I won't go into detail on every cell, since I didn't create the notebook, but what we're doing is telling Hadoop that we have an additional S3 endpoint we want to access data from. The first endpoint is actually a public S3 bucket and the other is Ceph, so we're getting data from the public bucket and storing it in our internal Open Data Hub S3 bucket. We queried the Spark cluster to see how many workers are available, and now we're just running actions on that data. Condensed into code, the whole flow looks roughly like the sketch below.
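For reference, here's a minimal sketch of that notebook flow, assuming the notebook image ships the hadoop-aws (s3a) connector and that the JupyterHub spawner injected the endpoint and credentials as environment variables. The bucket names, file paths, and the S3_ENDPOINT_URL/SPARK_CLUSTER variable names are hypothetical stand-ins, not the workshop's actual values.

```python
# Condensed sketch of the workshop notebook; names and paths are illustrative.
import os

import boto3
from pyspark.sql import SparkSession

# Connect to the in-cluster Ceph (S3-compatible) object store using the
# credentials the JupyterHub spawner exposed as environment variables.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],  # e.g. the Ceph RGW route
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
s3.create_bucket(Bucket="odh-demo")                          # hypothetical bucket
s3.upload_file("local-data.csv", "odh-demo", "raw/data.csv")

# Attach to the per-user Spark cluster that JupyterHub spawned for us.
spark = (
    SparkSession.builder
    .master(os.environ["SPARK_CLUSTER"])  # e.g. spark://spark-cluster-user1:7077
    .appName("odh-workshop")
    .getOrCreate()
)

# Tell Hadoop's s3a connector about both endpoints: a public Amazon bucket
# to read from, and the internal Ceph bucket to write to (per-bucket config).
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.bucket.public-source.endpoint", "https://s3.amazonaws.com")
# A truly anonymous public bucket may also need the anonymous credentials provider.
hconf.set("fs.s3a.bucket.odh-demo.endpoint", os.environ["S3_ENDPOINT_URL"])
hconf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
hconf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

# Pull data from the public bucket, run a simple Spark action, and
# store the result in the internal Open Data Hub bucket.
df = spark.read.csv("s3a://public-source/some-dataset.csv", header=True)
print(df.count())
df.write.parquet("s3a://odh-demo/processed/")
```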
Any questions about any of this? Let me know. Someone asked: can you comment on Minio? Yes. So, in the past: I think this is running on 4.1; on 3.11 we were actually using Ceph Nano, and prior to that we were using Minio. The underlying object storage server does not matter. We just care about the endpoint: as long as you have the credentials and the endpoint, it'll access it. All of that's abstracted away; we're not tying you to a specific technology, although Red Hat products are the best. You can use whatever you want. In another workshop, somebody asked: what if we have our own Spark cluster? There's nothing tying you to the Spark cluster we provide; you can completely disable it. The only thing you're concerned about is the URL to that Spark cluster. Here we're connecting to the user-specific one, but if there were an external one you wanted to send work to, you could do that here. Both of these answers boil down to a couple of lines of code, sketched after this Q&A.

Another question: say I'm a data scientist and I have hundreds of Jupyter notebooks; where is my interface for managing all of those? Is that JupyterHub itself? How do I keep track of all my work? Okay, so JupyterHub is just a server for managing multiple users. What you see here are the files specific to this user, backed by a persistent volume, so all the files are stored; what the volumes are backed by depends on the cluster, and internally we use Ceph, so all of our volumes are backed by Ceph. You do have access to those files outside of JupyterHub too. Let me show you what it looks like. Right now we're logged in as user one; just imagine I open another web browser and log in as user 25. They get the same screen, they can change options as they see fit, and then we just hit spawn. So JupyterHub allows multiple users to use the same Jupyter server; its primary responsibility is to take that information, spawn a notebook according to your specifications, and then let you do whatever you want with it. You'll see we have two notebook pods, one for user one and one for user 25. Internally we have one JupyterHub server, but 100-plus users spawning their own notebook servers. Did that answer your question? Any other questions?
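To underline both of those answers, here's a tiny sketch: swapping the object store or the Spark cluster really is just swapping a URL. Every endpoint below is a made-up example, not a real service name.

```python
# Swapping backends means swapping endpoints; the client code is unchanged.
import os

import boto3
from pyspark.sql import SparkSession

# The same boto3 call works whether the backend is Minio, Ceph Nano,
# a full Ceph cluster, or AWS S3 -- only endpoint_url changes.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.odh.svc:9000",  # or a Ceph route, or https://s3.amazonaws.com
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Likewise for Spark: pointing the notebook at your own external cluster
# is just a different master URL than the ODH-provisioned per-user one.
spark = (
    SparkSession.builder
    .master("spark://my-external-spark:7077")  # instead of the per-user cluster URL
    .appName("bring-your-own-spark")
    .getOrCreate()
)
```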
So I'm going to bring this home now and connect all these dots. We saw the different personas, so hopefully it's obvious why we think the open cloud exchange model is a very attractive way to run the Open Data Hub. From a software vendor perspective, we give users access and they give us feedback on the various components we have. We also have the operations side: we get feedback from the MOC folks on what it takes to operate these kinds of clusters at scale, and because we're running across a number of different hardware architectures, we're getting feedback on a lot more infrastructure than we would in our internal environments alone. And from a user perspective, the researchers, students, and communities get access to infrastructure and capabilities they otherwise might not have: expensive things like GPUs, FPGAs, that kind of environment, and large-scale storage. They don't have to worry about standing all that up; they're not bothered with the infrastructure. So again, it brings together all the core concepts of the open cloud exchange and realizes its benefits.

Last year at this conference we announced our early adopter (EA) program, and out of that program we had a couple of research institutions, we'll call them, and private researchers run through this exact environment and give us feedback that has now resulted in what Landon just showed you. From that, we learned some lessons. Around capacity planning for data science: we definitely need to plan that much more diligently up front when onboarding users, which has been accounted for in the new intake process. Around stability of the platform and services: we need to do large-scale validation before rolling things out to individual users, because they'll test things at a scale we may not have reached with our small-scale integration testing. And we need to make sure our priorities are aligned: what Red Hat is trying to get out of it, what the MOC is trying to get out of it, and what the researchers are trying to get out of it. We want everyone to get a win, because that keeps the communication flowing.

So what are we doing next? Landon talked about the components; all of this is out on GitLab. We're doing more around operationalizing, model serving, and creating public endpoints. The data science early adopters we had before were able to build their models, do their research, and generate their output papers. Now we're going to give them the ability, with the push of a button or a couple of lines of code, to publish an endpoint and host that model for inference, so they can send live data into it if they want an external application calling into the components they've built. And we also want to initiate the open cloud marketplace, where we'll have a number of components listed, the Open Data Hub being one of them, and users can come into the MOC and select which components they want to play with.

So you can try it yourselves; this is out there on the MOC. If you're interested in using it, we're looking for a second round of use cases to run through the new environment, building on what we've had and continuing this iterative development. If you're interested in the Open Data Hub, go ahead and join the community; that's what it's there for. If you want to be an early adopter, contact me, contact Michael, contact Landon, anyone on the Red Hat AI CoE team, and we can get you funneled in. Thank you for your time. Any questions?