Hello, welcome. Thanks for joining this session. It's a very large room, so let's get cozy on the open source fast track with Operate First. The general concept is about how to open up operations so that we can benefit from the same principles that made open source software great. So operations and open operations matter beyond just free hardware and free services, the free tier; it's all about collaboration and creating something better together. My name is Marcel Hild. I'm out of Germany, working in the office of the CTO at Red Hat in emerging tech. And my name is Oindrilla Chatterjee. I'm also working in emerging tech; I come from Boston, United States. So a quick glance at the agenda: first, Oindrilla will talk about how the OS-Climate community leverages Operate First and the Operate First community cloud. Then we'll talk about the general principles and tenets of the Operate First concept. And then finally we get into how you can leverage and participate in Operate First, because it's all about the community. And now I'll hand it over to you. Thank you, Marcel. So as I was preparing for this talk, I was inspired by my colleague and team lead, Erik Erlandson, who also works on OS-Climate, to actually go back to the corporate travel website and look at how much carbon was released into the atmosphere to bring me here to Dublin from Boston. And I figured it took some 1,566 pounds of CO2, and the same to go home. So I'm just curious: how many of you get metrics like this when you book travel, or are prompted with this little note? Okay, a couple of you. Cool. So details like this can really help a company make sound financial decisions, decisions that involve data science, financial modeling, and all of that stuff, which for the most part is not accounting for climate change. The OS-Climate community, the open source climate initiative, aims to address that.
So they basically use open collaboration techniques and tools to gain the scale that is needed to work across organizations to meet the goal of sustainability. In order to share the models and data which inform these tools, they established a global data commons, which consists of data, models, and metrics for this larger working group. So basically the OS-Climate initiative uses open source tools and techniques to develop the data, tools, and models which are required to meet climate goals. It has involvement from various financial institutions, large companies, organizations, everyone who's trying to meet their financial and sustainability goals. It's right now building a nonprofit, transparently governed public utility of climate data and analytics tools, using all the open source best practices. So as you can see here, first, there are the governance, licensing, and collaboration structures that enable groups to function as an open source community and actually work together. There's also a curated library of public and private data sources, all governed under that organization. And lastly, there's modeling, which also includes data science modeling to assess climate-related risk. So that brings us to the open aspect of this. In general, as most of you are familiar, the open approach to collaboration can be defined as anything which anybody can freely access, use, modify, and share for any purpose, in a way that preserves the provenance and the openness of the artifacts. We can then see that stopping at the source code level truncates a lot of the potential value that can be shared. So here is why the open source community requires and benefits from an open approach in community, governance, software, and operations.
So community: basically, empowering the community members in the OS-Climate organization, making them self-serviced and empowered to go ahead and do the same with other people, actually reduces and spreads the load of mentoring work, for example. And good open processes are also self-teaching, which basically means that if you know how to practice all of this in the open, you can also teach it to other community members. Then coming to governance: one of the biggest challenges of forming an open collaboration is not only how you bring people on board, but also how you show them how to get involved. How do you do that when they are looking at the community but not essentially engaging with the community? So governance can be a process for how to get things done, but also living proof to potential community members, showing them a known way to participate that actually works. Then coming to software: we have established this over and over again in lots of projects and communities; having a fully open operations and software environment increases the speed of developing for the OS-Climate community, and it also reduces risk and makes things more collaborative. And finally, open operations, which I'm going to shed some more light on in the next sections. Open operations essentially means having transparency in the environment, and also having full transparency in the processes of how one actually configures and deploys the various tools in a whole software application. And this is where the Operate First community especially comes in. So let's look at the OS-Climate community in terms of numbers and size. I just got this this morning: it's now 204 community members. There are 10 community projects, 49 repositories, and an ever-growing 18 public data sets, originating from different financial institutions and nonprofit organizations, on the basis of which a lot of the modeling and tools are being built.
And over the last six months this has grown to X, so you can imagine the scale at which it's growing. So, shedding some more light on why open operations is also needed, considering the scale of this community: the goals within the community were that we want all the deployments and all the operations to be actually configurable by the community members themselves. Just because of the nature of the community and the different organizations that are involved in it, we wanted this to be as self-serviced as possible. That means all kinds of changes should be fully in the open, with open reviews and discussions, and the artifacts should also be durable. The default solution could be to resort to a cloud provider like AWS or Azure, but you may not have the budget for it, or if you're looking to go completely the open source way, you might end up with a complex and quite tedious architecture, which can sometimes affect your growth. And for a large community like this, which also has these individual organizations and their data involved, in order to get faster to their goals it makes sense to leverage an existing community cloud which already has this infrastructure and tooling set up, and where the deployments and operations are done in the open. So you have clear visibility into what's happening and what's being deployed, and you also have a say in what you need for your application to grow. So that brings us to Operate First. With the Operate First initiative, like Marcel mentioned earlier, we take these open source principles of software and we actually apply them to operations. This initiative is all about operating software in an open cloud environment, which also includes using GitOps for running your clusters. The Operate First cloud runs a public instance of the Open Data Hub project, a project which integrates a variety of open source data science tooling, which was very useful for the OS-Climate community to begin with.
And using a community cloud means more than just having transparency into the environment. It also means having transparency into the operations involved in running the Open Data Hub project that the OS-Climate initiative uses. And there's ongoing work in accumulating useful knowledge, resources, and practices in the form of public issues and pull requests on how one actually configures and deploys all these tools on an existing environment, in this case an OpenShift environment. So how does that essentially work? Say an OS-Climate community member wants to change a deployment on a cluster. That can be easily done by opening a pull request on Git, which usually means modifying some kustomize overlays or some YAMLs. For example, in this pull request here (this one was actually opened by my team lead, Erik), he's just trying to upgrade a version of Trino. So it's a simple change on Git. You can see who proposed it, who merged it, who approved it. The main point is that it's on Git, and there's open discussion and review on each of these changes that are being proposed by community members. And there's an Argo CD deployment that detects all of these new changes and will automatically resync the deployment on the actual cluster. And all of this is being done with open source principles. So you're able to essentially scale out the ability of the community to make changes to things themselves, but you also have a review process, full durability of artifacts, and a history of things if you ever go back and look at it. So here's just a snapshot of the tooling that's available on Operate First by running the Open Data Hub project. It all fits together neatly like Lego blocks rather than being a jumble of things.
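To make the pull-request flow a bit more concrete: a version bump like the Trino one is often just a small edit to a kustomize overlay. The following is only an illustrative sketch; the file paths, image name, and tags are made up for this example and are not the actual OS-Climate repository contents:

```yaml
# Illustrative kustomize overlay (hypothetical paths and versions).
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base          # the shared Trino deployment definition
images:
  - name: trinodb/trino
    newTag: "403"       # the PR would change only this tag, e.g. 398 -> 403
```

A reviewer approves the PR, it gets merged, and Argo CD then notices the drift between Git and the cluster and resyncs the deployment.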
You have all of the different tooling around data storage, visualization tools, SQL databases, ML pipelines, data storage engines, and cloud orchestration platforms, all neatly fitting in, and all being leveraged by the open source climate initiative. And with that I will hand it over to Marcel, who will go into detail on the Operate First initiative. Okay, thank you. Check, do you hear me? Good. So, going from that very specific example of a large community (and you saw its size, and that it's ever-growing), you might think: I don't use JupyterHub, what is JupyterHub, and how do I pronounce Trino? Is it Trino? I don't care. But taking a step back, what Operate First means is opening things up to a community to contribute and make something better, right? So you will find an environment that is fully inspectable and fully dissectable by everybody. Think of what made open source great, what the openness in open source means. You have this funnel of contributors: you have, say, 100 people using a project, and then maybe 20 people start having problems with that project and open up an issue. And maybe out of these 20, 10 start working on that issue, and one eventually contributes a fix. So there's this huge funnel, and even the 90% of people that are just looking at the project are also taking benefits from it; they are learning from the stuff that others were doing. For creating software, it's quite common these days that you just go on Google, then you end up on Stack Overflow, and these days you don't even need Stack Overflow because you have the AI built into your code editor, giving you the examples right in your editor. So you're already building on top of the collective knowledge of all the developers before you. How's that for operations? How's that for creating larger deployments? Sure, you can look at a Hello World example of how you would deploy Trino or Grafana or whatever.
You can spin it up on your laptop, but you can't peek into an actual production-like setting, where you would learn how that's being set up, or where you would take some of the lessons learned there as inspiration on how to do it locally or in your own environment. So how do we open up this data, this knowledge, and these processes, which are by default proprietary because they contain user credentials, they contain logs, they contain metrics that you can't open up? As it stands, you're back to the drawing board: you have to learn on the job how to do operations, or you buy the Google SRE book and learn from the pros who have written down how to set up your own stuff. And that essentially means everybody's reinventing the wheel and everybody's starting from scratch. So in order to build such a community, you have to be inclusive to all personas. It's not just the developers that you want to address, it's not just the hardcore operators that you want to address. You need to be inclusive, and that means from a technical point of view but also from a professional point of view. So speak to the beginners as well as to the professionals. And let's see if we have all these folks here in the room. Who of you is in operations? Cool. So you're operating stuff on a daily basis. You're welcome to join the community cloud and contribute operations, or take inspiration from how we operate stuff. Who's a developer? Who's developing stuff? Good. Also a portion. So you need to run your development, your code, somewhere. You need to deploy it somewhere. Maybe you don't deploy it but run it locally on your machine. But maybe you want to peek into a production environment where you actually have a CI/CD pipeline and get some hands on it. So have your demos there. Who's just using open source software? Good. I am too.
I mean, we all use open source software, and it's easier to go to an environment where something is already deployed and try it out, after you've watched all the YouTube videos and all the fancy, glossy websites and now want to get a feeling for it. It's way easier to go to something that's running than to spin up your own containers and try it out. I know support is sometimes tedious, but support in essence is helping users, right? And the best help that you can give somebody is self-help, I guess. That's what makes Stack Overflow and GitHub so great these days: you can see that somebody already had the same issue. So if you file all these operational issues in an open way, you can help project support by exposing previous issues and incidents in a public manner, instead of someone reopening the same one because the other one was hidden and nobody could find it via a search. And today you don't find any public incidents published. Maybe you see a downtime of a service, but you don't necessarily see the nitty-gritty details of why that downtime happened. You might say, okay, that happened because of a power outage or network congestion, but was it actually this bit flip in a setting that caused it? And does that trace or that log file line up with the one that I'm seeing in my deployment? Probably not. Software architects: who's architecting larger environments and doing all the stuff on the drawing board? Great. So you learn from your previous engagements, from the previous architectures that you deployed. But looking into another environment where somebody actually deployed something, the way of thinking behind why they chose that architecture is usually hidden in customer contracts. So you get these example architectures, but why does that make sense for this customer? Was it because of infrastructure? Was it because of other requirements? You usually don't know.
And finally, since I have some background in, and some inherent love for, our AI overlords, I hope that at some point we also have models trained that will help us do all that stuff in an even better way and manage it with some artificial intelligence. That requires data, and we produce data in this environment. So you see, it's meant to be inclusive for everybody. When I later go through some of the examples of how to contribute, I think everybody has their niche to contribute to, to have fun and engage with this community. And it's not just for the sake of having free resources. So in essence we're trying to build a hybrid cloud environment with full visibility into the operations center. As I said previously, usually the barrier stops at the support or the documentation side of a cloud environment. You can use these services, but how does the back office look? What do the monitoring dashboards look like? What does the incident management look like? How do I set up alerting, etc.? That ought to all be transparent, so that you can take a sneak peek into it without signing up for something. Now, a quick tour of the environment. So that was the concept of Operate First; now you actually have to build something. Red Hat is internally and externally also doing deployments and services, managed services, under the Operate First concept. So we have some services that are open for contributions, but where we don't expose the logs, because they contain customer-centric data; you could still contribute to the source code of that service. As a matter of fact, the OpenShift Dedicated services have a lot of their runbooks and a lot of their alerts and tooling hosted freely on GitHub, but the actual implementation is still gated by a Red Hat employee badge and credentials to check in. With this community cloud we try to open it up even more. So that's a cloud environment which started at the MOC, the Mass Open Cloud, in Massachusetts.
It's in a Boston University data center, a larger cluster of 20 to 30 nodes running OpenShift. Actually, there are multiple clusters. We have another one, not at Hetzner anymore but at IONOS, in the EMEA region. And we also have some instances and clusters running on AWS: the OS-Climate clusters are running on AWS, where AWS, I think, donates the cluster resources and the Operate First community operates the clusters. And in the future we'll hopefully also have more from the educational sector, or IBM Cloud. My vision is to have a really hybrid, multi-geo, and multi-architecture cloud environment, one which is as diverse as possible also on the lower layer, the platform layer. Then, obviously, you need to run something on top of it. So in terms of workloads, there's Open Data Hub, which Oindrilla already mentioned; that's a data science platform. We have Project Thoth, another Red Hat project, doing software stack analysis in the data science space. There are community projects from the Java world, like Apicurio, an API tooling project (I don't know too much about Apicurio anyway), and Quarkus is hosting some of their stuff there. There's a Python package index. So there are community projects deploying their workloads on top of that cloud environment. We have management and automation in place: there's Advanced Cluster Management for deploying several clusters and managing them, there's Argo CD for GitOps, there's Prow for CI/CD, there are Tekton pipelines, there's Prometheus for monitoring, etc. We try to treat everything as a service and try to be as open as possible for deploying operators. So everybody who has a need for an operator is free to deploy it to the cloud, even beta or alpha versions of an operator; if it breaks there, that's better than breaking in a production environment. We also try to take stuff like Open Data Hub, which could be deployed just for a single tenant.
We try to open that up for the whole community, for everybody on the internet. So you can actually use Open Data Hub there with just a GitHub account. Making it really open like that is similar to the free tier of a public service. There are Kafka pipelines; there's Prometheus, which could be used as a service for monitoring your workloads instead of setting up your own Prometheus, etc. And last but not least, and I think this will be one of the main assets being produced in the future: operational data, like metrics from the past, logs from the past, incidents from the past. All that gold that we're already seeing in the software development world but not in the operational world is being produced there. Also blueprints and architectural decision records: how did a certain decision come to life? That's also documented there. And I'm also hoping for a shootout between several architectures. So you might have a monitoring solution implemented with Prometheus and another one implemented with Zabbix or something else; automation with Ansible or with Terraform; different ways of doing stuff. It shouldn't be just one way of doing it; you should understand the pros and cons of the certain ways of doing it. So let's get on the keyboard. Like I said, you basically only require a GitHub account, and then you can access all of the assets, all of the services, that I will show right now. So if you have a laptop, after this session click on the QR codes and log into a cluster. So this is the Operate First community cloud website. You go to the community cloud here in the upper right corner, and then you see a list of clusters that we maintain. It's actually not the full list; there are more. But here you click on the Smaug cluster; that's the larger one at Boston University. You see a naming scheme.
The North American clusters are named from European culture, the Tolkien stuff, and the EMEA ones from American culture, the Rick and Morty stuff. So clicking on Smaug, you will be greeted with this OpenShift login, and you choose Operate First as the single sign-on provider. It'll ask you for your GitHub authorization, and you're presented with the back end of an OpenShift container platform. So if you've never logged into the back end of OpenShift, here's a way to do it with a single click, without signing up for a trial plan and waiting until your cluster spins up, etc. Here you can look into it, and even better, you are not greeted with an empty cluster but with a cluster that actually has something running in it. So you can inspect the workloads and try to hack it. Maybe you'll even find some credentials that are not governed; hopefully not, but that's also our honeypot out there. So, next step: you looked at it, and now you also want to deploy something, or you want to ask a question. So you click on that get-support button, and it'll take you again to a GitHub repository where we have several issue templates. You can ask a question, you can create a feature request, and the fourth option here will take you to an onboarding issue for a cluster. You type in your project name, why you want to onboard, and what you intend to do there. We're actually also working on some cool bot features where this is picked up by a bot which automatically creates a pull request for you; it could then just be merged with a click or an approval from one of the operators, and you see how that namespace, that project, in that cluster came to life. Or you think: hey, wait a minute, I want to learn something. I don't just want a namespace where I can deploy stuff; how are you actually managing all these namespaces on that cluster?
You can go to the blueprints and architectural decision records link, which again takes you to a GitHub repository where we list all the decisions that we took in designing our environment, with a context and a problem statement; here, in the case of an Argo CD setup, where you want to manage a multitude of clusters and resources with one single instance. There's this app-of-apps structure that's also promoted by the Argo CD community, but here you actually see the rationale and the thinking behind choosing this architecture. So you see all the considered options that we had, and you see a decision outcome, with some drawings, some in more detail, some in less detail, but we try to document as much as possible. Or you might want to create a pull request yourself, because you want to get your hands dirty; you want to contribute something and learn not just at the strategy or architecture level, but at the doing level. So you click on the GitHub docs, which is basically the collection of all the runbooks and the operational documentation on how to reach the various services or how to contribute to services. Here you'll see how to get to Grafana, Open Data Hub, Argo CD, etc., but you'll also find sections on how to create namespaces, how to onboard onto a cluster, etc., with some live examples. I usually love to look for a PR that did something similar, and then take that PR and adjust it to my needs. So what you would want to do is create a PR to this GitOps repository that we have. That's the app-of-apps repository that contains basically the whole definition for all the clusters out there. So there's a single repository for a multitude of clusters, with a multitude of namespaces and a multitude of applications deployed. There's no need to use this repository; when you have a namespace, you can deploy anything that you like. But you might want to use a continuous deployment pipeline, Argo CD, for that. And this is the way to do it.
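As a rough sketch of what an app-of-apps entry can look like: a parent Argo CD Application points at a path in the GitOps repository, and that path in turn contains more Application manifests, one per component. The repository URL and paths below are invented for illustration; the real definitions live in the Operate First apps repository:

```yaml
# Hypothetical app-of-apps parent Application (illustrative names and paths).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: os-climate-apps        # the "app of apps" for one cluster
  namespace: argocd
spec:
  project: os-climate
  source:
    repoURL: https://github.com/example-org/apps   # the shared GitOps repo
    targetRevision: HEAD
    path: app-of-apps/overlays/osc-cl1             # per-cluster overlay
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: {}              # auto-resync when the Git state changes
```

Because every manifest under that path is itself an Application, merging one pull request can roll out a whole stack of components to a cluster.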
So you see we have ACM deployed here, and Argo CD itself is deployed there. And if I go into this Argo CD subdirectory, you see that we work with overlays. Here we are in the MOC infra overlay, and we'll see the app-of-apps for osc-cl1; that's cluster one for OS-Climate. So OS-Climate is being deployed from that same repository here. Looking a bit deeper, here's the path, and then you'll see this repository URL, which again is the apps repository, and then a path inside that repository where the definition of some of the components of the OS-Climate clusters happens. So you can already peek into how OS-Climate is setting something up, and how you might want to set up your own Argo CD infrastructure or structure if you have a similar use case. No need to use it, but use it as a live reference. So after somebody makes some changes here, you can go to the Argo CD instance. Again, it's live there; you can go right there, log in with a GitHub account, see how Argo CD deployed the CloudBeaver application, and just click around and get a feeling for that environment. Then PRs are merged. No, they're not merged by Prow, but the tests are run by Prow. Prow is the tooling for running the Kubernetes project's CI/CD infrastructure, and we're reusing that tooling in our own instance of Prow, where we manage our own repositories; but you're also free to use it on your repositories. So OS-Climate is in another organization, but they're also using the Prow instance. So here, instead of locking into some free services that are also out there for doing continuous integration, you can hook into this service and get a certain control over that service for doing CI/CD without actually investing in CI/CD. So you can start right at the level of your use case; in the case of OS-Climate, that's climate data and AI. After deploying something, you want to monitor it. You want to have some overview on dashboards of how your clusters are doing.
And again, grafana.operate-first.cloud is the public instance of a Grafana monitoring the multi-cluster environment. So you see some of the utilization here. Actually, on the left side you see that there's no data, so something is apparently broken; you might want to fix that if you like. And you also have an assortment of dashboards already deployed, which is also nice. Sometimes you look for a Grafana dashboard and the only thing that you get are some screenshots or some examples on Grafana Cloud; seeing it actually in production is always something more inspiring, something different. And last but not least, you might want to use the Open Data Hub. In my case here, I'm just logging into JupyterHub, I'm selecting an Elyra notebook image, and I spin up my own notebook. This is a pet project of mine where I'm analyzing sim racing telemetry data, and I'm using that environment for my own recreational purposes, playing around with some data science. And as I said, it's all heavily based on GitHub. That's not to say that it might not work with GitLab or some of the other social Git environments, but that was a choice, and that's where our community lives. We have 136 people who are members of this community; you get a little bit more access once you are a member. But becoming a member is also really straightforward: you open up a pull request to the community repository to add yourself as a member. So that also starts with a pull request. Let me just go here. Right. So there's the support repository, the apps repository, the community one, the blueprints, and a slew of other repositories to pick and choose from. And if you want to work on something, there's always this good-first-issue label that you see in a lot of open source communities; that's also applied here.
So if you want to get a feeling for the chores or the ops issues that people are working on, look at these issues. Probably you can't start right away on your own, but then you want to reach out to us, to the folks there on Slack. There's a Slack instance, and there's a support channel, and here you see an even higher number: there are 407 people that are part of this channel. So that gives you a feeling for the size. We're not that huge, but we're constantly growing. It's been about one and a half years now, and I think we're on a nice and steady rise. And here, it's just coincidence that Oindrilla and I were asking some questions recently. You see, you ask a question, people answer it, and you have a thread on solving that issue. Usually it ends up as a GitHub issue or a GitHub pull request, so how that came about is also documented for posterity. That's it. So go to operate-first.cloud, an open source community for making things better, and "things" has an asterisk: that's also a link where we define things, because I think it has so many facets that it doesn't fit into a snappy sentence. Any questions? I think we have five minutes. Yes. So the question is, since it was mentioned on one of the slides that there are AIOps features in that cloud, whether I'm optimistic about AIOps in the future. I am optimistic about AIOps in the future, but there are no... I mean, it always depends on how you define AIOps, right? For Open Data Hub we developed some notebooks and some tooling to predict the resource usage of a Jupyter notebook. Sometimes you do something with TensorFlow and you require 32 gigs of RAM, and sometimes you just do scikit-learn, where you require less RAM and only need a single core. So predicting that usage is some sort of AIOps. I have some bots integrated that might use NLP to talk to somebody on an issue to create a pull request. Is that also AIOps? I don't know.
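The notebook right-sizing idea can be sketched in a few lines. This is a hedged toy version, not the actual Open Data Hub tooling: assume peak memory per notebook session has already been pulled out of Prometheus, and we just want to map it to the smallest profile that fits, and see how much of the original request went unused. The profile names and limits are made up for illustration.

```python
# Toy right-sizing helper (illustrative only, not the real Open Data Hub code).
# Assumes peak memory per notebook session was already extracted from Prometheus.

PROFILES = {"small": 2, "medium": 8, "large": 16, "xlarge": 32}  # limits in GiB

def recommend_profile(peak_gib: float, headroom: float = 1.2) -> str:
    """Smallest profile whose memory limit covers peak usage plus headroom."""
    needed = peak_gib * headroom
    for name, limit in sorted(PROFILES.items(), key=lambda kv: kv[1]):
        if limit >= needed:
            return name
    return max(PROFILES, key=PROFILES.get)  # nothing fits: return the largest

def waste_ratio(requested_gib: float, peak_gib: float) -> float:
    """Fraction of the requested memory that sat unused at peak."""
    return max(0.0, 1.0 - peak_gib / requested_gib)

# A scikit-learn-style session that requested 32 GiB but peaked at 3 GiB:
print(recommend_profile(3.0))   # -> medium (3.0 * 1.2 = 3.6 fits under 8 GiB)
print(waste_ratio(32.0, 3.0))   # -> 0.90625, i.e. ~91% of the request unused
```

The real analysis would feed a year of Prometheus samples into the comparison, but the core logic, request versus observed peak, is this simple.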
But I think the fundamental thing, the foundation for every AI, is data, and here we're creating a lot of data to build future AIOps tooling on. Yeah, I think the suggestion was to use Prometheus for time series analysis. So for the JupyterHub analysis that Marcel talked about, we basically did this. I work on the more data-science side of things, so we were actually collecting Prometheus metrics from JupyterHub users over the past year or so, monitoring the difference between what they actually request on the cluster versus what they actually use. And that was used to offer more efficient cluster sizing on JupyterHub, like better t-shirt sizing, more accurate to what the actual usage looks like. So yeah, that's very aligned with the kind of projects we worked on. And in case you want to try out a Prometheus anomaly detector, that was actually one of our really early projects, which we did about three years ago. So there's some Python code and some container manifests on how to do anomaly detection with Prometheus. That's a very good question. So the question is: how would we handle the complexity and the myriad of solutions in the long run, when one thing is less maintained than another? I think it's about looking at the activity: how often somebody used something, how actively it is maintained, whether it stalled or whether it's still running. It's like biology: you have different attempts, and some attempts might die off and some attempts might grow. And I think the same happens if you look for solutions in the code space. You would find three, four projects, and then you look at how many stars a project has, how many contributors it has, when the last commit was. I guess the same would be true here. Right now we're not at that state yet, so most of the time we've chosen one solution, but I'm super open to having multiple solutions.
So for setting up a cluster, I think we now have four documented methods. Some are easier than others, and time will tell what's the best way of doing it, whether it's Terraform, Ansible, the assisted installer, or a click of a button in Advanced Cluster Management. Yeah, and sometimes it's also dictated by the community which is using it. Let's say for the OS-Climate community, we have been moving through different options for ML pipelining solutions; we tried out some things that did not work, like Kubeflow Pipelines and Airflow. So we are exploring different solutions, and we work very closely with the Operate First community to see what can work, what can be a new addition, and so on and so forth. Cool. Before you leave, there are some stickers up here. It's easier to take one than to take a photograph; stick it on your laptop. And we're trending on Google: if you type in "operate first", we're the first hits. Thank you.