Welcome, everyone. This is Daniel Izquierdo, and I am here today together with Diane Mueller. We are going to present "From Dark Art to Science: Community Development in a Data-Driven World." All of this research is the result of the last months of working together. Thank you for this opportunity to present at the Open Source Summit Europe. The research I was referring to is based on an IEEE Software paper that we released some months ago, in November or December 2019. As you can see, this work is based on our experience analyzing the CNCF ecosystem, OpenShift, and other projects related to the cloud ecosystem. We can see that there is a large number of interrelations, and we need to sync the way we are all working downstream and upstream. That is the main motivation of the article. So, hi. Welcome, and thank you for inviting us here today. I am Diane Mueller, director of community development over at Red Hat. I've been working with Daniel and his team at Bitergia for probably the past four or five years, using their tools on a number of open source projects that I collaborate on and help do community development for at Red Hat. If you know Red Hat, you've probably seen this slide before. It's really the crux of how Red Hat looks at technology development: open source is in our DNA, everything we do is open source, and we strongly feel that it is where all the innovation in technology is coming from these days. But these days there are also millions of projects to pay attention to. My primary focus, and probably a lot of yours as well, is based in the CNCF, which is the big point of intersection for a lot of these projects. I happen to work on a project called OpenShift at Red Hat, and its open source sibling, OKD. But there are literally hundreds of millions of GitHub repos, with new ones popping up every day, and millions of developers.
We have the data from 2018 here; obviously I have to update my slide sometime soon. But it's really one of those things where we're trying to figure out how to best connect, and stay connected, with the developers in the communities that we're part of. Because OKD and OpenShift are really a function of Kubernetes and all the other projects in the ecosystem, we have shifted, no pun intended, from being a single-project-focused community development effort to having to collaborate across all of these communities. And to do that is very difficult these days. There are so many projects, and so many people involved in those projects. So what we've done over the past almost four years is take a much more data-driven approach to looking at who's in our communities, how to connect with them, and how to stay engaged with them. And that's really the crux of the talk we're going to give today: focusing on the CNCF. There are other projects, OCI, Istio, Knative, the whole service mesh world, but for today we're really going to focus on the CNCF as our example. If you've looked at the landscape.cncf.io page, it's crazy. It's wonderful. It's an amazingly healthy, engaged landscape, with lots of partners, ISVs, upstream projects, all of these people. We need to keep track of how these projects and their roadmaps and releases all fit into the OKD and OpenShift space. To do that, you can't simply rely on traditional hallway conversations at conferences, which I miss greatly these days anyway, to stay informed. What we've tried to do is start to develop some strategies for continuous connection. We're all connected via Slack, via IRC still, Twitter, GitHub. If you don't have a notification popping up on your desktop every five minutes, or every 50 seconds depending on how many Slack channels you're in, you see the desire that everybody has to connect.
And with the world now in COVID and virtualized, we really have to figure out new ways of meeting our communities where they are, and to create and curate content to enable our customers, our end users, and our project leads to educate each other and connect with each other. So there's a lot of work going on in the background in trying to encourage positive engagement and get people to share, contribute, and give feedback. Really, that engagement is the key: how do we know who to engage with, who is engaging in our communities, and how do we use some new tools to automate some of the execution of the outreach and the connections? In our world, we do organization-based membership to speed up the number of people who can be in the OpenShift community, which is called OpenShift Commons. And we do a lot more than we probably did in the past around relationship management and automating the workflows. The key really for me is trying to do all of this without dehumanizing the connections, to keep them healthy. But to keep them healthy and to do all of this, we need the data. And that's where the work we've done with Bitergia over the past few years has really allowed us to apply a more data-driven approach to community development. So I'm going to let Daniel take this from here and tell us a little bit about how we're using the Bitergia tools to do that. Thank you. Thank you, Diane. I really like your point about not dehumanizing the work of the community manager and how we deal with the community. Because what we are trying to do here is to scale ourselves from 10 or 20 years ago, when communities were a couple of hundred developers, to communities like the CNCF, where single open source projects involve thousands of developers, right?
So with this kind of tooling we can bring this new data-driven approach: to look for those newcomers and help them during the onboarding process, and to understand which vendors are facilitating their trip, their journey, to be successful with a specific technology, right? The analysis that we've done is based on GrimoireLab technology. This is a project under CHAOSS, a Linux Foundation project whose acronym stands for Community Health Analytics for Open Source Software. All of this is, of course, 100% open source software, so you can use it and let us know what you think. In general, what you can see on the left are the different data sources, some of them mentioned by Diane. We have Slack channels, but for development we also have git repositories, issues, and code review processes. It doesn't matter if you are using GitHub, GitLab, the Atlassian stack, or any other self-built infrastructure. It happens that there are typically from five to ten different pieces of infrastructure that we are all using for communication, development, et cetera. And all of these can be extracted, because all of these are publicly available. We leave a trace in the data sources any time we commit a piece of code or send an email. All of this can be extracted by Perceval, which is the tool you can see close to the data sources, and it is all stored in Elasticsearch. So the tool chain in this case uses Elasticsearch and a downstream version of Kibana. On top we have a tool which is key for this discussion: SortingHat. SortingHat deals with identities. We have specific policies, such as GDPR, that are important for different regions of the world, and since this is Open Source Summit Europe: SortingHat is GDPR-ready.
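To make the identity problem concrete: the same developer typically appears under different names and email addresses across git, GitHub, Slack, and mailing lists. The sketch below is a simplified illustration of the merging idea, not SortingHat's actual code or database schema, and the records are made-up toy data:

```python
class UnionFind:
    """Minimal union-find to group records that are transitively linked."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def unify(records):
    """Merge identity records that share an email or an exact full name.
    records: list of dicts with 'source', 'name', 'email' keys."""
    uf = UnionFind()
    seen_email, seen_name = {}, {}
    for i, r in enumerate(records):
        if r.get("email") in seen_email:
            uf.union(i, seen_email[r["email"]])
        elif r.get("email"):
            seen_email[r["email"]] = i
        if r.get("name") in seen_name:
            uf.union(i, seen_name[r["name"]])
        elif r.get("name"):
            seen_name[r["name"]] = i
    groups = {}
    for i, r in enumerate(records):
        groups.setdefault(uf.find(i), []).append(r)
    return list(groups.values())


# Toy records from three data sources; one person, three identities
records = [
    {"source": "git", "name": "Jane Doe", "email": "jane@example.com"},
    {"source": "slack", "name": "jdoe", "email": "jane@example.com"},
    {"source": "github", "name": "Jane Doe", "email": "jane@users.noreply.github.com"},
    {"source": "git", "name": "Bob", "email": "bob@example.com"},
]
merged = unify(records)
print(len(merged))  # Jane's three identities collapse into one group → 2
```

Real tooling adds affiliation history (who worked for which organization, when) and manual review on top of heuristics like these, which is why a separate, GDPR-compliant identities database matters.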
This means that SortingHat deals, in a separate database, with all of the identity and affiliation information from all of the developers participating in a given open source project. The use case we have for today is the CNCF. In this case, we have aggregated all of this information, and we can minimize it, we can remove developers, if they ask us to. So this is about having the right tools for a specific analysis of an open source project; in this case, open source tools to analyze open source communities. Then at the very end of the tool chain, you can see the browser and Kibiter. Those are the tools where we are, in this case, building a specific dashboard, and the analyses that you will see now are dashboards produced with GrimoireLab. So what we are trying to do is make sense of all of these traces that we have in the data sources, all of this information made available by the CNCF ecosystem, and then at the very end produce business value, either in a dashboard, or you can query Elasticsearch directly and produce your own Jupyter notebook. That depends on you, but the information is there. The important thing is having all of it centralized somewhere. The next slide, please. Yeah, and the use case for today is the Cloud Native Computing Foundation. These are the graduated and incubated projects as of when we did this analysis, which was like two or three months ago, so maybe there are some new ones now. I think a few other ones have graduated since then. But I think the key here is really taking a look at the connectedness in these projects. And from our perspective, every one of these projects is something that impacts OpenShift and Kubernetes, so staying aware of them and seeing the connections matters. So if you can explain a little bit about the connections and what we're seeing here, that would be great. Yeah. So each of the dots.
So we have these biggest stars in the middle, like Kubernetes and others. But you can probably also see small pink dots, right? Those pink dots are developers, and each of those developers has participated in one or more projects. We see that Kubernetes has the biggest number of developers connected to it. But then we can see some others. For instance, on the top we can see a project that is interconnected to Kubernetes by some developers in the middle. This means that there are different dots, different developers, working in both of those two projects. And then there is all of this interconnectedness that we can see right in the middle. That chaos is beautiful, because it means that there are hundreds of developers working here and there. They are working in Dragonfly, CNI, Helm, CoreDNS, Kubernetes, Prometheus, all around. This mess is the interesting part about open source, because it means they are all collaborating and working together. The mess is what makes open source so lovely. So if we look at some of the key projects here, the larger, more populated and active projects with more people in them: a few to point out are Prometheus, Argo, and gRPC, as well as Kubernetes. There are a number of them here. And then you might want to talk a little bit more about these highly interconnected ones, the chaos here. Yeah, so as I mentioned before, we can have developers working in more than one project. If they are working, for instance, in both Prometheus and Kubernetes, we'll see one pink dot connecting both projects. We've been doing this analysis for OpenShift as well.
And we saw that there are a lot of developers working, for instance, in the OpenShift project, but in Kubernetes as well. And of course it makes a lot of sense, because OpenShift is a distribution of Kubernetes with some extra vitamins, right? All of these interconnections take place because all of these projects are really, really interconnected. And this is another important thing to discuss today. I don't have the context knowledge of this discussion; I'm only bringing, let's say, the tooling and some skills in terms of producing all of this data. But the thing that I can see here is that we have some projects that are interrelated among themselves. As we can see, Open Policy Agent, CRI-O, Harbor, and Rook are kind of related to gRPC or CloudEvents, while right in the middle we have CNI, Dragonfly, Helm, and CoreDNS. This means that developers working in those four projects, CNI, Dragonfly, Helm, and CoreDNS, are working in at least more than one of them, and that's why these four projects are so close, right? And they are again in the middle of this interconnection because they are highly related to Kubernetes, Prometheus, and the projects that we can see on the top and at the bottom. On the other hand, we have projects that are less interconnected, such as Falco or Linkerd at the top of the slide. What happens with those projects is that there are not that many developers working both in that project and in other projects. From a sync perspective, and you will detail this a bit more, this tells me that their developers are not that aware of what's going on in the rest of the projects, at least from a technical perspective. From a data perspective, they might not be as highly connected to the other projects.
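Under the hood, the project-to-project links in a graph like this come down to counting developers active in more than one repository. A minimal sketch of that counting, with made-up toy data rather than real CNCF numbers:

```python
from collections import Counter
from itertools import combinations

# Toy data: developer -> projects they committed to (hypothetical contributors)
contributions = {
    "alice": {"Kubernetes", "Prometheus", "Helm"},
    "bob": {"Kubernetes", "CoreDNS"},
    "carol": {"Falco"},                      # isolated: one project only
    "dave": {"Prometheus", "Kubernetes"},
}

# Edge weight between two projects = number of developers they share
shared = Counter()
for dev, projects in contributions.items():
    for pair in combinations(sorted(projects), 2):
        shared[pair] += 1

# "Connector" developers are the ones active in more than one project
connectors = [d for d, p in contributions.items() if len(p) > 1]

print(shared[("Kubernetes", "Prometheus")])  # alice and dave → 2
print(sorted(connectors))                    # carol is not a connector
```

Projects with heavy shared edges cluster toward the middle of the visualization; projects like Falco here, with no shared developers, drift to the outskirts.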
And so that might be where we need to bridge some communication. It's not always true, and this is where having domain knowledge of the communities matters. Just having the data isn't really enough, and we'll talk a little bit about that later. But the thing I wanted to highlight here is that identifying some of the people who are the connectors between these projects has been really helpful for us from the OpenShift perspective. We add a layer on this when we do the analysis for OpenShift, so we can see where OpenShift developers, engineers, and participants in these communities are too. You can filter this by organization, change the colors; there are some really useful bits in here that have helped us a lot, especially, for example, when IBM acquired Red Hat and I all of a sudden had to figure out who the IBMers in our communities were too. So there are some really cool features that have really enabled us to connect with our communities better. And one of the things that I think is key here is this concept of betweenness centrality between projects. These people, these personas and developers, are the ones, from my perspective, who are bridging, who are able to say, you know, this release is going to impact this project, these features need to be in here. And that's where you start to see pull requests going back and forth between projects, and cross-linking. So you're trying to find those people who can help you understand the health and the level of engagement or maturity of a project. Maybe Falco, just because it's on the outskirts here and not interconnected, that doesn't mean that it's not healthy, or not mature, or not an engaged community. It just means this is where it needs to be connected. That's where domain knowledge comes in, so you've got to be a little careful about that as well. One of the things that we've done, once we start to identify who these developers and these people are, is to break them out by personas.
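The betweenness idea can be made concrete: betweenness centrality measures how often a node sits on shortest paths between other nodes, so a developer with high betweenness in the developer-project graph is exactly the kind of bridge described above. A minimal pure-Python sketch of Brandes' algorithm, with hypothetical node names (the real dashboards compute this at far larger scale):

```python
from collections import deque

def betweenness(graph):
    """Brandes' algorithm: betweenness centrality for an unweighted,
    undirected graph given as {node: set_of_neighbors}."""
    bc = {v: 0.0 for v in graph}
    for s in graph:
        stack, queue = [], deque([s])
        pred = {v: [] for v in graph}    # predecessors on shortest paths
        sigma = {v: 0 for v in graph}    # count of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        while queue:                     # BFS from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                     # back-propagate dependencies
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: c / 2 for v, c in bc.items()}  # undirected: halve double count

# Tiny demo: one developer bridging two otherwise unconnected projects
g = {
    "ProjectA": {"dev"},
    "ProjectB": {"dev"},
    "dev": {"ProjectA", "ProjectB"},
}
print(betweenness(g)["dev"])  # 1.0: dev is on the only ProjectA-ProjectB path
```

In this toy graph, "dev" gets the entire betweenness score and the two projects get none, which is the signature of the connector persona: remove that person, and the two communities lose their shortest communication path.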
And that's really helped us to untangle all of these community relationships. We only have a few minutes to do this talk, and this could probably be a much deeper talk, because you can look at this data historically, you can get notified when new people join, all kinds of really cool things. But we have started to look at tangential personas, people who are not connected into projects; connector personas; newcomers; project leads; and organizational personas, for when a new organization joins a community. We see this from our perspective: we have a very strong end-user community in Kubernetes, and with OpenShift as well, with almost 3,000 organizations deploying OpenShift. A lot of our end users are now actively participating in the upstream, so we're watching as they become engaged. That way we know where we can connect with them, get feedback from them, help them, coach them, and do all the stuff that we do with InnerSource Commons and CHAOSS and all the other foundations, to make sure that these personas are nurtured and engaged with as they want to be. So that's really been key. I know this is a very short talk, and we could go on all day about this. But having this data and being able to use these tools has really been essential for me from a community development point of view, and for Red Hat in some of the upstream coordination. Identifying people who are key connectors has been a godsend for us; if identifying people is the only thing you do with these tools, you're doing good. The other thing is that I think everybody has recognized that it's now more about cross-community work versus a single-project focus. Historically, a community manager would be really focused on getting people to contribute to their own project, and here we want our end users in the upstream.
We're in the upstream in all of these projects too, because we pull them into our products and our offerings and our distributions. The persona analysis really helps us explore the structure of projects, make sure that there's stability, that there are newcomers coming in, and that we're keeping them engaged. The healthy relationships really matter a lot here: making sure that people have the content and the educational material that they need, the documentation, and that they have the CI/CD and build processes they need to be effective in developing these projects and putting new innovations into them. As I said a couple of times, domain knowledge is imperative. There's probably a metaphor there about leading a horse to water: you can lead a horse to the data, but you can't make it drink, or whatever. You really have to understand the technology in order to take advantage of this, because otherwise you may draw the wrong conclusions. And as always, the data matters. That's really been one of the key things about working with Bitergia: being able to aggregate, use SortingHat, and get clean, curated data and great tools to work with. That's been amazing for us over the past few years. And people always ask what's next, now that IBM has acquired Red Hat. I'm hoping to get some IBM Watson tools and apply some predictive analysis here. I think that's going to be key, because for me, being a Canadian, I always throw in a hockey metaphor, and the idea is that all this historical analysis is good, but "skate to where the puck is going, not to where it's been" has really been the credo here. As we watch new projects enter these ecosystems, it really gets them on our radar as soon as they arrive, as soon as the sandbox projects come into the CNCF, and even earlier, as you can watch deeper into GitHub as new projects arise, even at pre-sandbox stage.
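To make the persona idea concrete, here is a toy classifier over contributor activity records. The persona names echo the ones discussed above, but the rules and thresholds are illustrative assumptions, not the definitions used in the actual dashboards:

```python
from datetime import date

def classify(profile, today=date(2020, 10, 26)):
    """Assign illustrative persona labels from a contributor's activity.
    profile: {'first_activity': date, 'commits': int, 'projects': set}
    Thresholds (90 days, 100 commits) are arbitrary examples."""
    personas = []
    if (today - profile["first_activity"]).days <= 90:
        personas.append("newcomer")          # recently joined the community
    if len(profile["projects"]) > 1:
        personas.append("connector")         # bridges multiple projects
    if profile["commits"] >= 100:
        personas.append("core contributor")  # sustained heavy activity
    return personas or ["tangential"]        # peripheral, drive-by activity

# Hypothetical contributor profiles
profiles = {
    "erin": {"first_activity": date(2020, 9, 15), "commits": 4,
             "projects": {"Kubernetes"}},
    "frank": {"first_activity": date(2017, 3, 1), "commits": 250,
              "projects": {"Kubernetes", "Prometheus"}},
    "grace": {"first_activity": date(2019, 1, 10), "commits": 2,
              "projects": {"Helm"}},
}
for name, p in profiles.items():
    print(name, classify(p))
```

In practice the labels would come from the enriched Elasticsearch indices rather than hand-written dictionaries, and changes over time (a tangential contributor becoming a connector, say) are what trigger the outreach described earlier.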
So it's been really very, very useful for us. Do you want to add a few more words as a conclusion, Daniel, and then we'll do some Q&A? Yeah. There is this sentence attributed to Lord Kelvin that says that without data, you are just another person with an opinion, or that if you cannot measure, you cannot improve, right? But on the other side of things, given the tools that we have nowadays, big data and so on, I would say that without an opinion, you are just another person with data. So we need to play with the balance on both sides. Great. Well, let's see if we can advance this slide one more time. So now we're on to the Q&A. We hope you'll all enjoy Open Source Summit Europe, and we can connect either in the chat rooms or in the Q&A here right after this talk. Here's how to get hold of us, and we're happy to take your questions now. Thank you.