All right, we can get started. Hi, my name is Cali Dolfi, and I'm a data scientist in Red Hat's OSPO. I'm Sean Goggins, a co-maintainer of the Augur project and a member of CHAOSS. Yes, and this presentation is going to be about metrics models with the robust software of 8Knot and Augur. This is honestly kind of a celebratory presentation for us; we've been working for probably over a year now on getting a functional environment where both of these projects work together, and it's honestly made both projects better. And we're coming in here with the overall thesis of today, which Sean had brought in: looking at the state of CHAOSS and how to implement the metrics, not just coming up with what the metrics are. That's honestly where we started working together. I was coming in from a data science perspective, trying to see how I could get access to structured data so I could do analysis using the Python packages and environments that are native to data science education, and Sean brought in that structured data. So that transitions well. We look at Augur as a path to data science, and really, if you want to think about what Augur is: it has long had a really crappy interface, but it does data engineering really well. So you end up with mountains of data, and we've spent nearly six years working through all of the anomalies that occur when you're mining open source software data. We're talking about weird character sets included in messages; everything that can possibly go wrong with data goes wrong with data, routinely, in open source. So we've spent six years working on that, putting the data into a structured relational database, and then validating it against platform metadata.
And the result is that we have data we think we can trust, with a high degree of confidence that we've worked out all the bugs, and we have a way of identifying new ones as they occur, because believe it or not, data anomalies don't stop. You never find them all; there are always new ones. Somebody is always inventing a better problem. One of the key things we've done in the last year, working with Cali and James, who's also here, is scale up Augur. Previously Augur was pretty slow, and we've introduced a queuing architecture where we can literally make hundreds of requests at once. Instead of taking two weeks to populate a thousand repos, we can do that in about two hours now, and over the last six months we've created an Augur instance that has over a hundred thousand repos in it, with all of the data you can imagine you might want about a repo. Think about what that means: you need large-scale data for insight. If you ask ChatGPT to please write you an essay about Canada, what makes the result stunning? Why is ChatGPT, for example, so cool? Is it the algorithm? Have we advanced our machine learning algorithms that much? Or is it that we've created really good training data sets? One of the things we've done with Augur, by gathering data from the scientific, academic, corporate, non-profit, and non-governmental organization spheres, is create data sets that are curated to focus on what we know about open source projects in each of those spheres. In doing so, we leverage hundreds of thousands, or millions, of messages, curated by domain, so that we can compare the communication that occurs in a repo against all of these different domains distinctly. And now I'm going to pass it back to Cali. Yep, and this is where Project Aspen comes in; this is looking at it from the data science perspective.
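The scaling idea described above, issuing hundreds of collection requests at once instead of one at a time, can be sketched with Python's standard thread pool. This is a toy illustration, not Augur's actual queuing architecture; the `fetch_repo_data` task is a hypothetical stand-in for real collection work.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_repo_data(repo):
    # Stand-in for a real collection task (GitHub API calls, git log parsing, ...).
    return {"repo": repo, "status": "collected"}

def collect_all(repos, max_workers=100):
    # Run many collection tasks concurrently instead of one repo at a time;
    # a real deployment would use a persistent task queue with workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_repo_data, repos))

results = collect_all([f"org/repo-{i}" for i in range(1000)])
```

With I/O-bound work like API calls, the speedup scales roughly with the number of concurrent workers, which is why a thousand repos can drop from weeks to hours.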
And when we're looking at how Augur fits into this, that covers the first few phases of what you would call the data science workflow. You need to collect the data, you need to clean it, and it needs to be structured for you to actually be able to do the analysis; tried and true, that is where so much of the time and effort goes. So having that accessible has let us build this project, which has two components. 8Knot, which is the one we'll be talking about mainly today, is a cloud-native container deployment strategy. It uses the Augur database, with its validated backend data, and it allows data scientists and other data enthusiasts to use a Python-native data science toolchain. So if you're using pandas, if you're using any type of modeling, those can now be used to make these visualizations, and it connects to the data easily. This is a good point to shout out James Kunstle, who is the reason this cloud container platform is so functional and has gotten us to where we are today. The second portion is the repel portion of the project, and this is the open research side. Our current focus right now is on developer social network analysis of open source ecosystems, and if you want to talk to somebody about that, Drilla is here as well, and she's been doing a lot of work on that side of the project. So when we look at how Augur and 8Knot connect: as we were going through before, Augur is the backend of the 8Knot dashboard, and we're able to compartmentalize the different processes of the project. The data engineering is in one place, and all of the software engineering, the workers that turn that data into populated visualizations, is compartmentalized as well.
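The Python-native toolchain point above boils down to this: once Augur has landed the data in a relational database, ordinary pandas code can read it. A minimal sketch, using an in-memory SQLite database as a stand-in for Augur's PostgreSQL; the table and column names here are illustrative, not Augur's actual schema.

```python
import sqlite3
import pandas as pd

# Stand-in for Augur's PostgreSQL database; schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (repo_name TEXT, author TEXT, created_at TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [("8knot", "alice", "2023-01-01"), ("8knot", "bob", "2023-01-02")],
)

# With the data in a relational store, the whole pandas toolchain applies.
df = pd.read_sql("SELECT repo_name, author, created_at FROM commits", conn)
per_author = df.groupby("author").size()
```

From here, anything pandas can do, modeling, aggregation, plotting with Plotly, works without any bespoke data wrangling.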
So then the data scientists themselves only have to worry about doing the visualizations, and all of this works for any combination of repositories, as well as allowing for user inputs. We've been focusing a lot lately in CHAOSS on metrics models, and I wanted to make the brief point that we have the data available for all of these metrics models within Augur, but whereas GrimoireLab has built over 70 dashboards in Sigils, we have four. So we have an opportunity, with the data, to start building out more of these metrics models within 8Knot. Cali, am I doing the demo right? Okay, Sean doesn't know how to operate a computer. Is it here? Yeah, I want to switch to the browser. Sorry. Oh yeah, computers are very difficult; especially the latest Mac upgrade has made switching monitors difficult. If any of you have a Mac, you know what kind of pain they've instilled with their new upgrade. Okay, so this is just one of the dashboards. It's the overview dashboard in 8Knot for a series of projects: ARM Mbed, which is a real-time operating system. And let's say that I want to gather new data. I click the Augur login/signup button and I create a user, just like this. I'm going to put the mic down. Here you're prompted to give Augur permissions, you click the authorize button, and it takes you right back to 8Knot. Once I'm in 8Knot, it'll take a minute to refresh here. And once I'm here, and there's some navigation we need to work on, if I click "Sean", I get back to my profile, and I can click the repo tracker. In the repo tracker, I can create a new group name. For example, let's create a group called "Microsoft". I've added the Microsoft GitHub organization name, and I'm going to put it in the Microsoft group. I'm going to add them, and this will take a minute, because these are very large repositories.
And so while that's happening, I'm just going to go back over to 8Knot. What's happening in the background there is we're making a series of GitHub API calls. If the repositories I'm adding as my new user already exist in the database, I will instantly see that data. If the repositories are new, it will take between two hours and two days, depending on the size. I think Microsoft has tens of thousands of repos, so perhaps it'll take a week, but all of that data will eventually be added. Then the next person who comes along, creates a user, and adds Microsoft as an organization will be able to see that data instantly. Obviously we're going to work on tightening this integration, but the idea is that everybody who wants to see data that's already been gathered doesn't have to wait. Anything you'd like to add? No, we can transition over to more of the visualization side of things. Okay, so we've talked a lot about metrics models. Where metrics models come from is that OSPOs have analyses that they use. All the metrics models that exist inside of CHAOSS, and there are 10 released right now, have emerged from OSPOs who have collections of metrics that they want to see together. And these are eight, what I would characterize as new, metrics models that we'd like to formalize inside of CHAOSS, because they already exist in the 8Knot tool. Like existing metrics models, they're driven directly by OSPOs' needs. You can see there are lots of little OSPOs in the world, and OSPOs have innumerable needs, and those needs evolve. So there are new metrics models forever. There is no end to the metrics models; just as it's turtles all the way down, it's metrics models all the way down.
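The background data gathering described a moment ago starts by walking GitHub's paginated organization-repository listing. A minimal sketch of that pagination loop, with the HTTP layer injected as a function so it runs without network access; in practice `fetch_page` would call the GitHub REST API `/orgs/{org}/repos` endpoint with an auth token.

```python
def list_org_repos(org, fetch_page, per_page=100):
    """Collect every repository in an organization by walking paginated results.
    `fetch_page` is injected so this sketch runs offline; a real version would
    issue authenticated GitHub REST API requests."""
    repos, page = [], 1
    while True:
        batch = fetch_page(org, page, per_page)
        repos.extend(batch)
        if len(batch) < per_page:   # a short page means we reached the end
            return repos
        page += 1

# Fake API: pretend the org has 250 repos, served 100 per page.
def fake_fetch(org, page, per_page):
    start = (page - 1) * per_page
    return [f"{org}/repo-{i}" for i in range(start, min(start + per_page, 250))]

repos = list_org_repos("microsoft", fake_fetch)
```

For an organization with tens of thousands of repos, each discovered repo then fans out into its own collection tasks, which is where the two-hours-to-days estimate comes from.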
This is a listing of the metrics models that already exist in 8Knot, and you can see that we've created issues for each of them. We pointed those issues at each of the metrics models that already exist in CHAOSS and made a list for implementing them inside of 8Knot. This is one way any tool, I think, can help implement the metrics models that already exist: create an issue, get somebody to work on it. Yeah, that describes pretty well one side of what contributions to the 8Knot project look like. A contribution can be in the form of coming in with a well-thought-out idea for a metric or visualization. People love to focus on the technical side of things, but as someone who's made a lot of the visualizations that are in 8Knot, the time and energy that needs to be spent on the actual idea and the methodology around the metric is a lot more than the hour or two it takes to actually implement it in the Dash/Plotly platform, which, I don't know if I've mentioned yet, is the tool base we're using to build our dashboard. So I wanted to hop over to some of the different visualizations and two different things we've started to really focus on. One side of it is taking the more well-known metrics or visualizations a step farther, to allow for more insights. One that's become really popular is contributor growth by engagement. Here you can choose, for whatever community you're looking at, what amount of time you would describe as a contributor drifting away from the community or being away from it. And you can start to see a bit more, and I can just zoom in a little, what the consistency is around your active, drifting, and away contributor base. Your "away" group is always going to grow bigger.
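The active/drifting/away split described here can be sketched in a few lines of pandas, classifying each contributor by the date of their most recent contribution. The 6- and 12-month cutoffs below are assumptions for illustration; in 8Knot the window is chosen per community.

```python
import pandas as pd

# Classify contributors by how long they have been idle. Cutoffs are illustrative.
last_seen = pd.DataFrame({
    "contributor": ["alice", "bob", "carol"],
    "last_contribution": pd.to_datetime(["2023-05-01", "2022-10-01", "2021-01-01"]),
})
today = pd.Timestamp("2023-06-01")

def engagement(ts):
    months_idle = (today - ts).days / 30
    if months_idle <= 6:
        return "active"
    if months_idle <= 12:
        return "drifting"
    return "away"

last_seen["status"] = last_seen["last_contribution"].map(engagement)
```

Counting each status per month and plotting the three series over time gives exactly the stacked growth view discussed above, where the "away" band always widens.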
There are always going to be more contributors in total over time, but we found a lot of insight in the variance within that away and drifting population. So this was one example of taking a very commonly used visualization, contributor growth over time, and adding a new dimension to it, which we're easily able to do because of the Python tooling, being able to use things like pandas and Dash/Plotly. Another angle we've started to look at is the questions you are not able to answer directly. Probably everybody in this room has been asked: okay, what is the exact company breakdown within this project? And that is a question I don't think can really be answered directly in the general case. So another approach we've taken is to look at the same question from multiple different perspectives. Each of these visualizations looks at company affiliation, or specifically at email domains, from a different view, and each one has strengths, something it shows well, and places where it doesn't show the whole picture. The hope is that with a whole set of visualizations around the same question, you can actually get a holistic view and answer the question as well as possible. One point I would make is that we can't tell you the company, because we don't have that information; only the company knows who works for them and what their emails are. So unless you're using your company email domain, we can't determine that. However, anybody who deploys this toolkit and has a list of their employees, their emails, and their GitHub IDs could break this down further with their own custom dashboards, if you have that information, which we don't.
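The email-domain approximation just described can be sketched with pandas: pull the domain from each commit author's email and bucket personal mail providers as "unknown", which is exactly the limitation the speakers call out. The data and the list of personal domains below are illustrative.

```python
import pandas as pd

# Approximate a company breakdown from commit author emails (illustrative data).
commits = pd.DataFrame({
    "author_email": [
        "a@redhat.com", "b@redhat.com", "c@microsoft.com", "d@gmail.com",
    ],
})
PERSONAL = {"gmail.com", "outlook.com", "yahoo.com"}

domains = commits["author_email"].str.split("@").str[1]
# Personal mail providers tell us nothing about employers, so bucket them.
affiliation = domains.where(~domains.isin(PERSONAL), "unknown")
breakdown = affiliation.value_counts()
```

An organization that deploys the toolkit with its own employee roster could replace the domain lookup with an exact email-to-employer mapping, which is the forking scenario described next.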
Yeah, and that's one of the great things about how we hope 8Knot is used: we're not going to produce any visualization that has individually identifiable data, but if you have a specific company or project need where you would like to do that, you can easily fork this project and make visualizations that take in that individual, very specific data about your projects. It's all set up and templated well, so the effort around that is much smaller. And so let's go back here. Oh, that was a call to action. One thing I wanted to look into as an example was what a contribution looks like from a visualization standpoint and how the compartmentalization of the app plays into it, but I think we're getting up on time, and luckily we've covered most of it. So, yes. All right, thank you, everybody. Hello, how are you? Well, thank you. This is the first long trip I've done in a while, since the pandemic, so it's nice meeting you and seeing you all in person. Thank you for the presentation and all of this; it's been great. Now we change to the other piece of software that we have in CHAOSS, which is GrimoireLab. It's been quite interesting, because the last time I went to one of the CHAOSScons was in Dublin, in September of last year. Back there, we hadn't coordinated with the speakers, but we were all discussing collaboration. Today it happens that we are all talking about ecosystems. I don't know why, but my talk is about ecosystems as well. So, some quick updates on GrimoireLab. We are now fully supporting OpenSearch, and the technology is now ready for production on OpenSearch. You can use it now; it's been tested and retested, so it should be good to go. It's safe to use, let's say.
There are really cool features we can mention from this tandem. One is the alerting system, which is something people always wanted to have in GrimoireLab: some way of being sure that if something happens in my community, in my ecosystem, in my project, I get alerted somehow, maybe by email or something similar. Another is SQL queries: if you know OpenSearch, its native query format is JSON-based, which is a bit tricky sometimes. More people are used to dealing with SQL, so now you can query with that; this is coming to the technology as well. And the other one is anomaly detection, which is, again, thanks to the OpenSearch people. It's mostly built for tracking logs, but because we are talking about activity, if we think about the engineering processes, then perhaps we have a certain predictability in how we produce software in the software production chain. Having pointers to anomalies in that process is really interesting, because from an engineering perspective we are able to go to those pointers, look for the bottlenecks, look for issues, and try to solve them. And again, this is already in production, so that's good. Now that I see Ildiko around: the Open Infra Foundation is one of the very first to have this. More things. Most of the updates we've been having lately in GrimoireLab are in SortingHat. SortingHat is the centralized piece of software you can use to track all of your affiliation and identity information, and it is GDPR-ready. Basically, it produces more value the longer you use it, because, as came up in the conversation you had before, you need to keep updating affiliations, you need to keep updating identities; people move from one company to another, and there are new pieces of infrastructure that you are using, right?
So then all of this complexity is managed by SortingHat. And SortingHat now supports different backends, specifically GitDM and OpenInfra ID, and there is now support for writing more backends. So if you are using your own technology to track all of this information, you can write a piece of software that integrates it with SortingHat, and from there with the rest of the technology, basically GrimoireLab. SortingHat hierarchies: this is mostly used internally in large corporations. You belong to a company, then a department, then a sub-department, but perhaps you also have certain skills, as a Python developer or whatever. All of this can be tracked, so you can check activity by certain departments and so on. If we move to the open source world, that means you can track different companies, or perhaps hierarchies at a project level. So it's about dealing with hierarchies, whatever they are, right? And then multi-tenancy: basically, it's about having everything in one place, which is cheaper, while serving different customers. For the very last updates, you have the URL here for the latest release, 0.9, in case you are interested. What you see in the image is basically the architecture of everything. As was mentioned before, there are several data sources supported, 30-plus in this case, that are ingested by Perceval. This is stored in an Elasticsearch or OpenSearch database, and then it's enriched with certain pieces of information, such as affiliations or identities, and so on and so forth. Then there are many ways to consume the information. You can go for Jupyter notebooks, you can go for the out-of-the-box dashboards that exist in the platform, et cetera. So it's basically about data consumption, right? It happens that with GrimoireLab, you are closer to the business layer of software development analytics.
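The identity unification that SortingHat performs can be sketched as follows: several platform identities (a git email, a GitHub login, a Gerrit account, and so on) are merged into one profile that carries the person's current affiliation, so that raw activity can be attributed to people rather than accounts. This is a toy model; SortingHat's real data model and API are richer.

```python
from dataclasses import dataclass, field

# Toy model of SortingHat-style identity unification; illustrative only.
@dataclass
class Profile:
    name: str
    affiliation: str
    identities: set = field(default_factory=set)

def unify(identity_map, events):
    """Attribute raw activity events (keyed by platform identity) to profiles."""
    counts = {}
    for raw_id, _activity in events:
        profile = identity_map[raw_id]
        counts[profile.name] = counts.get(profile.name, 0) + 1
    return counts

jane = Profile("Jane", "ACME", {"jane@acme.com", "jdoe@github"})
id_map = {i: jane for i in jane.identities}
activity = [("jane@acme.com", "commit"), ("jdoe@github", "pull_request")]
per_person = unify(id_map, activity)
```

This is why the value compounds over time: every identity merge and affiliation update improves the attribution of all past and future activity at once.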
That means community, activity, people, performance, all of these layers. Then a list of supported data sources, nothing new; search for GrimoireLab and you will find all of these. Yep, more things. In case you are interested in references, we came from academia as well. We started all of this back in 2012, and we keep producing academic papers, so this is something you can use as a reference to look at GrimoireLab in a very detailed way. With this, and given our background, we can say that our goal is to produce high-quality, curated data sets, with all of these affiliations, discussions, and consistent reporting. Okay, some new things we are working on right now. It's about this discussion of metrics in action; it might be part of the discussion of metrics models, in reality. We are trying to formalize the discussion: can we have certain thoughts on ecosystem risk analysis? So what does it mean? We've been discussing ecosystems all afternoon, basically. I'm using this comment from Georg that we had; it's now on a piece of paper, and I think it's really good. Basically, if you think about the software production chain, fixing a bug in production is really, really, really expensive, right? The earlier you are able to detect a problem, the better. But what happens if we are able to predict certain issues even before they happen? I mean, if we are able to predict vulnerabilities using social analytics, which is what we have in mind now. There is specific academic research on this. This is not done by us; it is Microsoft Research, comparing what happens if you use specific community-related metrics versus the usual code smells you get from static analysis and so on.
So it happens that the results of this are things like: the more engineers working on a certain piece of code, the buggier it might be; whether senior developers remain in the community; areas of the code that have not been modified for a certain period of time. All of these together gave them a better prediction of future vulnerabilities at some point. So what if we go up to that point? I think that in CHAOSS we can play with this. We can at least test the waters and see: is this real? Can we do this? So part of the ecosystem risk analysis is this discussion, right? Can we bring more trust to open source projects by being part of all of this? And perhaps another question we can ask is whether we can have a full overview of the risk of my whole open source ecosystem at a really quick glance. Moving forward, this is of course based on the Risk working group; it is work in the Risk working group, specifically a blog post by Georg here and Luis from Bitergia, and the idea is whether we can use this to move forward. There is this URL that you can use, risk-analysis.bitergia.net; this is fake data, not the very beginning, the first panels, but basically you can go there. Ideally, and let me go there so I can show you this, it's here, okay. What you have is different metrics, right? You have a factor that lets you normalize information against other communities, other ecosystems, other projects. In this case, we analyzed the GrimoireLab community, and it has a risk factor of 6.9 out of 10. Then there are different metrics that you can keep in mind. For instance, the elephant factor: if you remember, the definition is about the distribution of work in the project across organizations. The higher the percentage of work done by one company, the riskier the project might be from a sustainability perspective.
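The elephant factor just defined is conventionally computed as the minimum number of organizations whose combined contributions cover 50% of the project's total. A small sketch, with made-up contribution counts:

```python
# Elephant factor: minimum number of organizations whose combined contributions
# cover 50% of all contributions. A factor of 1 means one company dominates,
# which is the sustainability risk described above. Data is made up.
def elephant_factor(contribs_by_org, threshold=0.5):
    total = sum(contribs_by_org.values())
    covered, companies = 0, 0
    for n in sorted(contribs_by_org.values(), reverse=True):
        covered += n
        companies += 1
        if covered >= threshold * total:
            return companies
    return companies

dominated = {"acme": 900, "foo": 50, "bar": 50}               # one org does 90%
balanced  = {"acme": 250, "foo": 250, "bar": 250, "baz": 250}  # evenly spread
```

The related pony factor mentioned next is the same computation over individual contributors instead of organizations.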
So then you can have different metrics around this: active organizations, active contributors, pony factor, new contributors, all related to the social analyses you can do to predict vulnerabilities at some point, right? Then ideally the next step, or the next question on the table, is: okay, we have the project, but what's next? What about the dependencies, right? That would be the next step. Again, we are making up data here; this is a mock-up and so on, but we would like to know your thoughts here. We have the project, but the project has dependencies, so okay, what is the risk of those dependencies? Similar analysis, same metrics. And at the very end, what you have is basically your whole open source ecosystem, whatever that means for you, because an ecosystem for an open source foundation is not the same as an ecosystem for a company, or an ecosystem for GrimoireLab, for instance, or for CHAOSS. Those are different definitions of ecosystem. So what if we are able to measure it and have a quick view of what this is? Again, this is fake data, so forget the names of the projects. But if we are able to have a quadrant of all of this information together, playing with two or three variables, then suddenly you have, in one quick view, what's going on from a risk perspective, right? What we are trying to produce here plays with two main metrics: the risk score, which is what you saw before, the 6.9, and here on the X axis, the percentage of your dependencies at risk. And then basically the rest of it is about placing the different projects in the quadrant. We can split this into four main areas, and the bigger the dot, the more dependencies you may have. It's about asking: which are the really high-risk projects or communities that are part of my ecosystem? In this case, you would go for this red one, right?
So this would be my projects that are at risk, plus with a high percentage of their dependencies at risk. Then we have the opposite, this green one: low risk in the project itself, and no real percentage of dependencies at risk. The other two quadrants are about trade-offs between those. So then the question here is: what does it mean for my specific ecosystem that these two projects are at risk? That depends a lot on what you have in production, how you're using that software, et cetera. So you need to balance this against whatever you have internally. It's not the same if you are a bank using an at-risk project in production, providing services to your customers, as if it's a local internal tool that you use at some point; the risk is way different, right? So it's about balancing this somehow. Yeah, and going back to the slides again, just a second. We can have this discussion after this. The point here is that we have projects, right? And then I enter into the discussion of the ecosystem. But if I were a company, a bank, an insurance company, whoever, I am perhaps mostly interested at the ecosystem level, because the ecosystem is what matters for me in terms of the critical software I'm using that comes from open source. You as a corporation, as an enterprise, have a certain procurement process for onboarding your vendors, right? That means analyzing their economic situation, sometimes even analyzing the background of their employees. So you're asking, as large corporations, many things of all of your vendors. But when we go to the discussion with open source, it happens that some companies are just taking the technology, just consuming the technology, and that company is not doing any kind of risk analysis on those open source projects.
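The quadrant placement described above reduces to two thresholds: one on the project's own risk score (the Y axis) and one on the percentage of its dependencies at risk (the X axis). A sketch, where the 5.0 and 50% cut points and the project data are illustrative assumptions:

```python
# Place each project in the risk quadrant: Y axis is the project's own risk
# score (0-10), X axis is the percentage of its dependencies at risk.
# The cut points and project data are illustrative, not from the real panels.
def quadrant(risk_score, pct_deps_at_risk, score_cut=5.0, deps_cut=50.0):
    high_self = risk_score >= score_cut
    high_deps = pct_deps_at_risk >= deps_cut
    if high_self and high_deps:
        return "red: project and dependencies at risk"
    if not high_self and not high_deps:
        return "green: low risk overall"
    if high_self:
        return "project at risk, dependencies OK"
    return "project OK, dependencies at risk"

projects = {"project-a": (6.9, 70.0), "project-b": (2.1, 10.0)}
placed = {name: quadrant(*xy) for name, xy in projects.items()}
```

In the actual view, dot size encodes the number of dependencies, so a large red dot is the first thing to investigate.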
So there is a gap between how demanding you are with your vendors and how this same company is not being consistent with the open source projects. So then, what if we had a certain way to analyze that risk for the projects we are directly consuming? It's part of this discussion here, right? Because with all of your vendors, you want them to be healthy, right? You want to grow together with them. So what happens with the open source projects? It's about having this conversation with those projects, because if we see that there are a couple of them, or maybe 200, at a certain risk, probably the next step is: okay, what do I do here? I probably need to talk to them and ask: what do you need? Because we had this discussion before: is this a matter of money? Do we need to put money on the table here to save them? So this is part of the discussion to have later, at the end of today. But at least we have pointers, because we are talking about the level of thousands of dependencies, and it's really hard to keep track of that. So we need a data-driven approach to focus, or at least make a good guess about what's going on in my open source ecosystem, and risk is one of those aspects that we should be measuring. And then we discussed the importance of the ecosystem, right? Which is why we are all talking today about the ecosystem. So we need to discuss what risk means for open source foundations; it's different from what risk means for an open source company, or for CHAOSS. From an open source foundation's perspective, I might be interested in making sure that my projects, the technology I'm producing, are at a certain quality level, so that others can consume them effectively. If we think about a company, then the company wants to avoid, at all costs, any kind of risk when things go to production.
So those are different ways of looking at risk, right? Different definitions of risk. Of course, there is also the importance of understanding your ecosystem and its participants, basically who is who. And going to the last one: what we've seen over the years is that the more interconnected the different projects in your ecosystem are, the more likely they are to be sustainable. So, going back to the question at the very beginning, a good metric for this interconnectedness might be one of those we can think about, which relates to what you were both discussing before about social networks. So, what do we have here, right? Yeah, and this is it for today. We can keep discussing. Thank you for your time. It's great to see you all again. Thank you. Thank you.