We're pleased to be here today at the OpenShift Commons Gathering, and the topic today is data science. I'm Diane Feddema from Red Hat; I work on the AI services team. I'm here with David Kanter and Peter Mattson. David is the executive director of MLCommons, and Peter is the president of MLCommons, general chair of MLPerf, and a staff engineer at Google. Peter and David recently launched MLCommons, and we invited them to provide some background and history on MLCommons and MLPerf. And I want to say that Red Hat is really excited to be one of the founding members of MLCommons. So to get us started, tell us a little bit about your backgrounds and some of the work that you do.

I'm Peter Mattson. I run ML metrics for Google, and I'm interested in measuring all things about ML. Before that, I studied compilers at Stanford and worked with a startup called Stream Processors, and we did video for a while. So I've had lots of different opportunities to try to make complex things go fast, which turns out to be a perennial need. I decided to try to do that for ML, and also to try to make ML itself better. That's what pushed us forward with MLPerf and MLCommons. David?

Yeah, David Kanter. Pre-MLCommons, I spent a lot of time in computer architecture. I actually started a microprocessor company that was doing a fusion of compilers and hardware design to exploit more single-threaded performance. After that, I ended up consulting with a number of companies, one of which was Cerebras Systems, which is now, like Red Hat, a founding member of MLCommons, and that's how I got involved in this. I also have a bit of a background in benchmarking, which came in handy and is part of the reason I got involved. It's very exciting to be able to build this kind of open community, and we really do appreciate the role that Red Hat is playing. Thank you.
So, I don't know how many of you are aware that MLCommons originated in MLPerf. What led you to start MLPerf, Peter? What were its goals, and how did it evolve into MLCommons?

Sure. About three years ago, we were looking around at ML, and in particular at ML hardware, at Google, trying to understand how fast the different options were. We decided that we really needed a good ML performance benchmark, and there did not seem to be an industry-standard solution. So we rounded up the usual suspects: anyone we could find who had already done hard work in the area, folks like Greg Diamos from Baidu, who did DeepBench; the Stanford DAWNBench folks, Matei Zaharia and Peter Bailis; and the Fathom folks from Harvard. We got everybody in a room and put forth the challenge: should we try to come up with one benchmark everyone could use to measure training performance? Everyone thought that was a great idea. So we came up with a set of rules and brought in a bunch more folks from industry, players like NVIDIA and Intel, and startups like Cerebras, which is how David got pulled in, and the benchmark really took off. We had our first set of rules out in the middle of 2018, and results by the end of that year. We've had several rounds since then. 2019 was a big year of growth: we extended the benchmark to inference and to HPC. In 2020 we continued to expand, and we also started MLCommons. The driving function behind that was that we were looking for a home for MLPerf. We wanted to put it in a non-profit organization, but we wanted something that was both engineering-focused and ML-focused, open engineering and ML, and we couldn't find that particular combination. We could find large organizations like the Linux Foundation that are very focused on open engineering in general, and we could find some focused on ML in particular, like NeurIPS, but they were more event-oriented. So we decided to start one.
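To make the measurement concrete: a training performance benchmark of this kind typically reports how long a system takes to train a model to a target quality. Here is a minimal, self-contained sketch of that idea; the toy model, learning rate, and target are invented for illustration and are not part of any real MLPerf workload.

```python
import time

def time_to_target(train_step, evaluate, target_accuracy, max_epochs=100):
    """Run training epochs until the model reaches a target quality,
    and report the epochs and wall-clock time taken (a time-to-train metric)."""
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_step(epoch)
        if evaluate() >= target_accuracy:
            return epoch, time.perf_counter() - start
    raise RuntimeError("target accuracy not reached")

# Toy stand-in for a real model: fit w to minimize (w - 3)^2 by gradient descent.
state = {"w": 0.0}

def toy_train_step(epoch):
    grad = 2 * (state["w"] - 3.0)   # d/dw of (w - 3)^2
    state["w"] -= 0.1 * grad        # one gradient-descent update per "epoch"

def toy_evaluate():
    # Map closeness to the optimum into a 0..1 "accuracy" score.
    return 1.0 - abs(state["w"] - 3.0) / 3.0

epochs, seconds = time_to_target(toy_train_step, toy_evaluate, target_accuracy=0.99)
print(f"reached target in {epochs} epochs, {seconds:.4f}s")
```

The key design choice, which a benchmark's rules have to pin down, is that the clock stops at a fixed quality bar rather than after a fixed number of steps, so faster-but-sloppier training can't game the result.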
We wanted an organization whose reason for being was to come along and make ML better, and we put MLPerf into MLCommons. MLPerf is still very much growing. We also looked at the field of ML, and we feel it's a very young industry. It has a tremendous amount of maturing to do as a field. It needs the same things that drove the industrial revolution: great ways of measuring things; good raw materials, which in the case of ML means data; and good, standard ways of making things, a shift from doing things in your basement to assembly-line production of high quality. We wanted to see whether we could form an organization that would answer that call, provide those things, and really move the field forward.

Yeah, that's the driving motivation, and I think we ended up with three key pillars that we like to talk about. The first is benchmarks and metrics, which we've covered with MLPerf. The second is building large open datasets, which we think are another key ingredient toward really democratizing the technology. In the same way that open source has enabled and fundamentally transformed the craft of software, whether you see software as an art or as engineering, it's utterly unrecognizable compared to 30 years ago, the analogy is that data is the raw ingredient you need to start building machine learning. The more large, open datasets we have, the more folks are able to extend ML capabilities, use them in products, and extend those benefits to the whole world. The third pillar is best practices. I like to think of this as removing friction, or perhaps as the transition from sewing your own clothes to an abstracted assembly line where there's a real flow.
Today with ML there's a lot of friction, whether it's model portability or even just deploying a model. But if we want ML to become pervasive, we need to drive those sources of friction down, so that maybe in the future doing things with ML is almost as easy as grabbing a library off of GitHub, looking at the comments, and maybe asking some questions on Stack Overflow about gluing it together. That's a future we would love to move toward. And we are very fortunate that when we went out and started talking about this vision, it really resonated with a lot of companies. Red Hat is a founder; we've got about 39 companies that are founders and a total of over 60 members. Some of those members are individuals like myself, or academics associated with universities. So we've built a tremendously vibrant community focused on advancing innovation in machine learning and extending those benefits to all of society. And it's very much organized on the principles of open source: we're very open, and we like to move fast and iterate.

Okay, great. So are most of those members hardware companies? Can you give me a little bit of a breakdown there?

Sure. We absolutely have a lot of hardware companies; Peter named a few, like Intel and NVIDIA, as well as startups like Cerebras and so forth. But we have a number of cloud services companies and software companies as well. We really see this as a big tent where there are a lot of folks who can play. To name an example of a more purist software company, in some sense, VMware is involved. There are a number of ML software companies, and a lot of cloud providers who offer computing services in one fashion or another, as well as very ML-focused companies. There are also a couple of startups focused on replicating experiments and things like that that are engaged.
So it's a really lovely and diverse community, spread across all geos. This is both a blessing and a curse: for those of you in distributed organizations, you know the challenge of finding a meeting time that works for folks in Asia, folks in Europe, and folks in America, which is that there is no such time. But it's great to have such a diversity of participation.

Can you give me some examples of projects that are going on in MLCommons?

Absolutely. I'll start off with one or two. The MLPerf benchmarks are pretty well known, but one of the things we are doing is trying to grow the footprint and move into new areas that need attention in terms of ML. We started, as Peter mentioned, with training. I got involved and helped lead the inference benchmarks. Then we branched off to focus on mobile phones and ML in that context, and there are also efforts around the internet of things and tiny devices. That's one way we've been expanding with different projects on the metrics side.

Then one of the things that actually brought us together, you and me most literally, was MLCube, which is one of our best practices. It's a set of conventions around containerization that helps you abstract the machine learning away from all the other pieces of the infrastructure. I like to talk about this in terms of both portability and reproducibility. One example I give of how this can help: when I think about a game-changing innovation like BERT, it was first published as a paper by Google, and there's probably some code in TensorFlow. But if you wanted to wrangle that and try it with your customers, you might spend a month or two doing that.
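The containerization idea behind this can be sketched in a few lines: declare a task and its inputs/outputs abstractly, and let a small runner translate that into a concrete container invocation. This is only an illustration of the concept; the config keys, mount layout, and image name below are invented for this sketch and are not MLCube's actual file format or CLI.

```python
from pathlib import PurePosixPath

def build_container_command(image, task, bindings):
    """Translate an abstract task invocation into a concrete `docker run`
    command line, mounting each declared input/output into the container."""
    cmd = ["docker", "run", "--rm"]
    args = []
    for name, host_path in bindings.items():
        container_path = PurePosixPath("/mlcube") / name
        cmd += ["-v", f"{host_path}:{container_path}"]   # bind-mount host data
        args.append(f"--{name}={container_path}")        # pass container-side path
    return cmd + [image, task] + args

# Hypothetical example: fine-tune a packaged BERT task against local data.
cmd = build_container_command(
    image="example/bert-finetune:latest",   # invented image name
    task="train",
    bindings={"data": "/home/me/squad", "output": "/home/me/results"},
)
print(" ".join(cmd))
```

Because the model's dependencies live inside the image and the data paths are declared rather than hard-coded, the same task description can run unchanged on a laptop, an on-prem cluster, or a different cloud, which is the portability and reproducibility being described here.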
The vision is that maybe one day we can get that down to a day, or less, or maybe even hours, so that if you want to take an innovation from Amazon or Facebook and try it out on-premise or in a different cloud altogether, that becomes frictionless. And I remember one of the first things that brought you together with us was that you were working with some of our benchmarks and trying to get them running on Red Hat, and it was a bit of a struggle. So in some sense MLCube was born out of that need and desire. We also have some dataset projects, and I'll let Peter talk about those.

As David said, there are three big pillars for us: benchmarks, best practices, and datasets. I think in many ways datasets are the new code. They are the way you express what you want your machine learning product to do; the models are, in some sense, a lossy compilation of them. And one of the key kinds of datasets that really drives innovation in the field is public datasets. Think about what ImageNet has done for the field: it cost something on the order of $300,000 to build, and arguably it created modern machine learning. We can't build performance benchmarks without good datasets, and you can't do good academic research without a good dataset. Yet a lot of the datasets we have now, even the best for their task, were created haphazardly: an academic group needed something specific, created the dataset, and moved on. So there's a dataset out there, usually very modest in size compared to what industry actually uses, often under restrictive licensing terms, and it's not growing and evolving with the field. What we would really like to do with MLCommons is create a center of excellence for public datasets.
A group of people who are really excited about making sure there are good public datasets out there that grow and evolve to suit the needs of the field. That includes actual datasets. For instance, we just announced the People's Speech, which is, or soon will be, the largest publicly available speech dataset by an order of magnitude, and it includes a diverse range of languages, I think over 60, and a more diverse range of speakers than what's available now. We really want to push that forward, because if we get this right, it makes speech-to-text technology accessible globally. We're also looking at datasets for recommendation systems, which are an incredibly important industry, and at a framework for privacy-protecting medical datasets for accuracy validation, for people asking: will this model really work in clinical practice? We've got a wide range of projects, all around the central theme of making good public datasets.

Well, that is great. So if someone in the audience right now is really interested in getting involved in one of the areas you've discussed, where do you need contributors right now, and how could they go about getting on board and helping out?

Yeah, so first of all, like most open-source communities, we really love folks who show up. In fact, to give you an example of that: I originally showed up to one of our early meetings, at the Stanford Faculty Club I think, that was posted through a call on the comp.arch Usenet newsgroup. Eventually I did so much good work that I got punished and they made me executive director. Take that. We are an extremely open organization. So if you go to our website at mlcommons.org, there's a page about getting involved that lists all of our working groups.
We talked about three or four projects, but there are over ten different working groups, covering everything from low-power embedded benchmarks to logging to algorithms. Each of those working groups has chairs; Diane, you are one of the chairs for MLCube. So if you go to the page on MLCube, you'll see a bit about Diane and what the project focuses on. You can look through those, and we are open to individual members, and many of our projects are open source in nature. So you can stop by GitHub, sign the CLA, and if you see some bugs, we always love having those fixed. And like a lot of open-source communities, it's something where you get out as much as you put in; it's the potluck model. I think a number of folks have wandered in somewhat randomly and found that it really fits their interests. Some of the folks on the dataset side are just phenomenally passionate about speech, and this is a really wonderful thing that aligns with what they want to do. So we'd love to see more folks getting engaged, both from industry and academia. We have quite a few faculty already involved and we'd like more; we'd really like to maintain that balance, and to be a community that's really open and wants to push innovation forward.

Okay, fantastic. And if you want the links and things, go to mlcommons.org, is that right? Yup. Okay. I think it's a great group of people, a very friendly group, so I'm glad I joined it. Thank you so much for being here today and talking to us.

Thanks for having us. It's been great. Yeah, thanks for the invitation. And thanks for all of your contributions to the community as well; it's been a great partnership. Yeah, it's been a lot of fun. Thank you.