Today, I'll be talking about the open source project OpenDataology, which is an LF AI Sandbox project. At the last OSS, we introduced the project and gave an overview of what it's about, so I'll do a little bit of that again today. But today's talk will focus more on what we have done with the project, what we have noticed about dataset licensing in general, how we might fix it, and what the risks are: an overview of the landscape and what we can do as a community.

Before I go further, I want to give a quick introduction about myself. My name is Gopi Krishnan Rajbahadur, and I'm originally from India. I have experience both as a software developer and as a software researcher. I started my career as a programmer analyst, then worked as a software engineer, went to Canada for grad studies, and then started working with Huawei. So I come from both worlds. Before I start the talk, I was asked by my company to put up this disclaimer: all the views are mine, none of them are my company's. So there we go.

OpenDataology: this is the project, and this is the logo. I hope you can remember it, so that if you're interested in what I present today, you can come back and contribute or just work with us. I also want to give a quick shout-out to all the contributors before I start the talk.

Okay, why all of this? As we know, the availability of datasets is the single biggest reason why AI is where it is right now. Recently I read a ten-year retrospective paper on ImageNet which said, and I quote, "no data, no AI." That's also borne out by many of the articles that have been written and by the market size. But I call your attention to this figure here.
Andrew Ng, a very popular researcher in the AI space, proposed this graph, in which the performance of deep neural networks, which is essentially the reason AI is popular right now, is directly dependent on the amount of data available. So data is everything. How do we get this data? There are only a finite number of ways. We can buy the data from somewhere, we can create our own dataset, or we can use an existing publicly available dataset. And when using an existing publicly available dataset, we can either get one that is already nicely pre-curated, like ImageNet, or we can hit sources like Google or Flickr images or many other available sources and collect our own dataset.

However, similar to open source software, all of this data comes with a license. I've put some of the common datasets up there, like ImageNet and Cityscapes: everything has a license. The licenses may look different and may be in non-standard formats, but they are all licenses, and each one has rights and obligations associated with it. Rights are what one is allowed to do with the dataset; obligations are the things you have to do, if you do something with the dataset, so that you can continue to enjoy those rights. For example, and this is a more detailed view of the same thing, a right could be that you're allowed to distribute the data, while the matching obligation says that any distribution must be under the same license.

None of this is new. Everyone who has been using open source knows that licensing has been around for a long time, and people have an idea of how to do open source licensing well. For datasets, however, there is a unique set of challenges that makes it hard to directly apply many of the methods we know from open source.
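To make the rights-versus-obligations framing concrete, here is a tiny Python sketch; the class, the right names, and the obligation names are my own illustrations, not an OpenDataology schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetLicense:
    """Toy model of a dataset license as rights plus obligations.

    Field names are illustrative only, not OpenDataology's actual format.
    """
    name: str
    rights: set[str] = field(default_factory=set)       # what you may do
    obligations: set[str] = field(default_factory=set)  # what you must do in return

# e.g. a share-alike style license: you may distribute and modify,
# but redistribution must stay under the same license, with attribution
example = DatasetLicense(
    name="example-share-alike",
    rights={"distribute", "modify"},
    obligations={"redistribute-under-same-license", "attribute"},
)
print("distribute" in example.rights)  # is distribution permitted?
```

The point of modeling it this way is that a license only makes sense as the pair: exercising a right without discharging the corresponding obligation is exactly the compliance risk the talk is about.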
Some of the key challenges are provenance-related, lineage-related, and license-related. What do I mean by provenance-related challenges? Many of these datasets are available from multiple different sources. For example, you could download ImageNet from the PyTorch website, from the original source website, or from many other GitHub repositories, and each of them can slap on a license of its own, which means we need to identify the provenance of the dataset. Second, and this is one of the challenges unique to datasets, many of these datasets were created at a different time than they were published. For example, CIFAR-10 was first published in 2009, and its creators say they took their images from the 80 Million Tiny Images dataset. 80 Million Tiny Images in turn scraped its images from Google, roughly around 2006. So if we go and check Google's terms of use now, those are not the terms of use that applied to the data at the time it was collected. Determining which time range a license applies to is a very hard problem, because nobody clearly demarcates when they collected the dataset; they only say when they released it. The other problem is that many different copies of datasets exist, which always makes things harder.

When I talk about lineage, what I mean is this. Take CIFAR-10 again: it was derived, as I said, from 80 Million Tiny Images, which was in turn derived from many different sources. All of this constitutes the lineage, because the final license associated with CIFAR-10 should respect all the datasets it was built from. That means we need to know which data sources the dataset was gathered from. However, datasets being datasets, not everyone specifies where they got their data.
For example, take the most famous dataset in machine learning, ImageNet: their paper says they got the data from multiple search engines and multiple sources, without naming the exact sources, which makes this super hard. The other problem is the licenses themselves. When we say a piece of code has an MIT license or a GPL license, we roughly know what we can and cannot do with it. For datasets, though, many of the licenses are custom, which makes the exact rights and obligations that come with them very unclear. I'll show you some examples as we go, but this problem is particularly acute because many of these licenses don't take care to specify all the rights; they leave many things open-ended, which makes it very hard to use those datasets compliantly. You could use the dataset, but at some point it could come back to bite you.

So, what do we want? OpenDataology, our open source project with LF AI, aims to identify these potential license compliance risks when using publicly available datasets, especially to build commercial AI software, because many companies also rely on these publicly available datasets to build their software. We want to identify the risks associated with using publicly available datasets commercially. We also want to document those risks, store them in a central place, and open source our whole process, so that everyone can be aware of the risks and can contribute, because this is not a one-person effort.

I'll quickly go over the process that OpenDataology currently uses to identify the risks. Suppose one wants to use a publicly available dataset, for example CIFAR-10, to either commercially distribute the dataset, release it with a product or a model, or commercialize just the model's output.
This is the process we propose. We wrote it up as a paper, which is under minor revision. In the first phase, before using the dataset, we ask the AI engineer, or whoever wants to use the dataset, to extract its license, identify its provenance using the format we specify (which I'll show later), and identify its lineage. These three details are important because they are what a lawyer would need to conduct a compatibility analysis and understand all the constituent licenses.

For example, with CIFAR-10, we first extract the license, which is the license available on the official dataset page. If you read this license, you'll understand what I meant about custom licenses being very unclear. It basically says: cite the dataset if you intend to use it. How can you use the dataset? Which uses are you allowed? It's not clear at all. Once the license is extracted, we locate the official data source and extract all the metadata associated with it. Initially we designed our own format for this, but now we are working with SPDX, trying to use the SPDX Dataset profile to collect the provenance details and all the other details associated with the dataset, so that both provenance and lineage can be recorded.

Identifying the lineage of the dataset is slightly more complex, in the sense that we need to trace the dataset creation process, which tells us which data sources the dataset was created from. Then we have to locate all of those official sources and extract the licensing time range I mentioned earlier: when exactly was the data collected from these sources? Then we identify the actual licenses. Once we have identified each license, we hand all the lineage and provenance to the lawyer as a data profile. The lawyer takes these licenses and decomposes them into a format like this.
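The three details gathered in this first phase (license, provenance, lineage) could be captured in a record like the following sketch. The field names are my own assumptions, not the paper's format or the SPDX profile, and the CIFAR-10 values only paraphrase what the talk describes:

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """Illustrative record of what a lawyer needs for compatibility analysis."""
    dataset: str
    official_source: str                 # URL of the official distribution point
    license_text: str                    # verbatim license as found at the source
    collection_window: tuple[str, str]   # (start, end) of when the data was gathered
    parents: list[str] = field(default_factory=list)  # upstream datasets (lineage)

# Hypothetical record for CIFAR-10, following the lineage described above
cifar10 = ProvenanceRecord(
    dataset="CIFAR-10",
    official_source="https://www.cs.toronto.edu/~kriz/cifar.html",
    license_text="Please cite the dataset if you intend to use it.",  # paraphrase
    collection_window=("2006", "2009"),  # approximate, per the talk
    parents=["80 Million Tiny Images"],
)
print(cifar10.parents)  # upstream datasets whose licenses must also be respected
```

Each parent dataset would get its own record, recursively, which is why full lineage extraction is expensive.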
Why is this format important? As I was showing earlier, the rights and obligations are often not very clear, and there are many things unique to AI software that one might want to do with a dataset. For instance, can you commercialize just the output of the model? If a model was built with this data, is reverse engineering of the model allowed? We don't want any ambiguity, so we record everything explicitly using this format, which is an extension of the Montreal Data License format proposed by Yoshua Bengio and colleagues. Once all the licenses are decomposed into this format, the lawyer does a compatibility analysis and determines which rights are ultimately allowed and which obligations ultimately apply.

If you look at this: after analyzing the licenses of all the CIFAR-10 data sources, the lawyer found that tagging, distribution, re-representation, and commercialization of AI software built with this data are not allowed. In fact, none of the three usage scenarios was allowed. That spurred us on to analyze more datasets. Before that, though, we also wanted to make sure our process works for a dataset being curated from different data sources. The only difference is that, because we are curating the dataset ourselves, we already know the provenance and lineage, so there is no need to go and find them out. This, then, is the process we came up with in OpenDataology to identify all the license compliance risks, and we are happy to say it works.

However, the part that I really want to present this time is this. If there is one image you take home, take this one: dataset licensing is scary. It's really scary, like the guy there is saying.
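One way to think about the compatibility analysis described above, as a deliberately naive sketch, is that a right survives only if every license in the lineage grants it, while obligations accumulate from all of them. Real license interaction is far subtler and needs a lawyer; this just illustrates why CIFAR-10 inherited the restrictions of its sources:

```python
def combine(licenses):
    """Naive compatibility sketch: intersect rights, union obligations.

    This is an illustration of the principle only, not the project's
    actual legal analysis, which handles many subtler interactions.
    """
    rights = set.intersection(*(set(l["rights"]) for l in licenses))
    obligations = set.union(*(set(l["obligations"]) for l in licenses))
    return rights, obligations

# Hypothetical two-step lineage: the dataset's own license is permissive,
# but an upstream source it scraped from is stricter
lineage = [
    {"rights": {"distribute", "train-model", "commercialize"},
     "obligations": {"attribute"}},
    {"rights": {"train-model"},
     "obligations": {"non-commercial-only"}},
]
rights, obligations = combine(lineage)
print(sorted(rights))        # ['train-model'] -- only what every license allows
print(sorted(obligations))   # everything any license requires
```

The stricter upstream license wins: commercialization disappears from the combined rights even though the dataset's own license permitted it, which mirrors what the lawyer found for CIFAR-10.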
We conducted an analysis of 221 publicly available datasets. Importantly, we did not do the lineage extraction fully, because 221 datasets is a lot of datasets and lineage extraction is hard: you have to read through all the papers and identify exactly which datasets they used. So we wanted to do this first without the lineage extraction and, once we have built more of a community, do more of it.

I want to start you off slow. The first scary part: about 107 of those datasets use custom licenses, which means that properly identifying their rights and obligations requires legal expertise; it's not something we already know how to do. Second, among the 44 datasets that use standard licenses, we found that 81% use Creative Commons, 10% use MIT, and about 9% use others, either BSD or ODbL. However, neither MIT nor BSD licenses specify much about machine-learning-specific use cases, and neither does Creative Commons, actually. This is where I wanted to pick up the thread from Jim's talk this morning. At first, when we saw this, we thought: oh good, standard licenses are used. But then the scary part is that they use Creative Commons licenses, because, as the Montreal Data License paper I showed earlier says, Creative Commons is a great license for copyright-specific and artistic works in particular, but it doesn't cover many of these use cases and doesn't specify all the rights and obligations associated with building AI software. More importantly, the notion of derivative work, particularly for AI software, is very ill-defined. Which means that Creative Commons, while it might be a good starting point, doesn't necessarily help. Are you scared yet? No? Okay, let me go further. We analyzed these datasets, right?
At the first level of analysis, we found that for 65% of these datasets, you cannot use them commercially. Let me not make assertive statements: there are risks associated with using them commercially. How about now? Okay, finally. Our results are based only on first-level analysis, which means that if we dig through the full lineage of the remaining 35% of the datasets, we might find risks associated with using those, too. Which means that none of these publicly available datasets can be used easily or, let's say, trivially. And yet, without naming names, I would say most commercial offerings of models are built on these publicly available datasets. It's out there; you can go check a lot of model stores. And I'm sure most of you are aware of the GitHub Copilot issue and how they're being sued. I'm not saying anything will happen, though. Among the 65% at risk, the key risk is that commercial use is not permitted. Other risks include datasets with no license at all, or sources that are untraceable, or other scary aspects. But once again, 62% of them explicitly state that you cannot use the dataset commercially. And yet many of these are being used.

So, enter OpenDataology. Now that I hope I've highlighted the risks enough, or my friendly ghost has scared you enough, I want to show what we can do about it and how OpenDataology is trying to do something about it. The project has four key thrusts. One is developing tools that help do this analysis. Another is the process I showed: right now that process, the first thing I demonstrated, is entirely manual, and even though it's comprehensive, it's hard to follow.
So we want to create tools around it. We also want to work with standards communities so that all the metadata can be recorded in a standard format, and the community can be encouraged to record this metadata in standard formats, so that consumption and analysis become easy. And we want to build a community around it, because as you might have noticed, analyzing the risks associated with each dataset is hard, time-consuming, and requires legal expertise to do carefully, and legal expertise is in short supply.

Let me quickly take you over the tools we have. Right now, we have a portal that holds all the metadata we have analyzed, so that anyone interested in using a dataset can look it up and see the risks: what they can and cannot do with the dataset. There are two views: a dataset view and a license view. The license view clearly states what can and cannot be done and what the obligations are, and if you click on a dataset, it gives you a similar view like this. The second tool: if you have a dataset you want to put out there, and you want to clearly specify what one is and is not allowed to do with it, we have a tool that helps you generate a license in the format I showed earlier by answering a few simple questions. You can just answer the questions, say that you're okay with people commercializing the output of a model built on your data, and those rights will be allowed; you can likewise specify the obligations.
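The questionnaire-driven generator described above can be imagined as a simple mapping from answers to rights. This is a toy sketch in the spirit of the tool; the questions and right names are my own illustrations, not the actual tool's:

```python
# Toy sketch of a questionnaire-driven license generator, in the spirit
# of the Montreal Data License approach. Questions and right names are
# illustrative, not the real tool's.
QUESTIONS = {
    "May others commercialize a model's output?": "commercialize-output",
    "May others distribute the raw data?": "distribute",
    "May others label or tag the data?": "tag",
}

def generate_rights(answers: dict[str, bool]) -> set[str]:
    """Grant exactly the rights whose question was answered 'yes'."""
    return {right for question, right in QUESTIONS.items() if answers.get(question)}

granted = generate_rights({
    "May others commercialize a model's output?": True,
    "May others distribute the raw data?": False,
})
print(sorted(granted))  # ['commercialize-output']
```

Because every right is stated explicitly (and everything unanswered stays denied), the resulting license avoids the open-endedness that makes custom dataset licenses so hard to analyze.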
Then, community: we are on GitHub, we are welcoming contributors, the GitHub space has details on how contributions can be made, and we are pretty active, so please check us out. As I've shown you so far, it's a big problem, and it's hard for the eight people I showed earlier to solve it alone. So anyone who's interested, please reach out to me or to the project, and we'd be happy to onboard you.

And finally, as I said, we're also working with the standards community. In particular, we are helping co-create the SPDX Dataset profile, and that slide is an example of how CIFAR-10's dataset profile can be written using SPDX version 3.0. Later, if you're attending the compliance summit, I'll also demonstrate how we can use this profile to conduct the license compliance analysis. But the key message is that we are trying to make it easy for people to do this compliance work by making sure all the data is available in an ISO standard, so that if dataset creators start putting their metadata in a standard format, one can extract the data and analyze the compliance associated with it. Similarly, we are working to ensure that this license format, an enhanced Montreal Data License, is adopted widely in the community to specify the rights and obligations associated with datasets, instead of using code licenses or other licenses that were not created specifically for datasets.

That brings me to the end of my talk. OpenDataology loves contributions, and these are all the ways you can help. It's a very young project. We are aggressively trying to get more datasets in there, so that it becomes a repository people can come to and rely on. One of the key things we are missing is legal expertise.
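For a flavor of what recording dataset metadata in the spirit of the SPDX 3.0 Dataset profile might look like, here is a rough, non-normative sketch as a Python dictionary. The property names only approximate the real profile (consult the SPDX 3.0 specification for the exact vocabulary), and the CIFAR-10 values paraphrase this talk:

```python
# Rough, non-normative sketch of dataset metadata in the spirit of the
# SPDX 3.0 Dataset profile. Property names approximate the spec; check
# the official SPDX 3.0 model for the exact terms.
cifar10_profile = {
    "type": "Dataset",
    "name": "CIFAR-10",
    "datasetType": "image",
    "dataCollectionProcess": "Derived from the 80 Million Tiny Images dataset",
    "intendedUse": "Object-recognition research",
    # the SPDX distinction that matters for compliance work:
    "declaredLicense": "Custom (cite-the-dataset clause)",  # what the publisher says
    "concludedLicense": None,  # what an analysis concludes; still to be determined
}
print(cifar10_profile["declaredLicense"])
```

Recording the declared and concluded licenses as separate fields is exactly what makes the later analysis tractable, since, as noted in the Q&A below, dataset platforms today publish only declared licenses.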
As I showed, many of these licenses are custom licenses, which means that someone with legal knowledge has to go through them and tell us what the rights and obligations are. We software engineers, AI engineers, and developers can extract all the data and do a lot of the legwork, but someone has to help us interpret it. Once it's interpreted, the knowledge can be stored, and that's where tool development comes in. We want to create infrastructure where, once a lawyer gives us the rights-and-obligations decomposition of a license, we can store it, and if a compatibility analysis between two licenses has already been done, we can reuse it next time instead of calling on legal expertise again. Hopefully it's an exponential decay, where the amount of legal expertise we need decreases as the project goes on. We also want people to help us verify the provenance and lineage we already have, and to conduct lineage analysis.
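The reuse idea described above is essentially memoization of the lawyer's verdicts. Here is a minimal sketch, with function and store names of my own invention:

```python
# Sketch of reusing lawyer-provided analyses: once a pair of licenses has
# been analyzed, the verdict is cached so the (scarce, expensive) legal
# expertise is not needed again for the same pair. Names are illustrative.
_compat_cache: dict[frozenset, str] = {}

def compatibility(lic_a: str, lic_b: str, ask_lawyer) -> str:
    key = frozenset((lic_a, lic_b))  # order-independent pair
    if key not in _compat_cache:
        _compat_cache[key] = ask_lawyer(lic_a, lic_b)  # expensive: human expert
    return _compat_cache[key]

consultations = []
def lawyer(a, b):
    consultations.append((a, b))
    return "compatible"  # stand-in verdict for the sketch

compatibility("CC-BY-4.0", "MIT", lawyer)
compatibility("MIT", "CC-BY-4.0", lawyer)  # cache hit: no second consultation
print(len(consultations))  # the lawyer was consulted only once
```

As the cache of decomposed licenses and pairwise verdicts grows, each new dataset analysis needs less fresh legal input, which is the exponential decay the talk hopes for.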
As I said, lineage analysis is hard, so any help there would be amazing. A lot of this is being done by very few people right now, and as thorough as we are, we could make mistakes, so you can also second us: proofread our work, check it, raise issues. That would be amazing. Or help us create documentation and tutorials. And there's a lot of research to be done here, if anyone's interested: automatically extracting provenance, automatically identifying lineage, doing the compatibility analysis with first-order logic. All of these are great research problems that people could write their PhDs on; it's a rich field. We also want to work on standards creation, and we'd be happy to talk to any community that wants to standardize the metadata. Once again, I purposely kept this talk short so that we can have more discussion, and if not, I can let you go early. I'm Gopi, and these are all the addresses where we can be reached. Thanks for listening, and I'm happy to take any questions now.

Yes, thank you for the nice presentation. We all know that it took a very long time for the community and for developers to realize the importance of open source software licenses. So compared to software licensing, what phase is dataset licensing in now, in your opinion?

I used to always start this talk by saying that dataset licensing right now is like open source licensing was 20 or 30 years ago, only messier. I didn't do it this time because I didn't want to be harsh, but there, I've done it again. But yes, it's in very early stages.
People are starting to realize this is important, but there's not a lot of momentum and not a lot of understanding. Even when consumers think it's important, the only thing they end up doing is taking the license slapped on the dataset and saying: oh, it's Creative Commons ShareAlike, all I have to do is share it under the same terms, so I can use it. Or sometimes it's: oh, please cite the dataset. CIFAR-10 says please cite the dataset if you're using it, so people think it's a de facto yes for anything they want to do, but it's not. They have to do the provenance analysis, the lineage analysis, and the interaction analysis to understand that there are many risks. Unfortunately, the community is not thinking at that level right now.

Okay, thank you very much. You're very welcome.

Maybe another question. For open source software, we have GitHub to host the source code, so we can do the license analysis and other work there. For datasets, is there some platform to host the data so we can do the same thing?

Yes, there are several platforms right now. For example, there's the Schieffer platform from MindSpore, there's Hugging Face, there's Kaggle, there's Papers with Code. All of these central repositories host datasets, and most of them put a license on them. But the problem is that those are all declared licenses; none of them are concluded licenses.

Thank you very much. You're very welcome.

Thank you very much for the presentation. I'd like to make sure I understand the lineage analysis correctly. A dataset has a license, but the dataset contains many pieces of data, and I think we need to respect each piece of data's license. So does lineage analysis mean analyzing each piece of data's license?

Yes, you're absolutely right.
There's an important problem there, though: identifying the minimum licensable data unit. I agree with you that we should trace all the way back to the individual data point. However, for some of these datasets we are talking about billions of images, and it's very hard to do that at scale, let's put it that way. So right now we try to stop at least at the level of: if this was extracted from Google Images, let's at least check whether it respects Google Images' licenses. But you're right. Flickr, for instance, says that each user's license must be respected, so if the data came from Flickr and you want to be completely thorough, you do have to go down to the individual data point and put all the licenses together. It's a hard problem. Thank you very much. Thanks for the question.

If there are no other questions, thanks a lot for coming to my talk, even though it's the last one, and thanks for the questions. I appreciate your interest.