Greetings, everyone. Today I'll be giving a talk on OpenDataology, which is an open source data license compliance project. Fair warning: this is my first in-person talk since the pandemic, so I'm really excited, and also slightly nervous. So before we start, as a nod to the past two years: can everybody see my screen? Great.

A little bit about myself. My name is Gopi Krishnan Rajbahadur. I started my career as a programmer analyst at CSC in India back in 2012, where I actually did COBOL programming for two years. Then I got tired of COBOL and went into data science. I was an engineer building data science pipelines, and then I moved into a senior software engineer role, still mostly developing data science pipelines. After that I did research for four years at Queen's University, and now I work as a senior researcher at Huawei, where I look into AI engineering, dataset licensing, and all of this fun stuff. So I've been working with data for a while, and that's why I feel comfortable giving this talk. A few disclaimers, which my company really wanted me to put in here: all of these views are the views of the contributors, not of the company. That's it; I'm going to rush through this.

OpenDataology. If there's one thing you take away from this talk, let it be this logo and this name; I've won if that happens. But why should you care? I'm going to spend the next twenty-odd minutes convincing you, and the last ten telling you a little about what we did so far.

Software ate the world, past tense; AI is eating the world, present tense. This whole revolution is fueled by the availability of massive amounts of data; as you can see from plenty of evidence, news articles, and research, it's a huge market. And this is a graph put out by Andrew Ng: when deep learners started becoming popular, their performance was attributed almost entirely to the availability of large swaths of data, and almost nothing else.

How do we get this data, typically? There are only a few ways, actually. One, we can go buy the data from someone, procure it from a third party that sells it. Or one can create it from scratch, which is actually neat, because you control everything end to end. But both of those are expensive, time-consuming, and frankly too tedious. So what do people mostly default to? They use an existing dataset. They either download a dataset that's already available and nicely published, like ImageNet or CIFAR, or they hit search engines like Google or Flickr and curate their own dataset for their custom purpose.

However, as with most things in the world, these datasets have licenses, and each license has rights and obligations. Rights tell you what you can do with the dataset: these are the things you're entitled to do, and these are things that are not allowed. Obligations are: if you cash in on a right, what are the things you have to do so you can continue to enjoy it? Down here below, I have the licenses and obligations of some famous datasets. At least most of them have some sort of license out there.
And in general, we can abstract it out: a license outlines rights and obligations. Our project is about making sure that the rights and obligations outlined by those licenses are indeed valid and can be followed. This may sound very familiar to anyone who has been working with open source for a long time: all open source software has a license, so why am I making a fuss about it? Why do we need a whole project? This is why. Unlike with open source software, conducting license compatibility analysis for datasets is hard, primarily because the provenance, lineage, and license-related challenges differ. In software, the IP is clear; the copyright is fairly clear. With data, it is not. Who created the data, who owns the data, and who curates the data can all be three different parties, and each of them performs an action. Does that mean they own the IP? We don't know; it's an open question, and we don't make any claims here. We are just saying there are risks, and doing this analysis is hard. This slide gives you an overview of the challenges, and we'll go through them one by one.

First, the provenance-related challenges. One of the first is unclear licensing range. What do I mean by that? Consider the dataset CIFAR-10, and suppose you want to use it in 2022, which is now. This dataset was created in 2009, and this is all they say about the license, one line: please cite if you use this dataset. It was created in 2009 from another dataset called 80 Million Tiny Images, which was in turn built by hitting a lot of search engines, and we don't know when, because none of these papers documents that. So which licensing range should apply? The licenses of many of those search engines have evolved since then, so we can't just use the current ones. We need to find the right licensing range, and that is a very specific challenge.

Then, unclear licensing locations. With source code these days it's fairly straightforward: there are only a couple of places where you find the license, either in the source package, in a README, or in a LICENSE.txt. For datasets, it's still very wild out there. Sometimes the license is on a completely different webpage, sometimes it's only in the paper the dataset was published with, and sometimes it's not present at all: there's just a note that says to do this if you're using the dataset, and that's about it.

The other problem is that, because of the pervasiveness of machine learning models, these datasets are available on so many different platforms, and each platform can sometimes slap on a license of its own. For instance, because the CIFAR-10 license is not very clear, some platforms say it's licensed under MIT, and some say it's licensed under Creative Commons. That's a problem, because anyone who downloads it might just assume that's the license that applies, while the source is not clear.
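To make that licensing-range challenge concrete: the Internet Archive's Wayback Machine exposes a public availability API that returns the archived snapshot of a page closest to a given date, which is one way to check what a source's terms actually said back when the data was collected. Here is a minimal sketch, assuming the requests library; the URL and date are illustrative, and this is just the idea, not part of OpenDataology's tooling:

```python
# Minimal sketch: look up an archived snapshot of a source's terms page
# near a dataset's creation date, using the Internet Archive's public
# "availability" API. The URL and date below are illustrative.
import requests

def closest_snapshot(url: str, timestamp: str) -> str | None:
    """Return the Wayback Machine snapshot URL closest to `timestamp`
    (YYYYMMDD), or None if nothing was archived."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# CIFAR-10 derives from 80 Million Tiny Images (2006), so we would check
# what a source's terms said around then, not what they say today.
print(closest_snapshot("flickr.com/terms", "20060101"))
```

The same lookup works for any source page whose historical terms you need to inspect.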
The lineage-related challenges are actually interesting. For CIFAR-10, if you just read the paper, they say, oh, we created it from 80 Million Tiny Images. However, the 80 Million Tiny Images dataset was itself created from several different sources, and if you're using CIFAR-10 and not doing a deep dive, there is no way you're going to find all the sources associated with it. Which means all of those licenses actually apply to the CIFAR-10 dataset, but that information is a step removed, and nobody knows unless someone goes through and analyzes it.

Another challenge, which we are not tackling as part of this project yet, is identifying the minimum licensable unit. Many of these image datasets, many datasets in general, have different levels of abstraction. CIFAR-10 was created from 80 Million Tiny Images, 80 Million Tiny Images was created from all of these different search engines, and each individual image has a license of its own. So which unit do we consider? Do we just say, okay, let's go as far as Google and stop there, because after that it's unmanageable? Or do we say every individual image's license should apply, and the final license should be a representation of all of them? We don't know, and identifying this is a huge task.

And finally, license interactions are not very clear. How the MIT license and the LGPL interact is fairly well studied, but how "please cite if you intend to use this dataset" interacts with some custom license is not straightforward or clear. So those are some of the common challenges.

So, OpenDataology. I hope I have convinced you by now that this is a problem. Our project tries to solve it through a few things: a license compliance analysis process, and a portal holding all the metadata we get, so that people can come and use the results. And if people follow the process themselves, they can share their metadata and contribute, and as a community we start building a repository of the actual final rights and obligations associated with every publicly available dataset out there, eventually, but at least to begin with, the common ones.

I'll tell you how we can do this with the project, first for publicly available datasets. To make this easy, because I usually understand things through examples, let's take an example where an AI engineer wants to use the CIFAR-10 dataset, either to distribute it commercially, to release a product with an AI model, or to commercialize the output. Because the license, as I've been repeating, just says "please cite," the engineer isn't sure what's allowed. So we suggest they go through this process, which would let them analyze the final rights and obligations.

The process looks something like this, and I'll walk us through it. First, one extracts the license. The key thing here is to make a best effort to identify it from the source; failing that, capture the license wherever the dataset was downloaded from. Then, in the provenance extraction step, it's key to identify the official source of the data and check whether the license originally obtained and the license provided at the official source are the same. If not, we suggest filing a PR or doing something to get it changed; but even if that doesn't happen, at least now you know, so mark it as an untrusted source and use the official version of the data source.
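To make the provenance-extraction step concrete, here's a hypothetical sketch of the kind of record such templates capture; the field names are my own illustration, not the project's actual schema:

```python
# Hypothetical sketch of a provenance record for one dataset source.
# Field names are illustrative; see the OpenDataology templates for
# the actual schema.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ProvenanceRecord:
    dataset_name: str            # e.g. "CIFAR-10"
    official_source_url: str     # where the creators actually host it
    download_source_url: str     # where this copy was obtained
    license_as_downloaded: str   # license text/notice found at download
    license_at_official: str     # license text/notice at the official source
    retrieved_on: date = field(default_factory=date.today)

    @property
    def trusted(self) -> bool:
        # If the downloaded license differs from the official one, the
        # download source is untrusted: fall back to the official version
        # (and consider filing a PR upstream).
        return self.license_as_downloaded == self.license_at_official
```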
We have templates with which we suggest one extracts the metadata associated with the official source. This helps mitigate the non-standard license location problem.

Then the fun part: lineage extraction. Right now we do this manually, through sheer elbow grease: we trace the dataset creation process by reading the paper and identifying the sources. Then, for the licensing range, what the project recommends so far is to find the top-level source. For instance, for CIFAR-10: 80 Million Tiny Images was published in 2006, so hopefully the licenses of the data sources from around 2005 to 2006 are what apply, and we recommend using the Wayback Machine or something like it to extract the licenses from that time and decompose the rights and obligations.

Once these details are extracted, and most AI engineers are able to do this, we suggest putting them in a standard format, which I will show, in the data provenance table and the data lineage table, and passing them on to someone with legal expertise. Eventually we hope there is a database where someone can submit this metadata and ask what the final rights and obligations are; once there are enough atomic pieces in that database and the interaction analysis is done, the results could be provided automatically. But for now we have six datasets, so that's not a lot.

For the license compliance assessment, we suggest the lawyer, or someone with legal expertise, takes the licenses of the different sources and puts them in a format like this. This is a format we came up with based on the Montreal Data License, which was proposed by the Mila group, Yoshua Bengio's group at the University of Montreal, and we extended it for a few more use cases; the blue parts represent where we extended it. We are promoting this format because it acts as a good intermediary between the AI engineers and the legal team: AI engineers have enough expertise to populate or consume these fields and identify the obligations outlined there, and the legal team can always generate legalese from the template. It's a good intermediate representation and a means for customizable licenses in the future.

Once that is done, the lawyer puts the licenses of all the different data sources in this format and performs a first-order interaction analysis, with legal expertise of course, and finally ascertains which rights are actually allowed. This is an interesting analysis, because if you just decompose the CIFAR-10 license, the top-level license, you would think all rights are allowed. However, after all the interaction analysis we did, we find that commercializing the output, commercializing the model, or tagging, distributing, and representing the model commercially is not allowed. And yet so many people use these datasets commercially today; here are the results of our analysis. Our process can also be adapted and retrofitted from pre-curated datasets to ones you curate yourself.
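As a toy illustration of that first-order interaction analysis, and emphatically not a substitute for legal review: the intuition is that a right survives only if every source in the lineage grants it, while obligations accumulate across sources. The rights names and the per-source values below are made up for illustration, not our actual CIFAR-10 analysis:

```python
# Toy sketch of the intersection idea behind first-order interaction
# analysis: a right survives only if every source in the lineage grants
# it; obligations accumulate across sources. All values are illustrative.
RIGHTS = ("use", "distribute", "commercialize_model", "commercialize_output")

sources = {
    "CIFAR-10 (top level)": {r: True for r in RIGHTS},
    "80 Million Tiny Images": {"use": True, "distribute": True,
                               "commercialize_model": False,
                               "commercialize_output": False},
}
obligations = {
    "CIFAR-10 (top level)": {"cite the dataset"},
    "80 Million Tiny Images": {"cite the dataset"},
}

# A right is granted only if every source grants it.
final_rights = {r: all(src[r] for src in sources.values()) for r in RIGHTS}
# Obligations are the union over all sources.
final_obligations = set().union(*obligations.values())

print(final_rights)        # the commercialization rights drop out
print(final_obligations)   # {'cite the dataset'}
```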
The only change is that someone creating their own dataset already knows the provenance, they know the sources they're hitting, so we suggest they use the templates we provide to record the provenance and lineage details; the rest of the process remains the same. So that's the overview of the technical stuff, the crux of our process. Now I'll take us through what we have done so far and how we could use the community's help. Yes?

[Audience] What if, after two or three months, the page with the license or something else is missing? What can you do?

I'm hoping the analysis we do will always be present; we'll make sure we maintain it. But we can't speak for the dataset creators. I'm hoping the papers they publish are on arXiv, and arXiv is immutable; it's always there, right? Once again, I don't have a good answer to that. The source of the problem is that a lot of the people who created these datasets have not been very diligent about assigning a license or maintaining it, and we hope that changes with projects like these.

So, as I was saying, we organize the current progress into four pillars. One is the processes we need to establish; we work on that. Then we develop tools and automation to enable those processes, because if we say you have to do all of it manually, nobody's going to stick around beyond three months. Then we want to build a community around it: as you're probably already aware, this is a massive, massive undertaking. It's not possible for one company, one research group, or any one person to do it, and it's the need of the hour. We hope we can recruit the community, get them excited, spread awareness that there are risks associated with using these datasets, and get people in. And we also want to develop standards around it.

I want to give a quick shout-out to all the core contributors who started this project off. They did a lot of work; a lot of nights went into it. Thank you.

So, current progress: this process, this beautiful process. We tried to publish a paper around it; it's still under revision, and you can find it on arXiv. And the results: we analyzed six commonly used datasets. If you have been developing AI for a while, you might recognize most of them, and this is what you can do commercially with each. But I can tell you, certain pre-trained model sources are charging money for models trained on some of these, and I won't name names.

To enable community building, we put all the metadata we collected, and all the final rights and obligations we analyzed, into a portal. It's free; you can go look things up, and contribute if you do any analysis yourself. We decompose the final result into: you can do this, you cannot do this, these are the limitations, and these are the obligations. So it's easy to consume, and hopefully nobody gets confused. And this is the provenance metadata schema we developed, so that the provenance details of every source involved in the whole dataset lifecycle can be captured. We're open for collaboration: the GitHub link is on both the first slide and the last slide, and I'm happy to share it with anyone.

We are actively looking for volunteers. As for tools: I showed you that nice dataset license template. We came up with a very alpha version of a tool that asks you certain questions, like, do you want to allow research on this dataset?, and guides you through creating a license for your dataset. We hope newly created datasets will use this tool and this nice extended Montreal dataset license, so that the result is easy to consume and the license becomes customizable.
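As a rough sketch of how such a question-driven generator can work; the questions and field names here are invented for illustration and are not the alpha tool's actual implementation:

```python
# Rough sketch of a question-driven license generator in the spirit of
# the alpha tool; questions and field names are illustrative only.
QUESTIONS = [
    ("research", "Allow research use of this dataset? [y/n] "),
    ("commercial", "Allow commercial use of models trained on it? [y/n] "),
    ("distribute", "Allow redistribution of the raw data? [y/n] "),
    ("attribution", "Require users to cite the dataset? [y/n] "),
]

def ask() -> dict:
    """Walk a dataset creator through the questions."""
    return {key: input(prompt).strip().lower() == "y"
            for key, prompt in QUESTIONS}

def render(answers: dict) -> str:
    """Emit a human-readable summary in an MDL-like rights/obligations shape."""
    rights = [k for k in ("research", "commercial", "distribute") if answers[k]]
    obligations = ["cite the dataset"] if answers["attribution"] else []
    return (f"Rights granted: {', '.join(rights) or 'none'}\n"
            f"Obligations: {', '.join(obligations) or 'none'}")

if __name__ == "__main__":
    print(render(ask()))
```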
And this is our tool. We currently have rudimentary capabilities for generating an SPDX-like markup of our metadata, to work with SPDX in the future, and we are actively working with the SPDX community to have a data BOM, a bill of materials for data. Once the data BOM is there, all this metadata will be represented in it, and there should be full compatibility with tool support.

These are the formats I've been talking about for a while now: we suggest that, for now, provenance be recorded in this format and the license be recorded in the Montreal Data License format. Once the data BOM exists, we would ideally want all of these provenance details represented in a data BOM so that it plays nicely with the rest of the world. And we welcome feedback; at any point, if you have feedback, reach out to me. I'm online most of the time. On the standards side, we've been developing an initial data BOM in collaboration with SPDX.

So that's what we have done, and now for the fun part: what we are trying to do next. We don't want to stop at licenses. We want to extend the project to capture all the problems that come with copyright, privacy, and ethics. Yes, it's a massive thing to take on. But hopefully enough people realize there is a lot to be done if we are going to use these datasets, and these datasets are going to power the next decade or so of machine learning; unless we radically move away from supervised and unsupervised learning, we are going to need data, and lots of it. All of these questions have to be answered, especially with regulations coming in and governments starting to put the screws on.

We realize most of this is manual right now, and that's hard, but we had to start somewhere, and this is where we started. We want to develop tools, like data clone detection, that will help us analyze lineage: to identify and automatically extract provenance and lineage and do the final interaction analysis. For automatic provenance and lineage extraction, we are looking into NLP methods to read through these websites, understand their semantics, and identify where the source is, or to read through the paper and identify where the official source should be located, and things like that.

For lineage analysis, where multiple data sources are involved, we can use data clone detection, which has been around for a while now. It goes and finds whether an image in my dataset is part of another data source, like Google or Facebook, and then says, oh, this might be a potential data source for it. And there are many different types of clones, as anyone who has worked with code clones knows: exact duplicates, slight modifications, or something someone took and modified so much that it looks like a different source. Those are the kinds of duplicates and challenges we have to deal with here, but we are currently working on tools that will help us detect at least Type-1 and Type-3 clones in image data. Type-1 clones are exact duplicates, and Type-3 clones have modifications or new additions. A minimal sketch of the idea is below.
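Here is that minimal sketch, assuming a pair of image files on disk; the 8x8 average hash and the distance threshold are illustrative choices, not our actual tooling:

```python
# Minimal sketch of Type-1 / Type-3 image clone detection. The 8x8
# average-hash and the threshold are illustrative, not OpenDataology's
# actual tooling.
import hashlib
from pathlib import Path
from PIL import Image  # Pillow

def type1_fingerprint(path: Path) -> str:
    """Type-1 (exact duplicate): hash of the raw file bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def average_hash(path: Path, size: int = 8) -> tuple:
    """A simple perceptual hash: grayscale, downscale, threshold at the mean."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return tuple(p > mean for p in pixels)

def hamming(a: tuple, b: tuple) -> int:
    return sum(x != y for x, y in zip(a, b))

def compare(a: Path, b: Path, type3_threshold: int = 10) -> str:
    if type1_fingerprint(a) == type1_fingerprint(b):
        return "Type-1 clone (exact duplicate)"
    if hamming(average_hash(a), average_hash(b)) <= type3_threshold:
        return "Type-3 clone candidate (modified near-duplicate)"
    return "no clone detected"

if __name__ == "__main__":
    print(compare(Path("query.png"), Path("candidate.png")))
```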
And this is, hopefully, the timeline we have. We are very open to all forms of contribution. If you want to come work with us on some research, if any of this sounds interesting, we'd be very happy. If you want to help us write the NLP code to extract provenance or lineage, we'd be super happy. And if you're a lawyer, please contact me: licenses are hard, legal language is clunky, we don't know enough, and we can use the help.

And that's my next point: we want to establish a governance policy, which we are trying to do with the help of the LF AI & Data Foundation. I'm happy to announce that our project is a sandbox project with LF AI & Data, and they are helping us with some of this policy creation and wiki creation. However, we can't officially put it up yet because the paperwork hasn't gone through.

And finally, we are also trying to take all of this knowledge, create awareness, and, at least from this point forward, make the world a little bit of a better place by working with standards communities to create standards. I've been working with Kate Cairne to form this AI BOM profile, of which the data BOM is also a part, and hopefully this creates more awareness. The goal is that SPDX will eventually capture all the details associated with data provenance, lineage, and license decomposition, so that all the sources are tracked, all the official sources are marked, and all the licenses are available. Once it's all available, it's hopefully just a matter of doing the interaction analysis, and I hope the platform grows enough to have enough interaction analyses done that they can be codified into first-order or second-order logic.

So that's the look ahead. We have a timeline, we want to do all of this, and we need help. If you want to contribute in any way with your time, if you want to analyze the provenance and lineage of some datasets, just mark datasets up, or join the discussions, shoot us an email or join our Slack. We are a very friendly bunch, we respond promptly, and we will be really grateful. Thank you for coming to my talk; I know it's the last day of the conference. Thank you, super grateful. Yes?

[Audience] I wanted to say thank you for the presentation and for what you are doing for the community; it's great, and it's what the ecosystem needs. Apart from this dataset licenses question, a good question is the one that was raised by Daphne about Copilot: is it legitimate to use the code itself? It's highly connected with open source itself.

That's almost one of the motivations for why we started this project. I'm not a regulating body; we are not a regulating body, so we don't know the answer, and different countries can decide differently too. As for Copilot, I don't know the full case, but from some of what I've read, some license terms were violated, and that is not okay. Copyright is a different problem, but contract, that's a different problem too, right? And we are right now dealing with contract.
And so your question is: a lot of this data is being used, the copyright is not clear, so what happens then, right? That's why we need to tackle copyright compliance issues as well. But we also don't want to rely on just one government body to issue a regulation: what holds in the US might not apply in Canada. Our project aims at identifying the risks. They may or may not apply to you, but it's always good to know the risks, and that's at least the goal. Yes?

[Audience] First of all, thank you for the presentation.

Thank you. You should contribute; join our Slack.

[Audience] Definitely. Are you also considering patents?

Are we also considering patents? I think copyright and contract together should address that, but we're not explicitly considering patents right now, especially because that database is not accessible. Hopefully enough lawyers get excited about this and can help us with it.

[Audience] Is ethics different there? Ethics should be common everywhere, right?

No, I understand; I think the ethics itself is the same ethics. We will hopefully be able to codify it and look at what problems are happening in datasets. But right now, as I said, we are a few contributors with limited expertise. We need people, and once there are people, we want to tackle ethics, we want to tackle everything, we want to make the world a better place, but we don't have the people right now. Please join us. Thank you.

[Audience] What's the relation between the standards and OpenDataology?

Which standard again? Oh, okay, your question is: in Q4 there's potentially a standard, so what is the relationship, right? OpenDataology relies on a lot of metadata to conduct these dataset license analyses, so we are working with the standards communities to create a standard so that all this metadata can be captured going forward, and OpenDataology can easily use that data to compute the risks associated with datasets. Without a standard form of capturing this metadata, we are having a hard time collecting it right now.

[Audience] Because standards take time, so the creation starts now, but it will finish later.

Yes, that's part of the SPDX work.

[Audience] Okay, thank you.

Yes, sir?

[Audience] It's a bit of a side question, but I wanted to connect back to the Copilot question. This is something I'm actively working with OSI to develop a new license around, because right now, legally speaking, you cannot sue someone over their code unless you demonstrate intent, and generative code has no individual to tie intent to. So we need a new process to define where the legal agency exists within that space. If you're curious about that, I would love to chat so we can get that done.

Just to repeat for the audience: OSI is trying to work on a license that addresses the concerns associated with generative code, and they're looking for people to work on it. Any other questions? If not, please take a picture, shoot me a message, and if you know lawyers, bug them into joining us. Once again, thank you so much. The project's name is OpenDataology. I'm Gopi; I'm around, come talk to me. And let's make the world a better place, one dataset at a time. Thank you.