We have Martin Michlmayr, who is involved in open source software within HP. He is a long-time Debian developer. He has been the DPL. And he is going to talk about FOSSology. Thanks. So FOSSology is a tool that can be used to analyze free software and open source code. And the main functionality at the moment is license detection. And given that Debian does care about licensing, and we recently had some discussions about how to improve the Debian copyright format and things like that, I thought it would make sense to give a brief demo of that tool here. So I'm just going to give a brief introduction, a little bit about the background of the tool, why it was developed, what it does. And then the main part is going to be just showing how it works, just a few things. And then I'll briefly talk about the future of FOSSology, because it's actively being developed, and also how we can use it for Debian. So like I said, FOSSology is actually a framework to study the source code of software. It's very modular, and it can be used to do many different things. But at the moment, the main functionality is to look for license information in source code. So it goes through the source code and looks both for license texts and for license references. So if you say something like "this code is under the GPL", that's not a license itself, but it's a license reference. Just a little bit about the background of where I'm coming from and why FOSSology was developed. So I nowadays work for HP's open source program office. And one of the main functions we have is that if anyone within HP wants to put open source into a product that's being shipped, then they need to come to us. And we have a formal review process to make sure that they understand licensing. For example, that they don't do things like link GPL code with proprietary software, or that they don't ship GPL code without the source code. So we have a formal approval process for that.
And for that, we obviously need various tools, and one of them is FOSSology. And so actually I should say upfront that I don't work directly on FOSSology. I work on something else, but I know about FOSSology and I thought that it would be interesting to do a presentation here because I'm sure lots of people are interested. But I'm not an expert, so if there's anything I can't answer, I can happily take those questions back to my colleagues and send an answer by email. And just about the background of how companies work. So for me it's a really interesting experience because, I mean, I've been doing free software and Debian stuff for a long, long time. And for a long time I was a student, so I did a master's and then I did a PhD where I mostly did Debian stuff instead of doing my research. You know how that goes. And basically, I mean, Debian cares about licensing, and there are some other projects like KDE that really care. But there are many projects that don't care, that don't think about it. And basically the way I think most developers work is, well, if you're part of the bigger Linux community, you just stick the GPL on it like everyone else does and that's it. You don't really think about it very much. And if you're from the BSD camp, you put the BSD license on it. But I think most people don't really care. They just do what other people do and what they think is expected. But then I joined HP, and it's interesting because a large corporation, well, they care about licensing, they really care about that stuff. So our lawyers care and we have to care. And we, like many other companies, ship loads of open source code, so even if you don't see it, something like a projector from HP may contain some open source. And so you need to think about the licensing and what that means. And I guess there are two steps for large companies. One of them is sort of like a procurement office. So you need to actually know what open source is being used and is being shipped.
So you need to kind of keep an inventory to keep track of what you're actually shipping. And the other thing is you need to know what licenses are in that code and to make sure that you follow those licenses. So if it's GPL, that you make the source code available or that you don't link to proprietary code, or in some other cases that you give credit. And so at HP, we want to be a good citizen. We care about that stuff. We want to do it right. And so we had to develop some tools to make it easier, because if you ship loads of source code or software, you can't go through everything and look at it manually. And actually, if you look at large companies or companies that use open source, for many of them it's quite new. They don't really know what it means. So there are some problems we see with free software licensing. So one of them is people say, oh, it's free. It's open. And that means we can just do whatever we want to do. We don't have any obligations. And then you say, well, that's not right. If it's open source, it has a license. And the GPL says that you need to do those things. And that license says you need to do those things. But some companies don't know that. So when we talk to other companies, we see that, especially now with the netbooks, with companies in Taiwan, it's still very new for them. And they basically treat free software as just some third-party code. And they think, oh, we can download it. It's free. We can do whatever we want to. And slowly they're understanding, well, we have obligations. We need to care about that thing. But that's a pretty slow process. And like I said, you need to keep track of what you actually use. That may sound easy. But if you have a large company, maybe you have a central office which keeps an inventory. But if you don't, then you have different business units, all of them making their own decisions, and each of them might know what they have shipped.
But there is no one within the whole corporation who knows what's being shipped. So in my opinion, it makes sense to have some kind of central inventory or repository. But again, that's not what everyone does. And finally, when you know what software you're shipping, you still need to know what licenses are in there. And in many cases, if you look at some package or some upstream code, it will say, well, we are GPL. So if you look at RPM, they have a license field. And that says something like GPL. And then you think, oh, it's the GPL. We just follow the terms of the GPL. But if you actually look at it, there are so many packages that are something like 90% GPL or 99% GPL, and in many cases you can find different licenses in there. And that's often because people incorporate code from other projects which have a different license. And then you can't just say, oh, it's GPL. You actually need to look at every license. So it's kind of complicated. And that's why you need tools. So at HP we have that open source review board which I mentioned. So we look at the code. We look at licenses. We have lawyers who actually know about free software and open source licensing. But we needed some tools. And so FOSSology is based on some internal tools which we have developed. For us, we need to do that, but it's not a competitive thing for us. So I mean, there are other companies, and they need to do the same. Nokia, Siemens, everyone who ships free software needs to review it. So we said, well, why don't we make our tools available as free software and share them with other companies and get them to work on it together. And that's when FOSSology was born. So it's a free software, open source project, mostly GPL v2. It has a public website, mailing list, all that stuff. So how does FOSSology work? I mean, it's pretty simple. You basically load code into the repository. Then you analyze it, and the results are stored in a database. And then you can look at the results.
Looking at the components that make up FOSSology. So you have the software repository. So when you upload files or an ISO image, then those things are stored, not in a database, but somewhere on the file system. That's the software repository. Then you have the database, which stores the results of the analysis and various pieces of information about the things you have uploaded. Then you have what we call the agents. So those are like the plugins that actually do the work. So they do various analyses or various functions. So you have an agent that unpacks things. So you can just upload a .deb file or you can upload a tar file. You can upload an ISO image. And the unpacking agent will just take care of it and unpack everything. Then you have agents that actually analyze it. And like I said, it's very modular. So people can write agents. And people have been thinking about or planning to write different agents. For example, code reuse. So you look for lines of code and whether they're being reused somewhere. And that's interesting from a legal perspective for some companies, to make sure that you don't copy open source code into proprietary code. But it's also interesting for developers, because if you can find code being reused, you might be able to factor it out and put it in a function and not duplicate the code. Then there is the scheduler, which runs the whole thing. And finally, you have essentially a web front end. But you could also access the database directly. So that's sort of the introduction part. And now I'm just going to show it, because that's the best thing to do. So we have a repository which we use where we upload things like Debian and Fedora. At the moment, this repository is not publicly available.
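The components described above (a repository on the file system, agents that do the analysis, a scheduler that runs them, and a database holding the results) can be pictured with a small Python sketch. This is only an illustration of the described architecture under simplifying assumptions, not FOSSology's actual code: the function names, the table layout, and the toy agent that merely checks for the string "GPL" are all made up for the example.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def store_in_repository(repo_dir: Path, data: bytes) -> Path:
    """Keep the uploaded bytes on the file system, keyed by content hash."""
    path = repo_dir / hashlib.sha256(data).hexdigest()
    path.write_bytes(data)
    return path


def license_agent(path: Path) -> list[str]:
    """Toy analysis agent (hypothetical): flag files that mention the GPL."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return ["GPL reference"] if "GPL" in text else []


def run_pipeline(uploads: dict[str, bytes]) -> list[tuple[str, str, str]]:
    """Toy scheduler: store each upload, run every agent, record the results."""
    agents = {"license": license_agent}
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE results (file TEXT, agent TEXT, finding TEXT)")
    with tempfile.TemporaryDirectory() as tmp:
        for name, data in uploads.items():
            stored = store_in_repository(Path(tmp), data)
            for agent_name, agent in agents.items():
                for finding in agent(stored):
                    db.execute(
                        "INSERT INTO results VALUES (?, ?, ?)",
                        (name, agent_name, finding),
                    )
    # The analysis ran once at upload time; viewing is just a query.
    rows = db.execute(
        "SELECT file, agent, finding FROM results ORDER BY file"
    ).fetchall()
    db.close()
    return rows


if __name__ == "__main__":
    print(run_pipeline({
        "a.c": b"/* This code is under the GPL */",
        "b.c": b"int main(void) { return 0; }",
    }))
```

Keeping the uploaded files and the analysis results in separate stores mirrors the split the talk describes: the front end only ever has to query stored results.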
Unfortunately, we wanted to do a public repository with everything, like Fedora, Debian, SUSE, but some of the lawyers of the Linux Foundation were a little bit worried about having a public repository showing license information, because if you find some license problems, then they were worried that Microsoft might come and, you know, we'd have bad publicity. So at the moment, we don't have a public repository, but people can download FOSSology. So there is a Debian package now. There are RPM packages. You can just install it and have your own repository. So Debian, for example, could just set up a FOSSology repository if we think that it's useful for us. So that's just the welcome screen. And so there is some stuff in the repository. So I'm just going to browse. So it's just a simple thing, like a file manager. So I'm going to go to Fedora because I think Debian is not quite complete at the moment. So like I said, you can just upload a whole ISO image and it will unpack everything. Or maybe I can just show that. So if you go to upload, there are different ways of uploading things. So you could just upload a file from your machine, or on the server where you run FOSSology you could mount an ISO image and then upload from that. Or the easiest thing, you could just upload from a URL. You just put in a link and then it will download it. It will unpack it and do things. And one-shot license, what that means is that you can just paste a file into it and then it will analyze that. But that's not what you would usually do. So if you go to browse, we look at the ISO image. And so now it's loading because it has like 1,300 source RPMs in it. And I mean, it's just a simple file browser really. The interesting thing is if you click on the RPM you can see what's in there. So you see the upstream tar file and then the spec file and things. You can look at some meta information about the file.
But the interesting thing is, so let's look at the tar file and now let's look at the license results. So if we click on license, you can now see the licenses that FOSSology has found in that tarball. So it will show you the count, how often something has been referenced. You can then take a look at it. You can look at the actual files and see where it has found something, and there is a description. So we make various differentiations like GPL v2, v3, LGPL. You can see GPL from the FSF. But that's something you can define. Or you could also group those things, because it's all like GPL. You can just say, well, I don't care about those differences. Just show GPL. And as I said before, FOSSology looks for two things. It looks for actual license texts. But it also looks for references. So if you say "this code is under the GPL" or "this code is distributed" or "you can copy", those are some of the words which we look for. And sometimes obviously you get wrong results. It doesn't actually refer to a license. But again, it's a legal thing. So the lawyers would rather look at more things than miss something important. So let's just take a look at the GPL here. So you can now see all the files where it has found this particular reference. And if you go to the file, it will show you the match it has found. So it will find like a 97% match. And it's all matching, so, you know, not everything is a 100% match, because in many cases things are changed. And we will see a couple of interesting examples. Because something which we see very often is that, for example, some of the BSD licenses say something like "the copyright holder" and then some people replace that with their own name. So you get, you know, it's not a 100% match, but it's pretty close. So we find those two licenses in this one file. And so if we go on view, it will take you to that particular part of the code where it has found that license.
And yeah, you can see that's the license reference we usually put into the code. So yeah, it's good. It found that. So it has highlighted what matches the template. So FOSSology has templates of the licenses which it knows about. And it basically does an intelligent match against those templates. And so it highlights what matches the template. So "this program" doesn't match, and that comma there doesn't match, and that's why it's only 97%. So if we go back and if we click on ref, that's the reference. So that's the license which it actually tries to match against. And you can see, because the template we have says "this file" and they said "this program". And here they use a comma and here it's a dash. So that's why it's only 97% and not 100%. But you can see it has found the correct license. And here it has found another one. And that's this part here, which mentions "as published by the Free Software Foundation". And that's why we call that GPL from FSF. And you can see all of those files, they matched the GPL. It's a 96% match. Again, that's all pretty simple. So something we also do is we look for phrases. So that's what I mentioned before. We look for license templates, the actual license texts or common things you put into source code. But then we also look for some magic phrases like "code is distributed" or "you can copy", things like that. And obviously sometimes here you get wrong results, but it's still useful to look at it. So for example, "this program is free software". That's a phrase. Here it has found "backup is free". Because again, that might be something you want to look at. I don't know what it... "will copy a file". So for example, that would be a wrong match. So you don't care about it. It's just some error or message. But it says "will copy". So maybe it could have been a license reference, that you can copy or you can't copy something. So maybe let's just look at a different package to see.
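The percentage figures in the demo come from comparing the text found in a file against a license template. The talk does not describe the algorithm, so the following is only a minimal sketch that uses Python's `difflib.SequenceMatcher` as a stand-in for whatever FOSSology actually does. The two sample strings are shortened, made-up variants of a license notice that imitate the "this file" versus "this program" and comma versus dash differences seen on screen.

```python
import difflib
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def match_percent(template: str, candidate: str) -> int:
    """Similarity between a license template and a candidate text, in percent."""
    matcher = difflib.SequenceMatcher(
        None, tokenize(template), tokenize(candidate), autojunk=False
    )
    return round(100 * matcher.ratio())


def differences(template: str, candidate: str) -> list[tuple[str, str]]:
    """Pairs of (template tokens, candidate tokens) where the two texts differ."""
    a, b = tokenize(template), tokenize(candidate)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical, shortened texts for illustration only.
    template = "This file is free software, you can redistribute it"
    candidate = "This program is free software - you can redistribute it"
    print(match_percent(template, candidate))
    print(differences(template, candidate))
```

On a full-length license notice, one changed word and one changed punctuation mark cost far less than they do in this one-line example, which is consistent with the high-nineties scores shown in the demo.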
So I found one which has a number of different examples. So again, we click on license, and all the analysis of the licenses is done when you upload the code. So it's only done once. It's then stored in the database. So when you click on the license, you can see the result is pretty quick because it doesn't do it now. So again, you can see some GPL. You can see MIT with copyright clause. Maybe let's take a look at that. So yeah, that seems like a pretty good match. And then, yeah, GPL with exception clause. I mean, the thing is that there are so many variants. It's not just the GPL. There are different ways of putting the GPL into code. And there is the GPL v2. There is the GPL v2 or higher. There is GPL v3. You can have the GPL with the exception clause. There are really many things you need to look for. And there are so many licenses. So yeah, so here it has found just the normal GPL. But if we go down, it has found some, hold on. No, yeah, that's actually a bug which I just reported yesterday. So this one should be highlighted in green. Because up here, it has found the green GPL exception clause. And here, this part should be green. So that's a special exception. That's the exception clause. So that's again something you need to know. So you know it's GPL v2, but it also has the exception. There may be LGPL. Yeah, so here is another example, of the LGPL, because it used to be the Library license, but now it's the Lesser GPL. So here you can see it has found a 98% match for the LGPL. And if you go down, you will see what it has found, because they say "Library". But the template which we have is the new one. It says the Lesser GPL. But again, that's just the template. So you could also add a template for the old one. Maybe just look at the phrases. I mean, that's a good example. "It was later released". I mean, here it doesn't. But that might be "it was released under those circumstances" or something like that.
And here you can see this is one example where you have 30 references to the GPL. But then you have one reference to this license. And you have all kinds of different references. And I mean, those examples are actually pretty tame. But I found some examples where you get a list of like three or four screens full of licenses. I mean, it's really insane. And those are really tricky because you need to check if the licenses are compatible. It can be kind of tricky. And even if you see that, you still need to actually look at the code to see if it's being linked together. Because it might be that it's just an example project under a different license, but it's not linked to anything. So that's fine. But it might be something else where it is linked. One of the examples is the SETLIP, which is under some GPL-incompatible license. And then there is one file which is GPL. And that's actually being linked to the rest of the code. And so someone found that and they talked to the developer of that code. And the developer said, oh, yeah, that was a mistake. I'm happy to re-license it. But apparently, I don't know if it's fixed now, but for a long time the upstream code never changed the license. So if you go to some mailing list, you will find the reference where the guy says, yeah, you can re-license it. It's fine. But if you don't know that, if you just look at the code, you will see, oh, my God, it's GPL linking to incompatible code. And the lawyers will go crazy about that, I mean, for good reasons. And so another thing which we're interested in is actually helping to clean up those problems. So if we find such problems, to talk to the upstream developers to get that changed. So anyway, so that was the license browser. So just a few other things. So like I said, you can upload things here. So you can organize things, you can create directories, things like that. You can also define licenses. So for example, there are what we call license terms.
So if we look at, like, GPL, let's say GPL exception clause. So that's how those things are defined internally. So you can see, so that's the GPL exception clause. And then if you go down, this license is associated with this group. And if we look at that, then that's the text we saw before. So that's the template that's being used to look for that. But you could change that. You can add licenses. You can modify things. And this license term thing is actually going to be rewritten to make it easier. But, I mean, it works pretty well already. But maybe if we look at GPL or, yeah, let's look at GPL v2. So what you can see here, if we go down again, you have those references. So if you look at this one, it's probably the whole text of the GPL v2, right? That's the template. But then it knows about those different references, because there are many different ways of referring to the GPL. So this is a very common thing. So you put that into the source code and then you have a main license file with the text. But obviously there are many different ways of referencing. So all of those need to be defined. And what you can see up here are terms associated with this group. So those are the phrases you saw before. So it looks for those words. So if it sees GPL 2 or GPL v2, all of those different variants, it will flag that as, we found something related to the GPL. Maybe it's a real reference. Maybe it's just a phrase. And another thing which I think is particularly useful for Debian is that you can group licenses into different categories. So for example, you have the Free Software Foundation. They have a list of approved licenses. They have a list of GPL-incompatible licenses. Then you have Fedora. They have a list of good licenses. They have a list of bad licenses. And it's the same for Debian. So you could define a DFSG category where we would define what licenses we have found to be free.
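The idea of terms associated with a license group, where any of several spellings such as "GPL 2" or "GPL v2" raises a flag, can be sketched with regular expressions. This is a simplified illustration, not FOSSology's real term definitions: the `TERMS` table and the patterns in it are assumptions made for the example.

```python
import re

# Hypothetical term table: group name -> pattern covering spelling variants.
TERMS = {
    "GPL v2": re.compile(r"\bGPL[\s-]*(?:version|v)?[\s.]*2\b", re.IGNORECASE),
    "GPL v3": re.compile(r"\bGPL[\s-]*(?:version|v)?[\s.]*3\b", re.IGNORECASE),
}


def find_terms(text: str) -> list[tuple[str, str]]:
    """Return (group, matched text) for every term found in the text."""
    hits = []
    for group, pattern in TERMS.items():
        hits.extend((group, m.group(0)) for m in pattern.finditer(text))
    return hits


if __name__ == "__main__":
    print(find_terms("Licensed under GPLv2; see also GPL version 3."))
```

A hit from such a pattern only says that something GPL-related is nearby. As the talk points out, it may be a real reference or just a phrase, so a person still has to look at the file.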
And that makes it easy, because then if you analyze something, you can just ignore that whole category which is free and look at all the rest. That would make it very easy. So if we just look at those groups. So for example, we have in here, like I said, the FSF group. So you have what they consider free documentation, what they consider incompatible. And then you have Fedora. You have bad licenses, good licenses. The only problem is that at the moment, neither debian-legal nor the FTP masters really maintain an official list of what we consider free. I mean, obviously if we look at the repository and see what's in there, you can find out which ones they found to be free. But there isn't one simple list or a wiki page you could go to. Whereas in the case of Fedora, they have a wiki. It's actually pretty good. There they define the good licenses, the bad licenses, why they don't like them, things like that. So if we just look at this, you can define colors. So the good licenses are shown in green. The bad ones are flagged with red. And you can see that those licenses that are in here, all of them have been found by Fedora to be okay. And if you look at the bad ones, then, well, they wouldn't accept those licenses. And if we go again to the browser, so we look at the RPM, we look at the tar file, and instead of going to license, or let's go to license, so you can see that. We saw that before. But if there are many different licenses, like I said, it can get pretty long. But if you go to license groups, then it's much easier to see what you really care about. So that worked yesterday. Let's try with this one. Yeah, I don't know why that's not working. But basically, instead of showing each license, it would just show you the license groups it has found, and it would group the licenses by those things, and you can just click on what you care about, and then you will see only that. So it's pretty useful, I think.
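Grouping licenses into categories, and then hiding the category already judged acceptable, can be sketched as follows. The group names and their contents are placeholders for illustration only. They are not an official Debian, Fedora, or FSF list, and "Example-Restricted" and "Unknown-1" are invented names.

```python
from collections import Counter

# Placeholder groups for illustration, not an official list of any project.
GROUPS = {
    "good": {"GPL v2", "LGPL", "MIT"},
    "bad": {"Example-Restricted"},
}


def summarize_by_group(found: Counter) -> dict[str, int]:
    """Collapse per-license reference counts into per-group counts."""
    summary: Counter = Counter()
    for license_name, count in found.items():
        group = next(
            (g for g, members in GROUPS.items() if license_name in members),
            "unclassified",
        )
        summary[group] += count
    return dict(summary)


def needs_review(found: Counter, ignore: str = "good") -> list[str]:
    """Licenses left over after ignoring one whole group."""
    return sorted(name for name in found if name not in GROUPS[ignore])


if __name__ == "__main__":
    found = Counter({"GPL v2": 30, "MIT": 2, "Example-Restricted": 1, "Unknown-1": 1})
    print(summarize_by_group(found))
    print(needs_review(found))
```

With a long list of findings, the second function is the useful one for a reviewer: everything in the accepted group drops out and only the remaining licenses need a closer look.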
If we think about using FOSSology for things like NEW, then we could define what we consider as free, what we consider as non-free, and then, if there is an upload in NEW, you could just stick that into FOSSology, and then you can take a look at the results, and it makes it much easier. So obviously, for maintainers, it can also help you generate the Debian copyright file, because instead of having to look for everything, you could put it into FOSSology. You can see, okay, it's GPL here. That's simple. But then you can take a look at the other things that may be different licenses and things like that. So it just gives you a good overview, and then you can still go to show, you can look at the individual files. I mean, you still need to do that. It's not magic. I mean, FOSSology is not going to just give you a Debian copyright file you can stick into your package and be done. I mean, that would be nice. We'd like to have that, but that's just not reality. You always need to look at it. I mean, there are... FOSSology, like I said, is being actively developed. So we are working on improving both the accuracy and the speed. So it's going to get better. I mean, it's pretty good already, but it's definitely improving. So with the next version, we're actually going to change the whole algorithm of how to look for licenses. And... Yeah. So I don't know. Are there any questions at this point, maybe before I close the browser? Could you give me an initial copyright file that I could use? Well, yeah. So at the moment, it doesn't, but that would be something, I guess, that would be pretty easy to produce. So like I said, all of the analysis is stored in the database. And at the moment, the reporting that's being done is, as you can see here, the web interface. But we had some conversations, for example, with Siemens. They would like to have some printed results, some PDFs of the results. So they're going to work on doing that.
And so I think it would be pretty easy. I mean, you have the database, everything is stored in there. You could just connect to the database and get the information and produce something that can be used. But you still need to check it. The problem that I have is we have Open64, which is a C compiler from SGI. And we did a lot of work to create the package. And when we uploaded the package, it was rejected by the FTP master because the Debian copyright file was incomplete. Mm-hmm. Yeah. Yeah. Well, actually, so maybe I will just talk about the features that are being developed because they, I think, also relate to what Debian might be interested in. So, like I said, the algorithms that are currently being used are going to be replaced with something else, or at least the new ones are going to be offered in addition. Another thing, at the moment there are those license categories, and that's going to be replaced by something we call buckets, which can be used to create license categories, but it's actually much more flexible. So, for example, in some cases, if you look at the template, if you do the matching, it might find something like the BSD license, but because of some other reason that you can define, it might actually be an MIT license. And so with buckets, you can define those things, so it's much more flexible. And another thing, it's not exactly what you asked for, but I think that's also important for us, for Debian. At the moment, the way FOSSology works, it goes through the source code, and it looks for licenses. But the question is, well, what about those files that don't have any license information? And at the moment, we don't show that. But that's something we really want to do, because especially those files might be something you care about. But what does that mean? Does that mean it just inherits the license from, you know, is there a main copyright license? Or does that mean maybe you can't distribute that file?
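The earlier point about pulling stored results out of the database to produce a report, together with the question of files that carry no license information, can be sketched in a few lines of Python. This is a hypothetical example: it assumes the results have already been read into a mapping from file path to the list of licenses found, and the plain-text output layout is invented. It is neither FOSSology's reporting nor the Debian copyright file format.

```python
from collections import defaultdict


def draft_report(results: dict[str, list[str]]) -> str:
    """Group files by detected license and list files with no findings."""
    by_license: dict[str, list[str]] = defaultdict(list)
    unlicensed: list[str] = []
    for path, licenses in sorted(results.items()):
        if not licenses:
            # No license text or reference was found in this file.
            unlicensed.append(path)
        for name in licenses:
            by_license[name].append(path)
    lines: list[str] = []
    for name in sorted(by_license):
        lines.append(f"License: {name}")
        lines.extend(f"  {path}" for path in by_license[name])
    lines.append("No license information found:")
    lines.extend(f"  {path}" for path in unlicensed)
    return "\n".join(lines)


if __name__ == "__main__":
    print(draft_report({
        "src/a.c": ["GPL-2"],
        "src/b.c": ["GPL-2", "MIT"],
        "README": [],
    }))
```

A report like this is only a starting point. Someone still has to decide what the files in the last section mean and check every other entry by hand.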
And I think that's also something we in Debian care about. And then another thing which might be useful is that FOSSology is going to be able to look for copyright and author information. Because again, some licenses say that you need to give credit. So we need to know who to give credit to. So we need to know who has actually written that code, and for Debian, because we need to put that into the Debian copyright file. Again, that's something that would be really useful. I mean, that might make the work much easier for the maintainers. Another question is, how does FOSSology deal with dual licensing? Well, it just shows you both. So if you remember, I don't know if I have an example here, but it would just show you both. Well, up here we had some examples where it said 97% GPL v2 and then it said 100% GPL from FSF, and in the case of dual licenses it would just say 100% match GPL and 100% match something else, and then you could take a look at the file. I mean, I guess you could, maybe that's done, I don't know, you could also define phrases like "dual license" or phrases that are being used to refer to that, and then you would also see the phrase "dual license". And then when you look at it you see, oh, there is GPL, there is another license, and there is the phrase "dual license", and that sort of gives you an idea, but you still need to actually look at the file. But yeah, maybe that's something that could be improved, but that's how it would be displayed at the moment. And the other thing that I wanted to mention is not really related to FOSSology. That's a different approach.
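The planned lookup of copyright and author information mentioned above could, in its simplest form, be a pattern search over the source text. The sketch below is an assumption about what such a search might look like, not a description of the planned FOSSology agent, and the sample names in it are invented. Real copyright notices vary far more than this single pattern covers.

```python
import re

# Simplified pattern: "Copyright", optional "(C)" or "©", years, then the holder.
COPYRIGHT_RE = re.compile(
    r"copyright\s+(?:\(c\)\s*|©\s*)?"
    r"(?P<years>\d{4}(?:\s*[-,]\s*\d{4})*)\s+"
    r"(?P<holder>[^\n*]+)",
    re.IGNORECASE,
)


def find_copyrights(source: str) -> list[tuple[str, str]]:
    """Return (years, holder) for each copyright line the pattern recognizes."""
    return [
        (m.group("years"), m.group("holder").strip())
        for m in COPYRIGHT_RE.finditer(source)
    ]


if __name__ == "__main__":
    sample = (
        "/*\n"
        " * Copyright (C) 2007-2009 Jane Example <jane@example.org>\n"
        " * Copyright 2008 Example Corp.\n"
        " */\n"
    )
    print(find_copyrights(sample))
```

Output like this could feed the credit list a license requires or the copyright section of a package, but only after a person has checked it.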
So FOSSology is the approach where we have the code and we need to know what's in there, so it does the matching and things like that. And then there's the other approach, like the machine-parsable copyright file, where you do the analysis and then you generate one file which is the authoritative file you can use. And for us, we need FOSSology because those authoritative files are not there. But for us it would be great if, instead of having to look through all the code, you could just take, like, RPM has a spec file or Debian has the Debian copyright file, you could just take that file and you know everything. You know who owns the code, who has written it, what the licenses are, what the exceptions are, who you have to credit. But obviously doing that for, you know, all the packages out there is a major effort. But it's something we've been thinking about, trying to work with some people who care about this stuff, like Debian. The Software Freedom Law Center would also be a good place because they have good recommendations about how to do licensing. Maybe we can come up with some kind of standard, and maybe that's similar to the machine-parsable Debian copyright format, or maybe it needs to be extended in some way. But I just wanted to say that in addition to Debian there are other people, there are companies, who would definitely be interested in doing something like that, and if we can work on that, it would make life easier for everyone. There is also a project in Finland called Validos. It's the same story. Like I said, every company that ships free software needs to look at the licenses to know what they're shipping and what obligations they have, and every company does the same work. And there are some things that are different, like one company might say, well, if it's GPL we're not going to ship it, or another company might say, well, we ship GPL but we don't touch GPL v3. There are different things. I mean, it's the same with Debian: we have a slightly different interpretation
to the FSF or to Fedora. I mean, we can't agree on that. But there are some things that every company needs to do, like validate that things are okay. So the Validos project in Finland is basically a couple of companies that said, well, why don't we share our resources, which is, pay the lawyer once to do that stuff, to validate the code, and then where we differ we can just make that decision ourselves. So at the moment this project is members only, so you have to pay, I mean, obviously, because it's pretty expensive to do that kind of work. But hopefully some of the things they do will be given back to the community. So we're currently trying to work with them to see how we can work together and what we can give back. I mean, HP is not involved in Validos at the moment, but we are talking to them because they do something we're interested in as well. So yeah, there are those two approaches. So I think for now FOSSology would make life easier for people, for packagers who need to go through especially large packages. You can at least see roughly, you get an overview, and then maybe we could export like an initial Debian copyright file that then still needs to be checked and improved. I think that should be possible. So that's basically it from my side. So I really just wanted to give a brief overview. That's the tool. There are Debian packages, so you should be able to install it. If there is enough interest, maybe we could set up something like fossology.debian.net and maybe later make it an official service. It really depends on whether people find it interesting. But the thing is that it's actively being developed, we really care about getting other people involved, and it's going pretty well actually. So you have big companies like Siemens, Nokia, to some degree some banks in the States, that have been looking at FOSSology, and some of them have slowly started contributing code. And we continue to talk to other companies and communities. So I actually just recently found out that the FreeBSD
people, they are obviously doing something. I still haven't found the time to look at it in detail, but they have started to use FOSSology for some of their license analysis for the FreeBSD ports. And I know that there was one gNewSense guy who was interested in using FOSSology to find out about licensing. So there seems to be pretty good momentum at the moment. So if Debian developers want to become users of FOSSology and if there are things that are missing, I think it's quite likely that that's something that could be resolved and worked on. Does FOSSology look at binary files like fonts and stuff like that? At the moment it mostly does source code, but that's a very frequent request, and the algorithm doesn't care about whether it's text or binary. So it can look at binaries, you just need to actually do something useful with it. I mean, FOSSology as I've now shown it, at the moment it mostly does the license detection thing, but it's really a framework to analyze code or text or software or whatever you want to call it. So yeah, you could definitely look at binaries and do some interesting things. At the moment that's not being done, but there is some interest. So I mentioned the code reuse thing, where you copy code into different code. Again, that would be based on source code, but it would also be interesting to do that on a binary level to see what's being linked, especially in the binary. So that's somewhere on the roadmap, but probably not like the next thing. Okay, if there is nothing else, then thanks very much for your attention. And, I mean, you can install the Debian package. It's in unstable. Matt Taggart did it. And if you have any questions, there is a mailing list, or you can ask me and I'm happy to put you in touch with my colleagues. They're really good people and very responsive, and I'm not just saying that because I work for HP. It's really a good project and there is momentum, so now
is the time to do something. It's kind of interesting because we've been talking to other companies. That's the fun part of doing these things, because we have our own policies, we have our own tools, we know how we do things, but we are not perfect, and there are other companies that do the same. And it's interesting because now we can talk to them, and we found that, you know, most companies have some tools to detect licenses. I mean, all of them need something like that, but really everyone has their own solution. But by making FOSSology available under the GPL, we hope that people are going to standardize on that, and there is one tool instead of everyone reinventing the wheel. And there is quite a bit of interest both from companies and projects. Okay, thanks again.