Hello, thanks for coming to this talk. Hopefully we can have a little bit of discussion, either at this talk or later on on the mailing list. This talk is structured in four parts. We will talk a little bit about background, what machine learning is, or at least what it is in the context of this talk. Then a little bit about threats and a little bit about opportunities. And in a nutshell, this talk is not saying the sky is falling over us or anything. Not doing anything is really an option. The threats are maybe five to ten years from now, really. And some of the threats can be addressed outside of Debian. But I truly believe that the opportunities are more at the distro level. And that's also something you are welcome to disagree with. So what is machine learning? Machine learning has this issue that people take it a little bit religiously, as if the computer just learns stuff. In reality, it's just statistical modeling with a focus on predictive uses of the statistical model. The most common case is that there is a phase called the training phase, or you can also call it estimation. For the way we normally work, we can call it more like a compilation phase, where we take as input vectors of features, including some target feature. That's what we could call the feature data. And the output is a trained model. In general, this thing is fairly large. The trained model, you can think of it as a statistical summary of the input data. And then you have an execution or prediction or interpretation phase, where the input is a vector of features without the target feature, plus the trained model, and the output is a predicted target feature. So for an example that I'm familiar with, and something I may want to package at some point as part of the Debian Science linguistics task, there is the Stanford syntactic parser. It's written in Java, it's GPL licensed, and it's fairly mature code.
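To make the two phases concrete, here is a toy sketch in Python. Everything in it is invented for illustration (a nearest-class-mean "model" over made-up feature vectors); it has nothing to do with the actual Stanford parser, but it shows the shape of training (features plus target in, statistical summary out) and prediction (features plus model in, target out):

```python
def train(rows):
    """'Compilation' phase: feature vectors + target feature -> trained model.
    Here the model is just the per-class mean of each feature."""
    sums, counts = {}, {}
    for features, target in rows:
        acc = sums.setdefault(target, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[target] = counts.get(target, 0) + 1
    return {t: [s / counts[t] for s in acc] for t, acc in sums.items()}

def predict(model, features):
    """Execution phase: feature vector (no target) + model -> predicted target.
    Picks the class whose mean is closest to the input."""
    def sqdist(mean):
        return sum((a - b) ** 2 for a, b in zip(features, mean))
    return min(model, key=lambda t: sqdist(model[t]))

model = train([([1.0, 0.0], "gene"), ([0.9, 0.1], "gene"),
               ([0.1, 1.0], "protein")])
print(predict(model, [0.8, 0.2]))  # gene
```

Note that the trained model here is tiny and readable; as the talk goes on to argue, real trained models are usually neither.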
It's surprisingly well-written code for something coming from natural language processing. It's a probabilistic context-free grammar parser. And the trained model takes two megabytes. The source data is the Penn Treebank that was assembled at the University of Pennsylvania: several years of the Wall Street Journal annotated with syntactic data. It takes a whole CD-ROM. So you go from 600 megs to two megs. And that source data is only available under a closed license; you cannot distribute it, et cetera. And this is what the output of the parser looks like. We take a completely random sentence, not coming from the Debian website at any point in time, like "an operating system is a set of basic programs and utilities that make your computer run." And here you have a sentence, and this is the subject and the object, et cetera. Clearly, in general, a syntactic parser is just a part of a larger system, for example summarization and other activities. But this is the type of software we may want to distribute in Debian. And the issue is that some of these models are actually easy to understand and modify by hand. For example, in an earlier job I had, we were doing word-sense disambiguation, and we were using an algorithm called C4.5 rules that produced nice-looking rules like this. So for example, it says: if the words after the name include "encodes", and the words before don't include "encodes", then most likely it's a gene. Because genes encode stuff, yes, while proteins are encoded, and things like that. These are rules that you could consider a preferred form of modification, because you can understand them and change them. On the other hand, most models are un-understandable models, or, as people told me I should say, incomprehensible models, that are not really intended to be understood or modified by hand. There you have things like neural networks, although they are not that popular these days, support vector machines, Markov models, conditional random fields.
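A rule of that C4.5 flavor is readable enough that you could write it, and edit it, by hand. Here is a hypothetical reconstruction in Python of the gene-versus-protein rule from the talk (not an actual trained rule set, just the shape of one):

```python
def classify_mention(words_before, word_after):
    """One hand-readable rule in the spirit of C4.5 output.
    Hypothetical reconstruction of the example in the talk."""
    if word_after == "encodes" and "encodes" not in words_before:
        return "gene"      # genes encode things...
    return "protein"       # ...while proteins are encoded

print(classify_mention(["the"], "encodes"))        # gene
print(classify_mention(["encodes", "a"], "is"))    # protein
```

This is exactly the kind of artifact where "preferred form of modification" is easy to argue: a human can read the condition, disagree with it, and change it.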
A model like that is a large collection of floating point numbers. They are very, very opaque. There is no real intention for them to be dealt with by hand. You most probably are familiar with neural networks, because they are still quite widely taught. You have some input features coming in here, and you have these different weights that in this diagram are represented by different thicknesses of the lines. All these numbers get multiplied by these weights, a function gets applied in each node, and everything is fed forward until you get to the different output classes. So in total, if you have n nodes here, m nodes there, and l nodes there, you will end up having something like n times m plus m times l floating point weights. This makes a huge, huge space of possible weight representations. Just the weight space for a toy neural network looks like this, and if you go to neural networks that actually do something interesting, you get an n-dimensional object that I guess Edgar will be the only person who can visualize. Support vector machines are even worse, because the concept is to find a hyperplane that divides positive from negative training data. But most SVMs use this kernel trick, where they map features into a higher-dimensional space, with the idea that if the mapping is smart enough, then you can find an easy separating hyperplane. This is fairly deep technical stuff, but the main point is that you're not really supposed to look at a support vector machine and say, hey, I could modify this hyperplane in 10,000 dimensions a little bit and get a better model. What you do is retrain from the source data. To make things even a little muddier, feature vectors, in general, are not unlike generated Yacc or Bison C files. Yes, you have this difference between the training data, which is the transcribed speech, for example, and the feature data, which is these little wave segments with associated transcriptions.
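The n times m plus m times l count, and why the result is just a pile of opaque floats, can be sketched in a few lines of Python. This is an illustrative toy (random weights, sigmoid hidden layer), not any particular network from the talk:

```python
import math
import random

random.seed(0)

def forward(x, w_hidden, w_out):
    """Feed-forward pass for a tiny n -> m -> l network.
    The 'model' is nothing but the floating point weights."""
    # hidden layer: weighted sum through a sigmoid at each node
    h = [1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(row, x))))
         for row in w_hidden]
    # output layer: another weighted sum per output class
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_out]

n, m, l = 4, 3, 2
w_hidden = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
w_out = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(l)]

n_weights = n * m + m * l    # the count given in the talk
print(n_weights)             # 18
print(forward([1.0, 0.5, -0.3, 0.2], w_hidden, w_out))
```

Even for this toy, the 18 weights mean nothing individually; editing one by hand, rather than retraining from the source data, is not a sensible operation.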
Or, for example, if you have a trainable spelling corrector, your training data is the Wikipedia history and your feature data is the edits that modify a word by less than two characters. This stuff is fairly detailed. Even if we just distribute the feature data, it will be way ahead of what is happening right now. OK, so what are the threats? I distinguish between threats to freedom and threats to practicality. The main threat to freedom is obsolescence. Yes, slowly the users are going to get accustomed to applications that rely on these large training sets. And this is not unlike the type of threat being addressed by the FreedomBox Foundation for cloud services. So for applications such as optical character recognition, like book scanning; speech recognition, that would be dictation; computer vision, automatically tagging your friends in photos; and question answering, Siri and Watson. If people start expecting these types of things to run on their devices, well, then we are going to lag behind quite a bit. And overall, what we see here is that there is diminishing value behind source code and more value in data. Facebook, LinkedIn, Google+, Flickr. And that's something from the DPL's talk this morning, where he was saying, well, when you are using these platforms, you are giving away your private information. Moreover, you are actually empowering these players to build better models. So it's not only that you are hurting yourself by disclosing your private information, you're also hurting the free software cause by enabling them to build better tools that we won't be able to compete with without having access to all that data. And I don't know if you are aware, but there are these data vendors. You can go to the Infochimps marketplace and buy a nice chunk of 100,000 LinkedIn profiles and things like that. They are fairly expensive, but you are being sold right now.
And again, even if we had the money to buy these, we could distribute the models as free software, but we won't be able to distribute the data. The Penn Treebank data, for example, is released by the Linguistic Data Consortium, which is a nonprofit sharing the maintenance of this data. But still, most of the data distributed by the LDC is proprietary, because the newspaper articles belong to the Wall Street Journal. So even if you want to make the annotations free, people need to get the newspaper articles. And to me, these issues of machine learning are yet another clever GPL circumvention trick. In the last decade, we have seen plenty of smart ways to navigate around the GPL. In this case, you have the vendor releasing the source code of what at the end of the day ends up being just an interpreter, and keeping the data behind the trained models completely closed. So the fundamental freedoms of being able to study, modify, and adapt to your own uses are being held back from the users. And to me, the exact metaphor and analogy here is binary firmware blobs. Because in general, machine learning models have been treated so far in Debian more as video game assets. But in reality, they contain decision information that is used by the code. Now, if we were to go down the rabbit hole and consider the data and the building of the model as something we want to treat as completely part of Debian's programs, then we have the issue that this is not similar to anything else we are doing in Debian at this time. Training machine learning models takes a whole different type of build machine. It's not unheard of to use 64-gig machines for three days. So to redo this work, well, we will really need to have a very, very good reason, and most likely we'll need to have a different type of sponsor for that. And it becomes even worse when you start talking about distributing this training data.
Because, as I was mentioning, well, you have some data, for example, derived from Wikipedia, and you may end up having to host that type of large data file. This issue has already been discussed a few years back, in 2009. There was a post from Mathieu Blondel to the debian-legal mailing list. The first thing he asked was whether Debian can ship models in main without distributing the original data. And the answer provided there was yes, because the model is considered the preferred form for modification. The objective of this talk is to say, well, this answer is something we may want to revisit moving forward. The reasoning there was that this is the same thing as games that ship two-dimensional images rendered from a 3D model, without distributing the 3D model. To me, that's different, because the two-dimensional images don't contain any logic of the program. The second question was a little off-topic for Debian; it was about fingerprinting of the data, and there were some answers related to that. One of the interesting things about this thread is that it was a massive flamewar. And that, in a sense, informed me about the dangers of giving this talk, so I hope we have a fire extinguisher nearby. But there was a quote, for example, that said: free data is important for the very same reasons that free programs are. And going back to the fundamental freedoms of modifying and studying and adapting, I personally agree very much with that quote. But on the other hand, it is a slippery slope. As another quote says, well, if you won't ship models because you don't have access to the source data, then you shouldn't ship pictures either, because they are initially a photograph of an object, and the preferred form of modification is the original object. If you want to see it that way. And of course, this was just flaming.
But it is indeed the case, particularly when you go back to what I was saying about the training data and the feature vectors, et cetera. I did a little bit of a search within the Debian archive, and I didn't find anything that I would really put my finger on. This chemistry kit seems to have some data there, but I'm not really familiar with it. I know for sure OpenCV has a face detector that can say, in this photo, there is this square where there is a face, and that's been trained from a library of photos. But I couldn't really find where in our distribution of OpenCV that data is. And I know that the UIMA sandbox has some trained models, but I'm not familiar with whether we are really distributing the sandbox at this time. So truly, I mean, this is not an issue we have right now. There is a very nice dictation project that I'm going to talk about in a bit that is now distributing GPL data. And well, in that case, if we want to package it, we have to do something about those source audio files. On the other hand, these are huge opportunities for Debian. In my perspective, the main challenge for Debian is to turn users into contributors. Yes, how can we make people get more engaged with the project? In a sense, we can try to get contributors that will push new training data. When they use a piece of software and the software underperforms, they can help make the software better by sending data files. So I think that could mimic the success case of the many translation teams we have, for example. And data contributors can, in some sense, send data patches that will fix a bug by improving the model. This is very tricky, because the statistical models tend to be asymptotic: every time you retrain them, the model will improve overall, but while it improves in some cases, it will get worse in others. But still, this is a very good way to bring people into the project and make them feel they have more ownership of the software they use.
And moreover, something that I would like to see too is more inter-distro collaboration. There is a lot of work going on between Debian and its derivatives, particularly Ubuntu, but it would also be interesting to see collaboration with RPM-based distributions. One of the things that is very nice about training data is that sharing data is usually much easier than sharing code, because the format of the data seldom changes. In a sense, that's the same insight that leads to object orientation: you encapsulate data because data is more stable. For example, all syntactic parsers in the last 15 years or so have been trained on exactly the same Treebank data set. So there are certain data sets that are fairly stable, and sharing annotation work should be easier than sharing source code patches. Now, here are some questions. How can we acquire this data? Do we want to build something like a free software, volunteer-driven, Mechanical Turk-like tool? I'll explain on the next slide what Mechanical Turk is, if you've never heard of it. There are already existing initiatives like LibriVox, where volunteers read books aloud from Project Gutenberg. And the second question is, how can we ensure the data and its derived model are kept free? We may need something similar to the Creative Commons soup of licenses. And there is, of course, the question of whether just applying the GPL to the data will be enough. These questions completely exceed me. I don't have the feeling that the GPL is enough, but that's what's being used by other projects. OK, so Mechanical Turk is part of the Amazon Web Services offering. It's a commercial, proprietary platform where you write a task that is easy for humans, like telling whether there is a person in this picture, but difficult for computers. You have paid workers that are called, in the slang, Mechanical Turkers, and they do these tasks for really, really, completely unacceptable, tiny wages. And of course, this poses plenty of ethical issues.
The first is whether this is just pure exploitation of the Turkers. And even more interesting is, well, maybe these people are being paid to help develop systems that go against their own moral values, but they just don't know what they are doing, because the tasks are so small and precise. This website, VoxForge — I found out about it very, very recently — contains a large number of GPL transcribed speech samples. The objective is to put together state-of-the-art dictation systems using free software, and the intention is that the acoustic models that are derived from these sound samples have to remain free. Interestingly, the state-of-the-art tool to train the acoustic models, called HTK, is proprietary. There is still no free software equivalent for it. So basically, this would be the compiler, and there is no free compiler. But there is a speech recognition engine called Julius that is free software, and you can download a demo they have. But if we want to package the VoxForge models, we also have to distribute all their source data, all those sound files, and we would have to build the models at build time. That will be difficult, because HTK, even though it's freely available, requires you to sign a license to obtain it. These are questions that have been answered in Debian before, but it might be the case that for these types of packages we don't need to distribute the source data ourselves, just make sure it is available on their site. OK, so the first thing is, don't take it personally with me. I'm not proposing to change anything as things are right now; it should be OK for the time being. I still think it would be nice if we could find other people who are interested in this problem, who would be willing to talk with, maybe, the Software Freedom Law Center about licensing, or about how to proceed with some inter-distro projects.
And overall, I still think it would be good if we could revisit our current policy and transition away from considering trained models the preferred form for modification. So, in a sense, packages that contain models for which we don't have access to the data should just be put in contrib, because they are freely available, but the users cannot make full use of them without access to these proprietary resources. And we may want to differentiate between the Debian archive hosting all the code and assets versus also hosting all the source data. That's a little more difficult; Debian really likes being completely self-contained at the archive level. Still, there are other organizations, like archive.org, that specialize in hosting large amounts of data. And finally, I don't know if you heard, but the first Clang 2.9 and 3.0 recompilation of the whole Debian archive that was undertaken recently was done on a European resource called Grid'5000. So there are large clouds available for research use around the world that we may be able to get time on to train models, if we approach them as an organization. But what do you guys think? I had kind of a comment and a doubt about the GPL and these things. It seems to me that there's a problem that's common to metaprogramming. The model is not being coded by people; the model is output. As I understand the GPL, the data set is kind of like the source of the model. So if you are distributing the code to a model that's been GPL'd, it seems to me you would be required to distribute the data set too. Yeah, that's the understanding of the VoxForge people, yes. And they seem to be very confident that the GPL is all they need to protect the data from that type of usage. I find that metaprogramming introduces very odd questions, because there are questions such as who creates a piece of code.
If you distribute software that does metaprogramming and somebody else runs it on their computer and generates code, and they're doing nothing but running it, yet novel code comes out, then who the hell does that belong to? No, no, that type of thing I think is covered by the concepts of the compiler, the target, and the source. I mean, GCC is GPL, for example, and you can use it to compile any code you want. That's not so much the issue; it's more about whether, in this case, the terms source code and object code used there really apply when you have these speech transcripts as input and an acoustic model as output. My bet for that would be GPLv3, I think. A lot of the changes in language make it a lot easier to apply to things like this. So I wanted to thank you for doing this talk, because I think it's a really interesting challenge for us going forward. I don't have any clear answers, but to point out another analog to this kind of problem, which I think we're going to see more and more: fonts. We are already struggling with this a little bit in the fonts packaging team: what if someone has a font that they developed using proprietary font software and they distribute it as a TTF? We can modify TTFs directly. We have all the information of the font. It's entirely modifiable by us, but it's not their preferred form of modification. So that's one issue. But the part that's closer to what you're describing is that there are now a handful of auto-kerning services being run on the net, where you take a font that you haven't actually kerned — that is, you haven't specified the spacing between letters — and these auto-kerning services can tune up your font and make it look better automatically through some algorithmic process. So you just got this big change to your font, in whatever format you had it, and now you've got a new font. Is that a free font or not? Because you had to pay this proprietary service to do it.
And it's not clear what we can redistribute if we don't have the auto-kerning functionality in Debian as well. So there are a bunch of places where these kinds of questions come up. And is that being discussed in debian-legal or just within the fonts packaging team? On the fonts packaging team, we've discussed more the question of, okay, did the author modify this themselves with a free tool, or did they use some other preferred form of modification? We haven't really gotten into the issue of the auto-kerning services; I think that's relatively new. I don't know, other people on the fonts team might have more history on that. Hi, I'm afraid I'm gonna be bad: I don't really have a question, but I've got quite a sequence of points to make. I mean, I agree with a lot of what you've said, but as another machine learning person, I'd like to give a slightly different perspective on a few things. First of all, you kind of seem to be throwing together all machine learning models here as being black boxes. Yeah, I was really trying to; I didn't wanna spend the whole talk on that. Well, I would agree with that, say, for neural networks and support vector machines, but a lot of models, for example generative models, you can go in and interpret, and they do mean something. So there are different sides here. There could be some grayness in the middle, but there are some models where it definitely does mean something once you've learned it. And the second thing, about the data sets: to be provocative, you seem to be a bit stuck in the past in thinking that we have these big, perfect data sets that are of great worth, that we really need to care about, and that are fixed forever. I mean, yes, you mentioned syntactic parsers still using the same data sets that have been around for decades, but that could be because syntactic parsing is kind of irrelevant and gone by now.
No, no, actually the progress of syntactic parsers is stuck because they are tied to that data. Yeah, but I mean, all the current translation software, say, doesn't try to use syntactic parsing. It just uses lots of data. And in a world where actually what you just want is a lot of data — I mean, yes, I agree it depends on the kind of task. So say for something like voice recognition, there is a genuine issue, because — well, maybe LibriVox could be a solution, but at the moment collecting good voice data costs money, and there hasn't been a community initiative to do the same. But for a lot of the problems we're actually worried about at the moment, I don't think it's as bleak as you're painting it, because a lot of the things we care about at the moment are actually working with internet data, with web data. And in many cases, the people working on these wouldn't even think of building a fixed dataset that they redistribute. What they tend to have is a script that fetches, say, some collection of web data that they then use. And obviously, yes, in principle it would be nice to be able to say the dataset is free and you pass it around, but for one thing it's not just a question of freeness versus proprietary, because a lot of the time no one has permission to redistribute that data. If you fetch a million images from Flickr, or if you fetch all the text pages you can find on the web, no one possibly has permission to do that. And yes, you can't distribute the data. But in general, after they do that fetching, they will go for some sessions of Mechanical Turk or stuff like that to annotate it. No, not necessarily at all. A lot of methods use completely unannotated data. Yeah, well, in natural language processing, people still annotate stuff if you want to get labels.
Yeah, but again, the most promising techniques recently have not needed that. Oh, but that's just a research bias. I'm talking about the papers I saw three weeks ago at the NLP conference. People still use plenty of supervised learning. Yeah, yeah. But again, I'm saying it is a problem for some categories, say for voice recognition, or if you really care about parsing syntax trees so you can draw a nice structure of your sentence diagrams. But you seem to be suggesting that for all machine learning methods we would have a problem with this in Debian. Yeah, I mean, the thing is, you have to see also that engineering trails research by maybe 10, 15 years or even more. I mean, most people, when you say machine learning, think of neural networks, when nobody has been using neural networks for the last 10, 15 years. But I put the neural networks in because I wanted to communicate with the people in the audience. And maybe one final thing before I shut up. On Daniel's point about the font packages, I don't believe that's a precise parallel to passing around a model that you train, because what you're talking about with the kerning question is a magic service that does something that you don't really understand. However, there is a machine learning parallel for this, which is that Google have, in the last few years, had some service where you can throw data at them with labels, and they will give you back a model that predicts the labels. And in this situation, that is pretty scary. You have no idea what method they're using. I mean, even if you didn't care about freeness, next week they might completely change the method, which might improve your results or might make them much worse, and so on.
So that is a thing we would need to be alert to. I think if you have the kind of model that came from that kind of service, then in my opinion that effectively is a non-free model, for the reasons you're picking up. Although I think there could be other types of model where it's okay, and it is more similar to the question of whether distributing a JPEG is okay. So I wanna push back a little bit on one of the things that you said. You said you were saying it to be provocative, so I'm provoked. It seems like you're saying it's not that bleak: there are a lot of machine learning techniques that don't rely on these closed, proprietary datasets; they just pull random stuff off the web or whatever. But our build daemons don't have network access, and even if they did have network access, we want repeatable builds. So either we don't ship any models at all in that context, and then it's the user's job to build the model from their own network connection later — which is particularly un-useful if someone just wants something that will do voice recognition, or something that will do analysis of internet text — or we have these weird non-repeatable build situations, where you can't actually get the same model the next time you build the package, and that's not acceptable in the Debian archive at the moment. And we also want legal builds. These companies can do that because they keep it secret. Yes, if we publicly come out and say we are distributing this file that is a derivative of this blog that says copyright blah, blah, blah, and there is a log on the Debian build machine that says, well, we fetched the blog of this person, we may just get sued. Yeah, I have a few points on this ping-pong thingy. One of them is that, with the fonts, and the models generated from data, and so on, there are two trends that I think are really important, and one of them is the blurring of the line between data and code.
And the other one is machine generation, possibly distributed. I think the IP model — the intellectual property model — is very much based on humans generating stuff, and specifically humans generating code, with code being very well separated from data. So as these lines blur, I expect the IP model will be increasingly strained and will fail to work in very weird ways. And about the builds. Obviously yes, if you're making a model, you can't expect the build machines to make the model for the build. What I would expect is that if we start distributing models like this, and they're being automatically generated for builds, you would sort of have a subsidiary system that is in charge just of generating models, which keep evolving, and when you make a build you just ask that subsystem for the model that you want to include. Yeah, just on this question again. While I'm saying I don't think it's bleak, it's also partly because I think doing anything other than just saying models are okay is utterly unrealistic for some of this. Because basically, if I go off and collect some web data to train a semi-supervised model, say, then in a modern system the kind of data size that you would be looking at is a minimum of terabytes. And I really do not believe that Debian is going to decide as a project that we're going to insist on distributing terabytes of data around for particular packages. Oh, I forgot a comment about meaningfulness. I think any model is inherently understandable. The question is, is it easy to understand, and is it made to be easily understood? I think we have time — do you have a question? Okay, so from my experience, the worst part about... Talk closer. Yes, the worst part about machine learning is always finding the correct dataset. Now we are talking about not only finding a dataset that will do the job for us, we're talking about finding a dataset that also has a suitable license for us.
So for me, that is, first, a much more difficult problem than maybe is understood from your presentation. Second, if you get to do that, I'm sure most researchers around the world are going to thank you eternally for it. To me, that sounds like a really huge problem. But the issue of caring about the license is a cornerstone of Debian, and many users and distributions thank us for that. So sure, it makes our task more difficult, but it's a task we have set ourselves to do in the realm of software. Of course, I'm not arguing against that. It would be a very, very good thing. But I don't think it's feasible within a reasonable amount of time. That's what I mean. Why? It could take us a long while to get there, but once we get there, we will keep improving. Yes, and you will keep improving, and you will have many more people than just the Linux world. Okay. Just coming back on the understandability of models. There is a big difference, because, for example, if you train a support — I mean, I don't know if you still have your nice picture of an SVM, because probably no one else in the room remembers what an SVM is. Yeah, people show you nice pictures like this, and in two dimensions it's kind of understandable what it's doing. But in fact, when you've got your 10,000-dimensional data or whatever it is, the way an SVM works is that it ends up choosing an effectively arbitrary set of data points that define this separation line, and there is no fundamental meaning to those points. It's just found something that happened to work well with your training data, but if you move one point by a fraction, you get a completely different answer, and so on. Whereas some other methods are much more understandable, in that they're actually giving you statistical properties of your data, for example, so that you can actually update them. And you have an updatable model.
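As a concrete sketch of what "statistical properties you can actually update" might look like, here is a toy, invented example in Python (not any particular system): the model is just per-class feature counts, so every stored number has a direct meaning and new observations can be folded in without retraining from scratch.

```python
from collections import Counter

class UpdatableCounts:
    """Toy interpretable, updatable model: per-class feature counts.
    Illustrative sketch only."""
    def __init__(self):
        self.counts = {}   # feature -> Counter of class labels

    def update(self, features, label):
        """Fold one new labeled observation into the model in place."""
        for f in features:
            self.counts.setdefault(f, Counter())[label] += 1

    def predict(self, features):
        """Vote with the stored counts."""
        votes = Counter()
        for f in features:
            votes.update(self.counts.get(f, Counter()))
        return votes.most_common(1)[0][0] if votes else None

m = UpdatableCounts()
m.update({"suffix=-ly"}, "adverb")
m.update({"suffix=-ly"}, "adverb")
m.update({"suffix=-ly"}, "adjective")   # e.g. "friendly"
print(m.counts["suffix=-ly"])            # every number is inspectable
print(m.predict({"suffix=-ly"}))         # adverb
```

Contrast this with an SVM's set of support vectors: here you can read each count, argue with it, and change it, which is exactly the "preferred form of modification" question.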
In some other types of models, generative models for example, every individual number that your model has would actually have a meaning. Maybe thinking about all those numbers together is difficult, but it is still possible, in principle, to dig into it and work out what's going on, unlike with an SVM. But do you think you can actually modify it in a meaningful way? I mean, unless there are models that you can update, like an updatable database or something like that. Well, it obviously depends on what your task is and how complex it is, but yes, if you have got something like a hidden Markov model, which is heavily used in speech recognition — I think knowing what to modify and having a good idea about it would be pretty tricky, but if you somehow knew what was a sensible thing to do, it would be very straightforward to modify it. Yeah, that's the end-user speaker adaptation in speech recognition, in a sense, so yeah. I mean, again, just on this point as well: when I'm making the distinction between the models, I think this is a genuine thing we should think about, and maybe when we choose models in Debian for free software projects, we should try to choose ones which are interpretable. For example, people often love SVMs because they're a black box: there are lots of C libraries you can throw things into, and you get an answer out with good results. But other models, say logistic regression, often give almost as good results, basically the same results, but give a model that you can interpret. So if you're trying to publish a research paper that gets 1% better than last year's paper and gets you into the conference, then obviously you care about that tiny detail. But if what we really care about is free software and that kind of thing, then maybe we should be making an active decision to prefer models which are interpretable and
therefore avoid this problem. You have something? Very quick. You'll have machine learning about machine learning soon — ah, recursion. Why would I need to understand the model itself? I mean, it's like the source code of a program: okay, you understand it, but the binary, you don't need to understand. For me, quite often, yes, you won't need to understand the model, because you just know it works; you see the results. I don't see how the need to understand it is connected to the point of being free. And on that note, we can continue the discussion on the mailing list, offline. Thank you so much, and thank you for not flaming.