So welcome to the FOSDEM 2017 distributions devroom. We have Guus Sliepen presenting "Source code: are we not forgetting something?"
Thank you. Yeah, I'm a Debian developer myself, but the talk will be relevant for any distribution, and maybe also for people who have their own upstream projects and want to provide tarballs or whatever for people to use.
So I'm going to talk about source code, of course. What is it, and why do we want it? Just a reminder of things that we should already know. Then I'll talk about different kinds of files. When we talk about source code, we are usually thinking about something written in a programming language like C or Python. But there are many more types of files, and they all have their own quirks. Hopefully I have enough time, and then I can go into the appendix at the end and go through some file types in more detail. But I will also talk about the problems that we have with source code and some conclusions that I draw from them. I don't think I have answers for all the problems, but it's just a wake-up call for some people here, hopefully.
So first I will talk a little bit about what source code is. You probably all know it, but sometimes you want to have a good definition. And the one I think is very good is the one from the third version of the GPL, because it's very succinct. It says that the source code for a work means the preferred form of the work for making modifications to it. And object code means anything that is derived from this source. Notice that it says "a work" here, so it doesn't say specifically that it should be a program or an executable. You can use the GPL for anything in principle. Later on it has another remark. It says the corresponding source for a work in object code form means all the source code needed to generate, install, and run or use the object code and to modify the work. And this also includes scripts to control all these activities.
So basically what they mean here is that it's not enough to have just the source code itself, because you also need, for example, a compiler. You maybe need the Makefiles, everything that you need to produce the end result.
Okay, but most of you are working on your laptop or computer and then you're only using the end product. So why is the source relevant? What happens here? Okay, sorry, something goes wrong here. Technology, let's see. Oh, I see. I'm missing a part of the screen.
So we want the source code for several reasons. One is that we are required by the license of the work to include the source code. Another is that some distributions require it, notably Debian and Fedora and many of their derivatives. But the main reason why we have an open source community is that we want to be able to fix bugs in programs, we want to add new features, and we want to learn from things so that we can make new products, for example.
So just an example: the Debian Free Software Guidelines say as their second item that the program must include source code and must allow distribution in source code as well as compiled form. Well, it really says "program" there, but most Debian developers agree that this does not only apply to programs, it applies to everything. And of course, as I said before, you also want to have all the tools to be able to build the actual program. So Debian thinks these tools should all be free and have their source code available as well.
So I already said that we can learn from source code, we can study it, we can fix bugs, and we can, for example, port programs to different platforms. One of the great successes of Linux is that it was one of the first operating systems that was ported to the 64-bit Intel architecture, or AMD architecture. Because we just had a new version of GCC which supported 64-bit, and we could recompile everything from source. It took a bit longer for Windows, for example.
And the other thing is that we can ask someone else if we cannot do something ourselves. If I'm working with some program and it's written in Haskell, and I don't know Haskell, it will take me forever to figure out how it works and how to fix a bug. But I can just say to someone else who knows Haskell: here's the source code, do you know what is wrong? Without the source code, I cannot show anything to anyone else.
So, a list of file types. Executables, that's the trivial one, but there's a whole list of things. We have manuals. You can think, okay, a manual is just a piece of text, but manuals can come in the form of a PDF file, for example, and nobody is writing PDF code directly in their text editor. You use DocBook or LaTeX or something else, and then you compile it, in a way, to the result.
We have markup languages like CSS and Glade. If you want to style your website or you want to provide a user interface for a program, you have all these languages that you can use for that. But sometimes these are also auto-generated. So for CSS, you have this, what are the tools called? I forgot. For Glade, we use XML files, but you have all kinds of tools that can be used to generate them.
You have translations. Think of gettext. It's a very funny one, because you have the .po files, which are the files you actually write the translation in, but those are partially auto-generated and partially edited manually. From them you generate a binary .mo file, which gets installed and used at runtime to provide translations for your program.
Fonts. You think, oh, we have a .ttf file, a TrueType font, and doesn't it contain all the nice vectors to make a very nice outline of the font and render it? Well, yeah, but a font designer doesn't just draw some lines. He has all kinds of guidelines and rules, and a library of curves that he uses to create the glyphs of the font in a systematic way.
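The gettext flow described above, where a hand-edited catalog is compiled into a binary .mo file that the program reads at runtime, can be sketched in a few lines of Python. This is only an illustration under assumptions: `compile_catalog` is a hypothetical, simplified stand-in for the real `msgfmt` tool (no plural forms, no contexts, no hash table), and the two Dutch translations are made-up sample data. It does show why the .mo file is object code: nobody would edit these bytes by hand.

```python
import gettext
import io
import struct

MO_MAGIC = 0x950412DE  # magic number of a little-endian GNU .mo file


def compile_catalog(messages):
    """Pack a {msgid: msgstr} dict into the bytes of a GNU .mo file.

    Hypothetical helper for illustration; real projects use msgfmt.
    """
    # The empty msgid holds the catalog header, as it does in a real .po file.
    entries = {"": "Content-Type: text/plain; charset=UTF-8\n"}
    entries.update(messages)
    # .mo files store the msgids in sorted order.
    pairs = sorted((k.encode("utf-8"), v.encode("utf-8")) for k, v in entries.items())

    count = len(pairs)
    ids_table_offset = 7 * 4                       # right after the 7-word header
    strs_table_offset = ids_table_offset + 8 * count
    ids_offset = strs_table_offset + 8 * count     # where the msgid bytes start
    strs_offset = ids_offset + sum(len(k) + 1 for k, _ in pairs)

    ids_table, strs_table, ids_blob, strs_blob = b"", b"", b"", b""
    for msgid, msgstr in pairs:
        # Each table entry is (length, absolute offset); strings are NUL-terminated.
        ids_table += struct.pack("<II", len(msgid), ids_offset + len(ids_blob))
        strs_table += struct.pack("<II", len(msgstr), strs_offset + len(strs_blob))
        ids_blob += msgid + b"\0"
        strs_blob += msgstr + b"\0"

    header = struct.pack("<7I", MO_MAGIC, 0, count,
                         ids_table_offset, strs_table_offset, 0, 0)
    return header + ids_table + strs_table + ids_blob + strs_blob


# "Source": what a translator edits. "Object code": what the program loads.
catalog_source = {"Hello": "Hallo", "Save": "Opslaan"}
mo_bytes = compile_catalog(catalog_source)
translations = gettext.GNUTranslations(io.BytesIO(mo_bytes))
print(translations.gettext("Hello"))  # prints: Hallo
```

The binary is loaded here with Python's standard `gettext.GNUTranslations` class, so the round trip from editable catalog to runtime lookup is visible end to end. A package that ships only the .mo bytes has dropped the form a translator would actually want to modify.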
You have all kinds of multimedia, like images, sound, music, even movies. Databases: think of Stellarium, for example, a very nice open source program to simulate the view of the night sky. It uses a database of star positions, and these were recorded by NASA, for example. They are in a certain format, but they have to be pre-processed before they can be used.
So there are all these things, and some of them have source code. Some of them are their own source. If I just make an image in GIMP and save it as an XCF, or as a PNG if it's a single layer, then that is its own source, you can say. But if you have something more complex, like vector graphics with many effects made in Inkscape, then the PNG that you produce from that is not really the source. Again, the source is the preferred form for modification. That's another issue, because sometimes it's not really clear what the preferred form is. Hopefully I will get to that later, but I have to go through all the slides.
So this is a list of problems that you can have with source code; the header of the slide was cut off. As I said, it's not always clear what the preferred form is.
Sometimes the source is too big. For distributions, we only have so much space on the mirrors that host our packages. Blender has open movie projects, where making a movie gets sponsored. If you download the end result of such a movie, it's maybe a few hundred megabytes. That's already quite large. But all the sources used to produce that movie, wow, that's multiple gigabytes. So distributions cannot ship this, for example.
Sometimes it also takes too much CPU power to compile something. Again, these movies have to be rendered, for example. You need a compile farm and weeks of CPU time to actually produce the end result.
Sometimes the compiler, or whatever tool you use to go from source code to object form, is non-free.
Sometimes you have people who wrote something and then lost the source code. So you only have the end result. What do you do then? Do you just throw it away and say, no, we only accept things that have their source code? But sometimes it's so useful, and it was free in a way, so what do you do with this?
You have issues where the author says, oh, here I have something, it's GPL'd, I throw it on some FTP server and it's out of my hands. Then somebody asks, I would like the source code, and they say, no, I didn't distribute it. All kinds of things can go wrong.
Some authors use the wrong license. A case in point is Wesnoth. How many of you know this game? Okay, it's an open source game. It's a turn-based strategy game, and it has a very nice soundtrack. The game is GPL'd, and what is nice is that the author said, okay, all the resources of the game should be GPL'd as well. So not only the executables, but also all the data files, images, and so on. Then somebody produced a nice soundtrack. They have their soundtrack in Ogg Vorbis format, I think, and there are no source files for it. The soundtrack is made using software synthesizers and computers, and it's rendered. But the authors of a large part of the soundtrack just don't want to give out the sources, and they say, no, the Ogg file is the source. That is really strange, because this is not the preferred form for modification. But the Wesnoth people say, okay, then it's fine. So that's a big problem for us, for anybody who wants to learn from this, who wants to change something in the music or learn how to write music themselves.
Sometimes there is dependency hell. A case in point is WordPress. It installs some minified JavaScript files that are used when you view a WordPress site. The JavaScript files are generated from source, which is available, but you need Grunt or some other tool to process it and make the minified file. It's actually quite complex.
Grunt, in turn, depends on Node.js packages. So you need a whole bunch of dependencies. Debian had this problem, where suddenly they had to add something like hundreds of new packages to the archive, just to fulfill the requirement that all the tools used to build this minified file are in the distribution. Wow.
So what should you do, either if you're a package maintainer in a distribution or if you're someone who provides an upstream project? Please ensure that all the source code is available. If it's not, then try to find out what is missing. If you're upstream, just add it. If you're a distribution packager, then try to work with upstream. Most upstreams are really kind and willing to help you. But if that doesn't work, and there really are people out there who don't get it, then you have to push back in some way. Maybe one option is to just say, okay, we're going to remove your package from our very popular distribution.
But remember, you have to be really reasonable about this. Use your common sense. Sometimes we have these problems that I mentioned earlier that maybe cannot be solved quickly or in an easy way, but then don't just say, okay, I'm throwing away all your work, because that is not helping anyone. In fact, that would create problems for end users, so don't do this.
Well, here I have an example of an executable. I don't have much time, but I'll go through it quickly. Here, too, there is much more than meets the eye. You think, oh, an executable: I have some source files in some programming language like C, I have a compiler, GCC, and it makes them into an executable. Simple, right? No. You have all kinds of things going on. In the top left corner are all the build scripts. So you have Automake and Autoconf. You have the source files for your configure script. You have source files for Automake. This all gets compiled by the autotools. But then you have the configure script itself.
That is run at build time for your actual program, and it produces the Makefile. And that one is then actually run to tell GCC to compile your C file. But that pulls in header files and libraries. Your C file, if you have translations, needs to be pre-processed to provide the .po file for gettext. Then you have to edit that by hand. And then finally you get the binary, which is read at runtime. You have icons, images, your user interface, maybe written in Glade, and fonts. Everything here is needed to make sure that your program runs and actually is useful. So next time you look at your own program or your own package, think about all these things.
So, some other things. Sometimes it's not really clear what is source and what is not. Sometimes things are their own source. For C, C++, and so on, the compiled languages, we are really sure what the source is. For scripts, if something is written in Python or Bash, we think the script is its own source. But sometimes these things are written by other programs. You have lexer and parser generators, and user interfaces, maybe created in Glade or Qt Creator, that in turn can produce C or C++ files. Sometimes a file is minified, like I already said. So, yeah, think about that.
Oh, I am already in the appendix. So, documentation. When you write a man page, it's mostly its own source. But you can also have it auto-generated. For example, with Perl, you have pod2man. You have programs like DocBook or Pandoc that can translate from one format to another. Info manuals are usually written in Texinfo, or they're also produced by DocBook. PDFs: nobody writes PDFs from scratch. You always use something else. HTML is another case where it can be its own source or it's generated by something else. And even if you write it yourself, then maybe it pulls in CSS files or other things.
Fonts were also a big discussion in Debian some time ago, mainly because a lot of packages included TrueType fonts.
But these fonts are usually created with FontForge, and the FontForge files were missing. But there are other strange things. One thing is, for example, that fonts can contain executable code. If you want to have a nice sharp font on a low-resolution display, then you want to make sure that the pixels are not blurry. So if you have anti-aliasing, you actually want the shape to be aligned to the pixels, because then it doesn't get blurred over multiple pixels. That is a tricky thing to do. The TrueType standard has so-called bytecode that you can include in a font, which at runtime tells your font rendering engine, oh, you have to shift something a little bit. This can be written by hand, well, probably in some programming language, or it can be automatically generated. If you get TrueType fonts from a commercial provider, usually they have written this code for you. But in the open source world, we have the ttfautohint package, which nowadays provides this bytecode.
For images, it's also very interesting. For example, you have lossy compressed images like JPEGs versus losslessly compressed images like PNGs. As I said before, be reasonable: source code doesn't mean that something has to be perfect. So a lossy image is perfectly fine as source code. There are actually people who make their paintings or drawings in a program like GIMP or Krita, and they just save them as a JPEG and then never continue editing them. So there is no real lossless source for some images.
If you take an image of some kind of scene with your camera, what is the source? Well, the source is not the scene out there. That's the analog world. We are not concerned with the analog world. We are only talking about bits and bytes. So whatever the camera produced, like a JPEG or RAW file, that is the source in that case.
You can do destructive and non-destructive editing.
So if you use GIMP and just draw some lines over some other lines, then once you save, the undo history is gone and you have lost the history of the whole thing. But that's okay. Of course, if you're working in something that stores all the information so that you can recover it, that is nicer.
Okay, sound and music. I think this is the last slide. You have the same as with images: you have lossy and lossless formats. If you have recorded audio, for example you go to a concert and you get permission to record the music they play, then of course the people who are actually playing are not the source. Again, the recorded audio is the source.
Module trackers, that's interesting. That's from the Amiga time. It's a nice format because it's its own source code, which is kind of special for music. But mostly nowadays people are making electronic music with software synthesizers. So you have some kind of music score. Then you have some samples or instruments that are generated electronically. And the whole setup of how everything is connected to each other and how filters are applied, for example, is in some kind of script file that describes how this is done. Then you compile your music, so to say. So everything that you need to do this is source.
Yeah, that was it. So I hope I made some people more aware of the issues around source code, and that you will think: what kind of source code did I miss? So do you have any questions?
Are there any other projects besides Debian, and to a lesser extent Fedora, that are actually making an issue out of this?
Well, I'm guessing you can go to the Free Software Foundation website, and there is a list of distributions that are completely free and try to have the source code of everything. But those are really small distributions most of the time. So I just mentioned the big ones, but I don't know off the top of my head who else is working on this. I think you were first.
So have you thought about the fact that this is about to get a lot worse with pre-trained machine learning models?
Sorry?
Pre-trained machine learning models, and the training datasets and training environments that will be generating those.
Are you talking about machine learning, neural networks, that kind of stuff?
Yeah.
Okay, so we have to repeat the question. What about when you have neural networks that have to be trained to perform some task, for example? Then what is the source code? That's very interesting. Maybe you have some kind of seed that you use to start your training process. Then the compiler is actually the learning algorithm, which runs for some time. And then it produces a neural network as output. That is the object code that is run at runtime to do interesting stuff for you.
That's huge, isn't it? Usually the material that trains neural networks is...
Yes, that can be huge. I would say the biggest problem is the CPU time you need for this, because even for simple things I know this can take lots of CPU hours to produce something that works.
Sorry. Thank you.
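The analogy from this last answer, where the seed and training data are the source, the learning algorithm is the compiler, and the trained network is the object code, can be made concrete with a toy. This is only a sketch under assumptions: the single perceptron, the logical-AND training set, and the names `train` and `predict` are all hypothetical choices made for illustration, and a real model would need far more data and CPU time than this one.

```python
import random

# Hypothetical "training data": the truth table of logical AND,
# with targets encoded as -1 (false) and +1 (true).
TRAINING_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]


def train(seed, data=TRAINING_DATA, rate=0.1, max_epochs=1000):
    """Run the 'compiler': a perceptron learning rule started from a fixed seed."""
    rng = random.Random(seed)
    # Two input weights plus a bias, randomly initialised from the seed.
    weights = [rng.uniform(-1, 1) for _ in range(3)]
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), target in data:
            inputs = (x1, x2, 1)
            activation = sum(w * x for w, x in zip(weights, inputs))
            if target * activation <= 0:  # misclassified: nudge the weights
                weights = [w + rate * target * x for w, x in zip(weights, inputs)]
                mistakes += 1
        if mistakes == 0:  # every sample is classified correctly
            break
    return weights


def predict(weights, x1, x2):
    """Run the 'object code': evaluate the trained network on one input."""
    return 1 if weights[0] * x1 + weights[1] * x2 + weights[2] > 0 else 0


model = train(seed=2017)
rebuilt = train(seed=2017)
print(model == rebuilt)  # same seed, data and algorithm give the same weights
print([predict(model, a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
```

The list of weights that `train` returns is not something anyone would edit by hand. To change the model's behaviour you would change the data, the seed, or the algorithm and rebuild, which is the sense in which those inputs, and not the weights, are the preferred form for modification.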