[Session chair:] Welcome Jos, with his talk on generating I/O code for file formats.
Good morning everyone. Welcome to the Sunday lecture. Yes, I will be lecturing; it being Sunday morning is just a coincidence. I had a bit of déjà vu just now, because about 10 years ago I also gave a talk right after Mirko when he was talking about ThreadWeaver. So that was kind of interesting, and it's good to see that the library is still going strong. But now we go to a very different topic, namely generating I/O code for file formats. And in my opinion, this is the main message of my lecture today: do not write the serialization code. Just don't do it. Why? Well, first let's talk about what serialization is. Everybody knows you have a file, which is a series of 1s and 0s, and you can read them and write them, and that's it. I think this connector is a bit loose. Yes, I should hold it. Try not to move. I'll be careful with the set. This turned from a sermon into a comedy act, I'm afraid. I steadied it with a phone; I hope that's good enough. So you have a file with a series of 1s and 0s, you read it into a runtime structure in your program, and when you write it out, you serialize the data into a file again. In other words, instead of saying you shouldn't write the serialization code, you could also say: do not write read-write code. But that's already a bit harder to parse, because there are two "writes" in there. So you can see that even for simple sentences, serialization can be tricky. But why shouldn't you do it? Well, first of all, there's security, and we'll see some examples of how security is problematic when you write your own I/O code. Efficiency can be higher if you do not write your own I/O code; it may sound counterintuitive, but I think it's true. The code will be more maintainable, more readable, more consistent. And on top of that, porting will be easier if you do not write your own I/O code. How can we achieve something like that?
Well, first of all, why would we want higher security? I think that almost all security issues today are related to I/O code: you're reading some file, there's a problem in your reading code, and there's a buffer overflow or something else going on, because somebody can give you a file which is malicious and exploits bugs in the handwritten I/O code. So yeah, there's a good reason to fix stuff. So what's a buffer overflow? We had a very famous buffer overflow a while back called Heartbleed, and there's a very simple XKCD explanation that basically goes: somebody asks the server, can you send me back the message "HAT", and by the way, "HAT" is 500 letters long. Obviously "HAT" is not 500 letters, but because the server didn't check that "HAT" was really 500 letters, it sent back 500 bytes from a different part of its memory. So the person talking to the server could read chunks of memory out of the server very, very easily. A huge, huge bug because of handwritten I/O code. Can we do better? Yes, of course we can, and there are basically two approaches. If you have a new format, a new protocol or just a new file format, you can use existing tools: Protobuf, D-Bus, XML, BSON (which is binary JSON), or CSV. I listed these in decreasing order of niceness: Protobuf actually maps quite well to runtime data structures, then there's D-Bus; XML does not really, and neither does JSON, and CSV doesn't map nicely at all. Actually, maybe that last one is not a good solution to put on the slide at all. Okay, but what if you have an existing file format or an existing protocol? You can't just choose a tool which maps onto it, because many file formats were designed by somebody who wanted to get the data into a small number of bytes and didn't necessarily think about how to actually read it back into your computer. And usually the format is specified so that it's readable for humans, and the human then has to write code to read this file format.
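The Heartbleed pattern above, trusting a length field that arrives over the wire, can be sketched in a few lines. This is a simplified illustration of the bug class, not OpenSSL's actual code; the `Request` type and function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// A heartbeat-style request: a claimed payload length plus the payload bytes
// actually received. The attacker controls both fields independently.
struct Request {
    uint16_t claimedLen;
    std::vector<char> payload;
};

// The bug pattern (shown disabled, since running it is undefined behavior):
// echoing back 'claimedLen' bytes without checking it against the bytes
// actually received reads past the buffer, leaking adjacent memory.
//
// std::string echoUnsafe(const Request& r) {
//     std::string reply(r.claimedLen, '\0');
//     std::memcpy(&reply[0], r.payload.data(), r.claimedLen); // oops
//     return reply;
// }

// The fix: validate the length field against the data actually present
// before copying anything.
std::string echoSafe(const Request& r) {
    if (r.claimedLen > r.payload.size())
        throw std::runtime_error("length field exceeds received data");
    return std::string(r.payload.data(), r.claimedLen);
}
```

Generated I/O code gets this check for free, because the generator emits the bounds check for every field, every time.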
I don't think that's a good idea. You can do better by writing a computer-readable specification, and I'm going to show how that works in the next slides, and then generate the code from that specification instead of writing it by hand. Your program will look like this: you have a file, and the file format has a specification. You have generated code which is fairly safe, at least much safer than handwritten code, and it therefore forms a security barrier in front of your application logic, which just deals with the runtime structure. So your application should be easier to write, easier to maintain, and safer by working in this way. Well, let's come to a concrete example: PowerPoint's binary format. Who is familiar with that format here? Nobody? Well, I can understand that, because why would you be interested in this? But it's a very, very complex and large file format. It has 500 different structures. It's documented in two huge PDF files of approximately 650 pages each, so more than 1,000 pages in total. And to be able to read that file format, you need a huge amount of code. So the documentation is extensive and quite detailed, but it's also not always in accordance with the files you find in the wild. That's partially serialization bugs, but there are also more programs than Microsoft Office trying to create PowerPoint files. There was a time when the documentation was not released and was actually guarded very well, and still people were trying to write PowerPoint files, but they were just guessing what should be in them. So yeah, there are files which are slightly different from the real format. Yes?
[Audience:] I've seen situations where the people who wrote the spec for the data format, and were the only people implementing that format, still didn't follow their own spec, and the spaghetti code written in Perl to digest it had to be adapted to reality.
Yeah, I'll just repeat this comment: the observation is that the people who wrote their own specification, and were the only people implementing that specification, still didn't always follow it. And yeah, I think that's just life. It happens, so you have to build in safeguards for that. Now, why am I talking about the PowerPoint binary format? Some years ago, Nokia wanted to have smartphones, and they wanted an office suite on their smartphones. KOffice, as it was called at the time (these days it's called Calligra), was a very nicely designed office suite, nice and small, and it should have been able to run on a phone like this. But it didn't have any support for PowerPoint files, and obviously that was a showstopper, so it had to be written from scratch. And I was involved in this project, so I was faced with all these huge PDFs and had to come up with a solution to actually read all those different structures. Here's an example of what this documentation looks like. This is a very basic part of the documentation, only a couple of bytes: the record header. Nearly all of the components in a PowerPoint file start with a header like this, so it's a very important one. It has a version number, which is four bits, then an instance number of 12 bits, then two bytes for the type of the record that follows the header, and then four bytes for the length of the record. So it's fairly simple, and I'm starting with it to show you how you can generate code and how you can document this in a computer-readable way. So this is what I came up with, very ad hoc, for this particular problem. I just had a bit of XML which said: I have a struct, the name is RecordHeader, and here are the four data components in this thing. And it has "un4", which means it's four bits. So fairly simple, and generating code for that is also very simple.
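The spec fragment described above looked roughly like this. This is a reconstruction from the description in the talk, not the exact XML dialect the tool used; member and attribute names are illustrative.

```xml
<struct name="RecordHeader">
  <member name="recVer"      type="un4"/>    <!-- 4 bits: version number -->
  <member name="recInstance" type="un12"/>   <!-- 12 bits: instance number -->
  <member name="recType"     type="uint16"/> <!-- 2 bytes: record type -->
  <member name="recLen"      type="uint32"/> <!-- 4 bytes: record length -->
</struct>
```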
So you have a class which inherits from StreamOffset, but that's just for debugging purposes, and it has data members which in this case take up more space than the data that's actually read, but that could be improved upon, actually, now that I think about it. The code to read all of this is below it: there's an input stream which can read at the bit level and just writes into the data members. Very simple. But the file format is very convoluted, so you can get more complex situations, and here is a mildly complex one. This is for putting PNG files in your PowerPoint. It starts with a record header, that's the "rh" at the top, and then it has rgbUid1 and rgbUid2, but the second one is optional. First complication: some things can be optional. And at the end it has the BLIP file data, and that's variable length. Now, how do I know if the optional thing is present or not? That depends on the data that comes before: there will be a bit set somewhere which identifies whether this component is present. And the length of the file data also depends on the data before it, but the relationship is quite complicated, in fact. If you look at recLen, the length of this record, it depends on some numbers somewhere in the record. You don't need to understand what they are; you just need to understand: weird situation, this is probably going to go wrong if I write it by hand. So, we can capture this in XML. The first record you already know; the second one basically puts limitations on what can be in the record. It says the first member of the struct OfficeArtBlip should be a record header, but it has to have a version number of zero, the instance can only be one of two numbers (the pipe in there means either this number or that number), and the type should be exactly this number. So when you're reading and the data doesn't fit these limitations, the parsing will simply stop and say: I don't understand this data.
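What the generated reader boils down to for the record header can be sketched like this. The field names follow the documentation as described above; the real generated code reads through a bit-level input stream class rather than raw pointers, so treat this as a sketch of the layout, not the actual Calligra code.

```cpp
#include <cassert>
#include <cstdint>

// The record header as described in the documentation: a 4-bit version,
// a 12-bit instance, a 2-byte type and a 4-byte length.
struct RecordHeader {
    uint8_t  recVer;      // 4 bits
    uint16_t recInstance; // 12 bits
    uint16_t recType;     // 2 bytes: type of the record that follows
    uint32_t recLen;      // 4 bytes: length of the record data
};

// Little-endian helpers (the format is little-endian throughout).
static uint16_t readU16(const uint8_t* p) {
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}
static uint32_t readU32(const uint8_t* p) {
    return p[0] | (p[1] << 8) | (p[2] << 16) | (uint32_t(p[3]) << 24);
}

// Parse a header from an 8-byte buffer. The first uint16 packs recVer in
// the low 4 bits and recInstance in the high 12 bits.
RecordHeader parseRecordHeader(const uint8_t* p) {
    RecordHeader h;
    uint16_t verAndInstance = readU16(p);
    h.recVer      = verAndInstance & 0x0F;
    h.recInstance = verAndInstance >> 4;
    h.recType     = readU16(p + 2);
    h.recLen      = readU32(p + 4);
    return h;
}
```

The point of generating this rather than writing it is that the bit offsets and masks come mechanically from the spec, so there is no opportunity to get a shift or mask wrong by hand.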
Another use for this is that if you have an array of objects of which you don't know the type, you can look at the value of the type field to see if a structure matches the data which follows. Then you see the rgbUid2 member: it has a condition on it, so it's only present if recInstance is equal to a magic number over there. If it's not, the parsing will simply skip it and continue reading the following data structure. And at the end we have a variable-length amount of data, and there you have a small formula which gives you the count of bytes in that byte array. It's a fairly complex expression, but still relatively readable, even though the structure is complex. It's certainly more readable than the code which is generated from it. This is the data structure, so that's still fairly simple; it's just a translation of the data to C++ with Qt. But the parsing code is getting quite long already. It's Sunday morning, so I don't expect you to read all of this; it's just to show you that you don't want to write this by hand all the time. And if you were to write this by hand, either you'd have many bugs or you'd have many unit tests. So, the PowerPoint binary format example: lots of different structures, and when we were done generating, we had 30,000 lines of code in Calligra just to read PowerPoint files. And these were generated from 6,000 lines of specification in XML. Now, even that specification can have bugs, but when you're debugging, at least it's fairly readable. And this is the kicker: the code to create those 30,000 lines of code is only 700 lines. That's a very doable number, right? We have this huge, complex file format, but there are only 700 lines of code which determine what is actually written and, sorry, how I'm actually reading my file.
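The optional-member and computed-count machinery described above looks roughly like this in generated form. The structure, the magic instance value `0x6E1` and the names are illustrative stand-ins, not the exact OfficeArtBlip definition from the documentation.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Minimal bounds-checked input stream, standing in for the generated
// reader's stream class.
struct SimpleStream {
    const uint8_t* p;
    size_t len;
    size_t pos = 0;
    std::vector<uint8_t> readBytes(size_t n) {
        if (n > len - pos) throw std::runtime_error("unexpected end of data");
        std::vector<uint8_t> v(p + pos, p + pos + n);
        pos += n;
        return v;
    }
};

struct Blip {
    std::vector<uint8_t> uid1;      // always present, 16 bytes
    std::vector<uint8_t> uid2;      // only present for one instance value
    std::vector<uint8_t> fileData;  // variable length: the rest of the record
};

// recInstance and recLen come from the record header parsed earlier.
// The condition on uid2 and the count formula for fileData mirror what the
// XML spec expresses declaratively: presence depends on earlier data, and
// the byte count is the total length minus the fixed members before it.
Blip parseBlip(SimpleStream& in, uint16_t recInstance, uint32_t recLen) {
    Blip b;
    b.uid1 = in.readBytes(16);
    uint32_t consumed = 16;
    if (recInstance == 0x6E1) {     // condition on data read before
        b.uid2 = in.readBytes(16);
        consumed += 16;
    }
    if (recLen < consumed) throw std::runtime_error("recLen too small");
    b.fileData = in.readBytes(recLen - consumed);  // computed count
    return b;
}
```

In the handwritten version, each of these `if`s and subtractions is a chance for a bug; in the generated version they are derived from one line of the spec.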
Now, if you go back to this generated code: the way I'm reading here, I'm using a particular type of buffering, I'm using a particular class to read the stream, and all of this could be suboptimal. I might think about improving it, but if this were handwritten code, improving it would take forever. However, since I generate the code, I just have a few places where I change a few things, and then I suddenly have very different code. By working in this way, you can have a very quick turnaround of ideas for your reading code. For example, I implemented a reader which doesn't do any allocation. It might be faster or it might not be; it might be convenient to work with or not. It's very easy to try. What I also implemented at the time was a conversion to XML. This was just ad hoc XML, which allowed me to take a binary file and quickly translate it into something I could read while I was debugging. And once you have that, you can also convert it to SVG and try to visualize your PowerPoint slide very quickly. It wouldn't be production code, but at least while developing you can work with it quickly. I also wrote a version that generated introspective code, so at every point I could list the number of entries of every record and see what was in there. That code was actually much more than 30,000 lines, over 100,000 lines when generated, and it wouldn't be efficient for Calligra at all, but it was very useful while translating this handwritten documentation for humans into the computer-readable XML file. And as a debugging tool, you can also do round-trip code generation: go from binary to XML and back, and check that you have the exact same binary. What that allows is downloading thousands of PowerPoint files from the web and just seeing if they all round-trip; if they don't round-trip, you have a bug somewhere.
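The round-trip check described above is simple to state in code: parse the bytes into a structure, serialize the structure again, and require bit-identical output. The `Record` type here is a toy stand-in for the generated structures, just to show the shape of the check.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy record: a little-endian uint16 type followed by raw data.
struct Record {
    uint16_t type;
    std::vector<uint8_t> data;
};

bool parse(const std::vector<uint8_t>& in, Record& out) {
    if (in.size() < 2) return false;
    out.type = static_cast<uint16_t>(in[0] | (in[1] << 8));
    out.data.assign(in.begin() + 2, in.end());
    return true;
}

std::vector<uint8_t> serialize(const Record& r) {
    std::vector<uint8_t> out;
    out.push_back(r.type & 0xFF);
    out.push_back(r.type >> 8);
    out.insert(out.end(), r.data.begin(), r.data.end());
    return out;
}

// True if the bytes survive parse -> serialize unchanged. Running this over
// thousands of real files turns every mismatch into a lead: a bug in the
// spec, or occasionally in the generator itself.
bool roundTrips(const std::vector<uint8_t>& bytes) {
    Record r;
    return parse(bytes, r) && serialize(r) == bytes;
}
```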
Probably you should fix the specification, or maybe there's a bug in the generator. And usually it was a bug in the specification, because the generator is very small. That's how we found out that there are actually things that are not documented properly, just bugs in the documentation, and we managed to work around them and fix the XML description of the format. So those are the advantages. Here I was using this in an ad hoc way for my particular problem, PowerPoint files. That's not always the optimal solution, but you could extend it. This library is available separately; it's quite old by now, but it could be extended to also read PDF files, or zip files, or MP3 files. Actually, PDF would be a fairly large undertaking; I tried it and didn't manage. Another use for a library like this is to convert files to RDF and perhaps use that in desktop search. That's also an unfinished project which I tried at some point. And yet another use would be the Document Liberation Project, which tries to document old file formats so that in the future we can still read them, even when we're not using C++ anymore, for example, or when the binaries don't run in any emulator. Like I said, this is ad hoc, and there are people doing this more seriously. One project was to have a specification for doing this, called the Data Format Description Language (DFDL). It's inspired by XML Schema, which does this kind of thing for XML. In 2011 it was published as version one, and you can parse files if you have a schema. The implementation is in Java, and it reuses the types from the XML Schema specification, so that's nice. There is, however, one big downside: this project has a very nice specification, but the implementation doesn't do too much. It doesn't do any code generation; it just does live inspection of documents.
So there's a Java program which reads the schema and can then also read a binary file, and that's what it does. Another, more recent approach: some people wrote a scientific paper and published some code, and they use a grammar like this as a specification format. They show that you can generate a parser for zip files, and they also implemented a fast and safe DNS server by writing down the DNS packet specification in this form. So that was a nice paper to read and a nice project. I'm not sure how far it will go, because it was academic, and academics, once they've proven a point, let everybody else figure out the rest. But a more commercially interesting approach was presented yesterday, no, the day before, by HicknHack Software. This was a lightning talk: they are working on, and have version one of, something called Protler. It's intended for network protocols, not file formats, but in principle it should be amenable to that as well. And Andreas, who's sitting there, can answer any questions if you're interested in how Protler works. Okay, so this was binary files, but what about XML? That should be safe, shouldn't it? Maybe. What about this SVG file? It's perfectly fine XML, but on load it has a handler which calls some JavaScript. So anything can happen, and if you just have a general XML parser and put this into your DOM tree, well, magic can happen, for somebody else perhaps. So XML is not always safe either; you still need to be careful with it. And writing XML by hand can also be problematic. This is similar to code I found a few weeks ago in an unnamed office application. It was a very serious bug: it was generating non-well-formed XML under some circumstances, because the if statement for the open tag and the if statement for the close tag were different. So yeah, I was very shocked when I saw this.
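The bug pattern described above can be reconstructed roughly like this; the code is hypothetical, not the actual application's, but it shows how two almost-matching conditions leave an element unclosed.

```cpp
#include <cassert>
#include <string>

// The bug pattern: the conditions guarding the opening and closing tag are
// not identical, so for some inputs an opened element is never closed and
// the output is not well-formed XML.
std::string writeSpanBuggy(const std::string& text, bool bold, bool italic) {
    std::string out;
    if (bold || italic)
        out += "<span>";
    out += text;
    if (bold)               // different condition: italic-only never closes
        out += "</span>";
    return out;
}

// The fix advocated in the talk: don't emit tags from ad hoc ifs at all.
// When open and close are produced in one place (ideally by generated code),
// they cannot get out of sync.
std::string writeSpanFixed(const std::string& text, bool emphasized) {
    if (!emphasized)
        return text;
    return "<span>" + text + "</span>";
}
```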
So I think in this case too you should just generate code that writes the entire XML, and you shouldn't be passing strings to start-element functions at all. In fact, here's some real-world data which I got from Italo Vignoli, who is also at the conference. He analyzed security vulnerabilities in LibreOffice and Microsoft Office, and there are lots of them, and they are mainly in the XML formats, not the binary format. Surely this big red bar is from the binary format? No, it's mostly the XML. The data comes from the CVE database; Italo read the descriptions of all the vulnerabilities and categorized them, and you see that the lower three categories are actually related to files. Well, the font one at the top is of course also related to files, but the lower three are the most interesting and also the largest ones. This was collected over three years. So yeah, over 100 vulnerabilities in three years; that's not too good. So for XML you can also do code generation. The most widely used tool is JAXB, for Java: it reads an XML schema and generates classes for you, and you can then use those classes. When I was working with the Microsoft Office schemas I was also using that, because I believe that's the way to go. For C++ there's also a library, called XSD, developed by Code Synthesis. It's available under the GPL, and it generates C++ from an XML schema. The last one I'm mentioning is libxml2, which can do validation. Almost nobody uses that in applications because it's too slow, but it does add safety, right? So if you're writing files, I think you should also hook libxml2 in at the end to do XML Schema validation. It should be fairly cheap if you just pipe the output through it; it does streaming validation. Sure, you need to have the parsed XML schema in memory, but apart from that I don't think it's too much overhead for a nice bit of added security. So, we're at the end of the sermon.
The screen held up for the rest of the talk, so: conclusions. I think you shouldn't write the serialization code directly; write a schema instead, and then, if you need to, write your own code generator, because then you can adapt your I/O code very quickly, you can move fast, and you have more secure code. The generator is short and heavily used, so there are not many bugs in there. I think in general it's a nicer way of working. It requires some setup at the start of your project, but when your project is meant to live a bit longer, this is a much more maintainable way of working. So: do not write the serialization code, unless you write a serialization tool. Oh wait, there's a parsing error there. Sorry, that was not intentional. Questions?
[Audience:] Did you publish your code for the PowerPoint stuff somewhere?
Yes. I like these questions.
[Audience:] Where?
Well, it's on GitLab, and it's also in the KDE Git repositories. So it's in a few places, but that's the nice thing about Git: it's decentralized. And the actual generated code is simply committed to Calligra directly, but there's a note saying: this code is generated; if you want to change it, download this tool, change the specification, re-run it and commit the result again.
[Audience:] What are the differences between XML Schema and DFDL?
DFDL is meant for binary files: if you have an existing file format, you can reverse engineer it, or describe it after the fact, with DFDL. XML Schema is meant for XML: it describes which XML elements are allowed, which attributes are allowed, and how you can nest them in each other. So they are similar in concept, but one is for XML and the other is for binary data.
[Audience:] I have a question. Assuming you already have a code base with classes you want to serialize, do you know any tool that will generate serialization code based on that code, along with some format it can then read and write?
No, I don't. It sounds very complicated to do that automatically.
I'm afraid you'll have to do it by hand.
[Audience:] So for DFDL, is there code to generate code from a DFDL description?
As far as I know, there isn't. A lot of work was put into having a large, interesting and good specification, but there's not a lot of code actually generating code from DFDL, unfortunately. But if you want to start documenting your own file format and then generate code from that, it might still be clever to use DFDL, because first of all you don't have to invent your own language, you can just reuse it. And there is a tool you can use for debugging: you can load a DFDL description in the Java tool that comes with it and then inspect files with it. So there is tooling available for debugging, just not for generating code yet.
[Audience:] Thanks for your talk. From what I have read, DFDL can only deserialize your binary format; you have no option to create a file.
Well, yes. Like I said, there's no code for generating code. If you have a description of a format, you can obviously create code to read and to write, because you have the description of what the data should look like. But since they only have a tool which does inspection of an existing file, that's only for reading, yes.
[Audience:] And did you write code that can also write the files?
Well, the use case for us was that we wanted to open PowerPoint files in Calligra, so we wanted to show them and be able to save them. But since we thought ODF was the better format, we converted them to ODF, and we have code to save ODF, but we don't have code to save to the PowerPoint format. So the short, genuine answer is no, but we do convert to ODF.
[Audience:] Just wanted to check I've understood: you were describing the tool for DFDL as debugging. Is that validation, checking that an existing file conforms to what you think the spec is?
Yes. Okay, sorry, we're kind of running late.
[Andreas:] Just to conclude: of what we've seen so far in the wild, we are the only ones who have tried to generate deserialization and serialization code from one specification for binary formats of all kinds, so you can also do text files and binary files combined. So far we are concentrating on network protocols, because they are shorter and the generated code is not so sophisticated or adapted to your specific application, because you can basically deserialize into a structure. For file formats it's quite easily possible, but we have not taken care of that yet; we're concentrating on really documenting it.
Yeah, thank you. That was a comment by Andreas, who wrote the Protler tool, which I'm very impressed by. So by all means go check it out, and I'm sure you're going to put the slides of your lightning talk online somewhere, or maybe they're in the QtCon description. Protler already exists, so if you don't want to write the code to generate your code yourself, you can use that. Check it out. Okay, I really have to cut it off now, because we are overrunning by almost 10 minutes.
[Session chair:] Thank you very much, Jos. Of course, if you have additional questions, grab him any time during the conference.