Hi, my name is José and I came here today to talk to you about a program I have been working on by myself for the last couple of years. I finally got it to do something useful, and I was so happy about it that I published it two or three weeks ago. So this is it: it's called poke, and it is a sort of editor for binary data for which you can describe the structure and then edit the data in terms of the abstractions you have defined. I know this is maybe not easy to grasp at first, which is why I am going to do a little demo. First of all, poke is not finished. You can already use it to do useful things, and actually it helps me a lot in my daily work, but it is work in progress. Work that you are, by the way, welcome to join; I will give you some pointers at the end of the talk in case you are interested in contributing. So, why write something like this? Well, this is an excerpt of one of the many, many, many scripts that I need in order to do my work. I work on the GNU toolchain. I am mainly a compiler hacker, so I work on GCC and on binutils: the linker, the assembler and whatnot. I find myself very often needing to vandalize ELF files, object files, libraries and executables so I can reproduce bugs in, for example, the linker. So I very often need to edit binary files that have some structure; for me, ELF files are very common. And in the past I used things like this: using objdump to get some information, such as the offset of the text section of an ELF file, parsing its output, somehow operating on it with a shell script, and then finally using the dd command to patch the object file or to get information from it. OK, this works. Yeah, sure it does. But it sucks. Why? I mean, look at it. It's crap, basically. It works, but it is fragile and it breaks very often.
Also, it is very specific, obviously. If I wanted to do something slightly different, I would need to write another script. Not good. So at some point I was like, OK, this is it. I'm not going to continue like this, because the number of scripts is increasing all the time, they are breaking all the time, and I am investing so much of my working time in maintaining my infrastructure scripts instead of doing real work. So back in 2017, during the summer, I decided: OK, enough. I'm going to write myself a binary editor, and it should be generic. And I did not know what I was getting into, because initially I thought: something simple, it should just work. It actually took a while. Initially it looked very easy. I want to be able to describe the structure of binary data, for example of ELF files, right? They have a header, they have relocations, they have this field and these other fields, things like that. Most of the data I want to describe is usually already described by some C library, some C header. So the way I describe the structure of the data should look like C structs; it should not be that much different. But of course I'm sure you all know that C is actually not very good when it comes to describing physical layouts of data, because everything is undefined, right? The C compiler can introduce padding, can introduce alignment, can reorder bit fields that are smaller than one octet, things like that. So it should be C structs plus something, right? Something extra. Then, after a month or a couple of months of thinking about it, I found some existing stuff.
One is called DataScript, by Godmar Back, who is a professor at some American university, I forget which one, and who wrote back in 2010 or 2011 a paper about something he called DataScript, which is actually very similar to what I wanted. But it was not that satisfactory, for some reasons I will talk about later. And then there is an aberration called 010 Editor, which is proprietary, which means that it is unusable and very bad for everyone's freedom, and we won't talk about it any more. So then I spent a long time asking myself: how can I have a description language that is flexible enough and at the same time allows me to edit data in a transparent way, and so on? You will see it working now. And finally I got something that makes sense and that I, in my general stupidity, am able to implement. So this is the program, this is what it looks like. Now, I have just told you what poke does in a very abstract way. Probably you are still like, OK, what? So it's demo time, all right? I'm going to use poke very fast, because we don't have time, to poke at a relocation in an ELF file, which corresponds to real stuff that I often have to do. OK, so I can use a poke that is not installed. So this is poke. Oh, sorry, first we need an ELF file. We create an ELF file, and this ELF file has a relocation. I'm using readelf, which is part of binutils; this is not poke yet. OK, so let's poke it. I just open it with poke, and what can I do? First, I can take a look. This is the dump command, which basically shows me the bits and bytes of whatever is in this ELF file. OK, nothing very exciting yet. But I keep talking about structured binary data. So what is the structure of this binary data? This is an ELF file. How can you define in poke the structure of the data you want to edit?
Well, using Poke with a capital P, which is a programming language, and which happens to be a full-fledged programming language in which you can describe data and operate on it. Of course I have already written a file for ELF, which is called elf.pk. The files containing Poke code I call pickles. And you can see here that in this language, Poke, you can define structs, right? You can define structs, you can define types, you can define things like that. I will explain this later; this is very fast, for the demo. So here there is a struct, which is Elf64_Ehdr. This is the structure of an ELF header, right? And you can see here that you don't only specify the different fields; you can also specify constraints. For example, this is a constraint, which is an arbitrary Poke expression, saying that the ELF magic number should look like that. So let's poke it. First, I have to load the ELF pickle. This basically parses the ELF description through poke. Now poke knows about those types. So, again, dump. Well, there should be an ELF header at the beginning of the file. How do I get it? Well, I map it. This weird thing with the #B is an offset, which is zero bytes; I will explain more about it later. And it gives me the value. Of course, I can put it in a variable. So ehdr is a struct variable that contains the ELF header that is at the beginning of the file. Of course, once I map a value and put it in a variable, I can access the different fields, and I can also update them. OK. What happens if I try to map an ELF header starting at byte one of the file instead of byte zero? Oops. I get a constraint violation exception. Why?
Because the constraints which are defined on this specific struct, which in this case is the ELF header, are not satisfied by the data at this offset in the file. So I get a constraint violation exception. OK, but I have a header. So what is our goal? To vandalize that relocation. How can I get to a relocation in an ELF file? I don't know how familiar you are with this format, but you have the ELF header, and in the ELF header you have a field called e_shoff, for section header offset, which contains the offset in the file of the beginning of the section header table. The section header table, as its name implies, is basically a sequence of section headers. How many of them? Well, that is also in the header; it's called e_shnum. So how can I get it from the file? I map. At what offset? Oh, sorry. What are the entities that I want to map here? Section headers; I have another struct definition for those here. So this basically says: map this number of section headers at this specific offset in the file. OK, what happened? ehdr. There you go. So this is an array, and I can also put it in a variable. The size of the array is this many bytes. It contains, OK, let's put it in decimal, 11 sections, right? So which section are we interested in? The one containing the relocations, right? How can we identify that section? Well, by the section flags or by the name, for example. How can we get the name of an ELF section? For example, let's pick the first section, or the seventh section. What is its name? Its name in an ELF file is not a real string. It is the offset of the string in a string table in the ELF file, and that table gives you the string.
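The two mappings used so far in the demo (a header with a constraint on its magic number, then `e_shnum` section headers at `e_shoff`) can be sketched in Python for readers who do not have poke at hand. This is not how poke implements it; the helper names `map_ehdr` and `map_shdrs` and the `ConstraintViolation` class are made up for this illustration, and the sketch assumes a little-endian ELF64 file.

```python
import struct


class ConstraintViolation(Exception):
    """Raised when the bytes at an offset do not satisfy a struct's constraints."""


# ELF64 header and section header layouts: little-endian, no padding.
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
EHDR_FIELDS = ("e_ident", "e_type", "e_machine", "e_version", "e_entry",
               "e_phoff", "e_shoff", "e_flags", "e_ehsize", "e_phentsize",
               "e_phnum", "e_shentsize", "e_shnum", "e_shstrndx")
SHDR = struct.Struct("<IIQQQQIIQQ")
SHDR_FIELDS = ("sh_name", "sh_type", "sh_flags", "sh_addr", "sh_offset",
               "sh_size", "sh_link", "sh_info", "sh_addralign", "sh_entsize")


def map_ehdr(data, offset=0):
    """Decode an ELF64 header at `offset`, checking the magic-number constraint."""
    ehdr = dict(zip(EHDR_FIELDS, EHDR.unpack_from(data, offset)))
    if ehdr["e_ident"][:4] != b"\x7fELF":
        raise ConstraintViolation("bad ELF magic at offset %d" % offset)
    return ehdr


def map_shdrs(data, ehdr):
    """Decode e_shnum section headers starting at e_shoff.

    Assumes each entry is SHDR.size (64) bytes, i.e. e_shentsize == 64.
    """
    base = ehdr["e_shoff"]
    return [dict(zip(SHDR_FIELDS, SHDR.unpack_from(data, base + i * SHDR.size)))
            for i in range(ehdr["e_shnum"])]
```

Calling `map_ehdr(data, 1)` on a valid file raises `ConstraintViolation`, which mirrors what happens in the demo when the header is mapped at byte one instead of byte zero.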
OK, you see that the format is not very nice, but this is the kind of stuff we usually have to work with. In ELF, and in many formats, it is a pain in the ass every time you need to know what some string is. Oh, it is in the string table. OK, in which string table? Well, usually it is pointed to by the header, blah, blah. That's why in poke you can also define functions, like this one here. This is a Poke function that, given an ELF header and an offset, looks for the string table and gives you the proper string. So, for example, you can call it like this. And, pop, this is the section we were looking for. This is not by chance; I have done this before, before the talk I mean. So this is the section we are interested in: we know that we are interested in shdr[7]. OK, fine. shdr[7] has a name, a type, flags. What do we want? The section headers in ELF files have a pointer, which is another offset, yes, it's always like this, to the contents of the section in the file. Where does it start? It starts at sh_offset. So what do we want to map at sh_offset? In this case we want to map relocations, because we know that this section contains relocations. There is a struct definition for relocations too; it's just five lines that you write to describe it. How many of them? Well, here you see one of the peculiarities that you usually find in object files and object formats. ELF does not tell you how many elements you have in a section. It tells you how much space the elements in the section occupy. Fortunately, poke allows you to map arrays not only by number of elements but also by size, and it does the right thing. So here you can pass this: sh_size. And here we have an array of one relocation, because we only have one relocation. So we can get at my relocation: it is in this array.
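As a rough Python analogue of these last steps, the sketch below looks up a name in a string table, decodes a relocation array bounded by size rather than by element count, and pokes a new addend into the buffer. The function names are hypothetical and the layout assumed is little-endian `Elf64_Rela` (three 8-byte fields); poke itself does this through its own types and map operator, not through anything like this code.

```python
import struct

# Elf64_Rela: r_offset (u64), r_info (u64), r_addend (s64); 24 bytes, little-endian.
RELA = struct.Struct("<QQq")


def get_string(data, strtab_offset, name_offset):
    """Return the NUL-terminated string at name_offset inside a string table."""
    start = strtab_offset + name_offset
    end = data.index(b"\0", start)
    return data[start:end].decode("ascii")


def map_relas_by_size(data, offset, size):
    """Decode as many relocations as fit in `size` bytes starting at `offset`."""
    if size % RELA.size != 0:
        raise ValueError("size is not a multiple of the relocation size")
    return [RELA.unpack_from(data, offset + i * RELA.size)
            for i in range(size // RELA.size)]


def poke_addend(buf, offset, index, addend):
    """Overwrite r_addend of the index-th relocation in a writable buffer."""
    where = offset + index * RELA.size
    r_offset, r_info, _ = RELA.unpack_from(buf, where)
    RELA.pack_into(buf, where, r_offset, r_info, addend)
```

Given the section header found above, `map_relas_by_size(buf, sh_offset, sh_size)` returns one tuple per relocation, and `poke_addend(buf, sh_offset, 0, 666)` corresponds to the edit made in the demo.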
And this is the relocation I want to vandalize. Let's do it. Let's put in an addend of 666. Done. We get out, we run readelf, and mission accomplished. All right? So this was the demo. This is what you can do with poke. Now you may say: OK, this was very stupid. OK, maybe it was stupid. But you can do something slightly different, or completely different, with a completely different object format by writing a pickle of 50 lines. And this saves a lot of time, at least for people like me. So, you saw here that I was using a pickle, loading it, and using a sort of language. The language is called Poke, right? And now I'm going to tell you very quickly about its different characteristics, but only the interesting ones: what makes it different from other programming languages. First, the language has support for values, like any other language. You can specify integers in different numeration bases. You have strings, which are null-terminated. You have arrays; you don't have multidimensional arrays, but you can have arrays of arrays. And you have structs, right? Nothing special here. But then, let's see the first characteristic that makes Poke special. When I designed this program, one of the first problems I found was: should I make it byte-oriented or bit-oriented? Option A: OK, I'm going to make it byte-oriented. Why? Because 99% of the object formats around are byte-oriented, right? So when it comes to specifying offsets and things like that, or sizes or whatever, it should be bytes. OK, fine. The cons of this approach: if you are one of the one percent who is unfortunate enough to have to implement deflate, for example, or any other bit-oriented format, then this program is not for you, I'm sorry. And I did not want that. Option B: OK, I wanted to make it general. But if I make it bit-oriented, it's going to be a real, real pain for 99% of the users, as you can imagine.
You know? I mean, you would get sick of multiplying by eight everywhere, right? So I was like: OK, bytes, bits, bytes, bits, bytes but not bits. I was going crazy. But fortunately, in Frankfurt, from time to time, I and some friends, we call ourselves the Rabbit Herd, do hacking weekends. And on one of those hacking weekends I told my friends: look, I have this problem. So we brainstormed, and we came up with an idea, which is united values. In Poke you have, like in any other normal programming language, pure magnitudes, like 23: not 23 watts, not 23 of anything, just 23. And you also have, only for what I call offsets in Poke (because usually you edit files, but they are also sizes, memory), united values, which are called offset values. So you can specify something like this: 8 bits, 23 bytes, two kilobytes. That was the initial idea. Now, this has many advantages, and it is actually a pretty concept. But then I thought: having a list of predefined units is limiting. So why not allow you to specify any arbitrary unit? So, for example, this is an offset of 8 units of 8 bits each, 8 bytes basically. And this is 2 units of 3 bits each. OK? But then I thought: why stop here? Why shouldn't you be able to specify offsets and sizes also in terms of your own types? So, for example, this is a value which is 23 packets, Packet being the struct you defined just before. Of course, this only works for structs in Poke whose size is known at compile time, because that's not always the case. But it's useful enough. So you can operate in terms of packets. And these are the operations that make up a little algebra of offsets. If you add an offset to another offset, what do you get? Another offset. If you multiply an offset by an integer, by a magnitude, you get another offset.
If you divide, you get a magnitude, right? It's like dividing meters by meters: what do you get? A pure magnitude. And you also have the remainder, the modulus, which is another offset, obviously. Offsets are also beautiful because they allow you to think in terms of units, like when you are doing physics, for example. So, for example, if you define this type Packet, which is a struct of an int and a long, like in the example in the slides, how many bytes are in one packet? Well, you divide. In one packet, how many bytes? Oops. Yeah, OK: 12 bytes. And in 23 packets, 276 bytes. In Poke, when you write a united value, you can omit the magnitude if it is one. So if I write it like this, what does it look like? It looks like doing physics or whatever, working with units as in your maths or your physics, right? I think it's quite cute and nice. And it also allows you to operate and do conversions without having to use sizeof and things like that. It's very nice. It also has a very nice side effect, which is that often in object formats and object files you have fields that give you an offset in the file, or the size of something else, in some specific unit. For example, this is in ELF: this is the size of the section pointed to by the section header. It happens that the size is in bytes. This is how you specify an offset type in Poke, right? If you were using C, Python or whatever to edit your object file, you would need to remember what the unit is every time before writing into it, and you would have to convert to bytes every time. So if you are working with relocations, for example, and you want to write the size of 10 relocations into this field from your C program or your Python program or whatever, then you have to remember the unit and you have to do the conversion.
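The little algebra of offsets described in the talk can be modelled with a small Python class. This is only an illustration of the arithmetic, not of poke's implementation: the `Offset` class is invented here, a unit is represented as a size in bits, and the choice to express sums and remainders in bits is an assumption of this sketch (poke's own rule for the unit of a result may differ). The 96-bit `PACKET` unit assumes the 32-bit int plus 64-bit long struct from the slides.

```python
class Offset:
    """A magnitude together with a unit, where the unit is a size in bits."""

    def __init__(self, magnitude, unit):
        self.magnitude = magnitude
        self.unit = unit

    @property
    def bits(self):
        return self.magnitude * self.unit

    def to(self, unit):
        """Convert to another unit; fails if the conversion is not exact."""
        if self.bits % unit != 0:
            raise ValueError("offset is not a whole number of the target unit")
        return Offset(self.bits // unit, unit)

    def __eq__(self, other):
        return isinstance(other, Offset) and self.bits == other.bits

    def __hash__(self):
        return hash(self.bits)

    def __add__(self, other):       # offset + offset -> offset (in bits here)
        return Offset(self.bits + other.bits, 1)

    def __mul__(self, n):           # offset * magnitude -> offset
        return Offset(self.magnitude * n, self.unit)

    __rmul__ = __mul__

    def __floordiv__(self, other):  # offset / offset -> pure magnitude
        return self.bits // other.bits

    def __mod__(self, other):       # offset % offset -> offset (in bits here)
        return Offset(self.bits % other.bits, 1)

    def __repr__(self):
        return "Offset(%d, unit=%d bits)" % (self.magnitude, self.unit)


BYTE = 8
PACKET = 32 + 64  # a struct of a 32-bit int and a 64-bit long, as on the slides

assert Offset(1, PACKET) // Offset(1, BYTE) == 12    # bytes in one packet
assert Offset(23, PACKET) // Offset(1, BYTE) == 276  # bytes in 23 packets
```

Because every `Offset` carries its unit, assigning `Offset(10, RELA_BITS).to(BYTE)` to a byte-sized field is an explicit, checked conversion here; in poke the conversion on assignment is done for you, as the talk goes on to explain.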
In Poke you can just assign, because Poke knows the unit of the destination of the assignment and also of the source. So it will do the conversion for you. So this was the first thing in the language that is different from what you have probably seen until now. Of course, those were the values. Now, Poke has types. It has integral types, for signed integers and unsigned integers. Here is another thing that is different in Poke compared to most other programming languages. In other programming languages you normally have integers which by default are, I don't know, maybe 32 bits, right? Or maybe 64 bits or whatever. In Poke an integer can be of any number of bits from 1 to 64, and I actually plan to expand that to any number of bits. And I'm not talking about bitmasks or anything like that. I'm talking about proper values of 7 bits, or 3 bits, 1 bit, 5 bits, whatever. And you can operate on them accordingly. You can define those types using this syntax, which I think is pretty readable. Then you have offset types, and an offset is a proper value in Poke. Offsets are first-class citizens in Poke. And then there is one string type, because there is only one kind of string. And of course you also have composite types, which are what you use to define the structure of the data you want to use. Arrays are tricky in Poke. To be honest, when I started I always thought: OK, arrays will be easy, and it is structs that are going to be painful to design and implement. Structs were easy. Arrays were the complicated ones, surprisingly. Basically, in Poke you have three kinds of array types in the language. You have unbounded arrays, which are arrays of an undefined number of, for example, integers, like in this example. Then you have arrays bounded by number of elements, which can be constant, like two integers, or can be variable. Poke is a lexically scoped, block-oriented language, so it has closures and you can do all sorts of unspeakable things with it.
So the bound can be variable. Or, as we saw in the demo when editing the ELF file, you can also bound an array type by size. So, for example, an array of this type can contain the same number of integers as an array of that type, which is two, right? But this one is bounded by size and that one is bounded by number of elements, and we will see in one of the next slides that this has an impact when you map it. And then of course there is the queen of Poke, the struct, all right? Which is what you use to actually define your data structures. I will go very fast with this, and I'm very sorry, but I have no time. First, OK, a packet. This packet consists, in a file or in memory, of a byte which is a magic number, an unsigned integer of 32 bits which is the length of what follows, and then an array of bytes, of that length, which is the data. You see that you can use fields that have been read before to define the data that comes after them. This would be the typical definition of a variable-length packet, for example. Also, although this is not implemented yet, you can pass arguments to structs, because a struct is also a closure in Poke. You can actually also define variables and functions inside of it. And you can pass arguments, which is sometimes useful. Also, it is very typical in object files and object formats that the structs have holes in them. A typical example: there are files which have two headers, one at the beginning of the file and one at the end. Or you have a header and then an offset to whatever other data. Think, for example, of a Poke struct for an ext2 file system, which is a header, or a superblock, that points to another block, which in turn points to other blocks. It is sparse; there are holes in it, right?
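The variable-length packet just described (a magic byte, a 32-bit length, then that many bytes of data) shows how a field decoded earlier determines the layout of what follows. A minimal Python sketch of that decoding order is given below. It is an illustration only: the function name `decode_packet` is made up, and little-endian encoding of the length field is an assumption, since the talk does not state the byte order.

```python
import struct


def decode_packet(data, offset=0):
    """Decode one packet and return (packet, offset just past the packet).

    Assumed layout: magic (1 byte), length (u32, little-endian),
    data (`length` bytes).
    """
    magic = data[offset]
    (length,) = struct.unpack_from("<I", data, offset + 1)
    start = offset + 5
    payload = bytes(data[start:start + length])
    if len(payload) != length:
        raise EOFError("packet data is truncated")
    return {"magic": magic, "length": length, "data": payload}, start + length
```

Because the function returns the end offset, consecutive packets can be decoded by feeding that offset back in as the next starting point.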
I'm sorry, I don't have a pointer. In Poke, you can specify the offset of a field using what I call a label. And note that here you can put any expression that evaluates to an offset, right? And this is how I fix this problem. And this offset, for example, is in this case part of the struct itself. You see? It is very flexible. I know this syntax is very bad. But this is one of the syntax warts that I will fix as soon as I can get rid of the Bison parser I'm using at the moment and use a recursive descent parser written by hand. Because I want to use the normal C syntax for a label, a prefix and a colon. But because of syntax issues with LALR grammars, that is not an option right now. Also, you can have pinned structs, which are exactly what we understand by unions in C. I call them pinned because it's as if the fields inside were pinned to the same spot. Why? Because in a Poke struct, usually, this field starts immediately after the first one, right? But if you define the struct to be pinned, the different fields start at the same offset in the IO space. The IO space is what I call the file that you are editing. So it's like a C union, right? This example is also from ELF, and it tells you that you have an st_info which you can interpret either as a 32-bit integer or as a 28-bit st_bind followed by a 4-bit st_type. All right? It's like a C union. And then you may ask: why did you not call it a union? Because Poke has union types too. What is a Poke union? This is a concept I got from DataScript, and I really love it. Basically, you have seen that a Poke struct definition is the specification of a decoding process. If you look at it from an abstract point of view, poke takes the normal process of decoding, computing with the data and encoding back, and abstracts you from the encoding and the decoding so you can focus on the computing.
So when you write a Poke struct, you are basically teaching poke, in a sort of declarative way, how to decode the data, right? The unions give you conditionals. In a struct type you can use constraints: for every field you can specify an arbitrary Poke expression, which can contain calls to functions and whatnot, anything you can imagine, also mappings (although that is very obscure and I am not sure I want to get into it much yet), to specify constraints associated with the field. So how do you do conditionals in your data structures in Poke? With unions. In unions you have different fields too, like in this one. So how does it work? Poke will try to decode every alternative in the union, starting from the first one, and the first alternative for which no constraint is violated is chosen. And this is recursive. I mean, the constraint does not have to be directly on the field of the union. The union can contain a struct, which can contain a struct, and so on. Any constraint that fails invalidates that alternative of the union. So here you have an example, which is from the tag format in MP3 files: the artist, the name of the song and things like that. It uses this format. Here, for example, you have an ID for this frame, which is four chars, the first of which cannot be zero. Then you have a size, and what comes next depends. If the first byte here equals 'T', then what comes next depends on the value of size. If it is bigger than one, there are two fields: this one, and then an array of size minus one. Otherwise there is an array of chars of size size, which is called frame_data. If id[0] is not 'T', then there comes an array of size chars, frame_data. Now you may wonder how this is different from this. Well, it is different because this one happens only if id[0] is not 'T', and this one happens if id[0] is 'T' and size is not bigger than one.
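The decoding rule for unions (try each alternative in order and keep the first one whose constraints all hold) can be sketched in Python as follows. Everything here is illustrative: `decode_union`, `decode_frame` and the alternative names are invented, the frame layout is a simplification of the one on the slide (a 4-byte id, a size assumed to be a big-endian 32-bit integer, then the body), and it is not meant as a faithful decoder for real MP3 tags.

```python
import struct


class ConstraintViolation(Exception):
    """Raised by an alternative whose constraints do not hold."""


def decode_union(alternatives, data, offset):
    """Return (name, value) for the first alternative that decodes cleanly."""
    for name, decode in alternatives:
        try:
            return name, decode(data, offset)
        except ConstraintViolation:
            continue  # a failed constraint only rules out this alternative
    raise ConstraintViolation("no alternative of the union matched")


def decode_frame(data, offset=0):
    frame_id = bytes(data[offset:offset + 4])
    if len(frame_id) != 4 or frame_id[0] == 0:
        raise ConstraintViolation("id[0] must not be zero")
    (size,) = struct.unpack_from(">I", data, offset + 4)
    is_text = frame_id[0] == ord("T")

    def text_with_leading_byte(d, o):
        # Chosen when id[0] == 'T' and size > 1: one byte, then size - 1 bytes.
        if not (is_text and size > 1):
            raise ConstraintViolation("not a text frame with size > 1")
        return {"first": d[o], "frame_data": bytes(d[o + 1:o + size])}

    def text_plain(d, o):
        # Reached only if the alternative above failed, so size <= 1 here.
        if not is_text:
            raise ConstraintViolation("not a text frame")
        return {"frame_data": bytes(d[o:o + size])}

    def other(d, o):
        # No constraint: the fallback when id[0] is not 'T'.
        return {"frame_data": bytes(d[o:o + size])}

    kind, value = decode_union(
        [("text_with_leading_byte", text_with_leading_byte),
         ("text_plain", text_plain),
         ("other", other)],
        data, offset + 8)
    return {"id": frame_id, "size": size, "kind": kind, **value}
```

The order of the alternatives carries the meaning: `text_plain` does not need to test `size <= 1` itself, because it is only tried after the first alternative has been rejected.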
I know it takes a little while to get used to those unions, but when you do, it's quite nice. OK, Poke supports polymorphic types, so you can write generic code. It is lexically scoped, you have variables and whatnot. And then there is mapping, and it is worth spending time on this. So I was saying: OK, I map this, I map that. In Poke you have variables, like this var a, for example. This is an array of three elements, and you can access them, right? But if I open a file, I can also map three integers at this offset. So b is also an array of three integers, and a is an array of three integers. So what is the difference? Well, the difference is that a is not mapped and b is mapped. So if I say a[1] = 10, OK, I change the value of the second element of a. But if I say b[1] = 10, I change it in the variable as well, but I also change it in the IO space. So it has a side effect. So b has an offset, and a does not have an offset. So, for example, OK, this is it, right? The central idea of Poke is that you should be able to work with normal, unmapped values and with mapped values transparently. So you can write, and you can actually do it, I don't know, a function that sorts relocations, and you can sort an array of relocations in memory, in a normal variable; but if that variable is mapped to some file or some memory, it will also sort it in the back end. This sounds very simple, but it is what took me months to actually get right, because it was schizophrenic: OK, what is mapped? The value or the type? No, it is the type. No, it is the variable. No, it is the value. No, it is the variable. No, no, no. And I think I got it right. It is values which are mapped or not, and only complex values. So this is the mapping that I have been talking about, right? You use the map operator, which looks like that. Now, in Poke you have functions, right? It is, again, lexically scoped, which is nice.
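The difference between a mapped and an unmapped value can be illustrated with a small Python class in which assignments to a mapped array are also written through to the underlying buffer. This is a toy model under stated assumptions, not poke's mechanism: the `IntArray` class and `sort_in_place` function are invented here, elements are assumed to be 32-bit little-endian signed integers, and a `bytearray` stands in for the IO space.

```python
import struct


class IntArray:
    """An array of 32-bit ints that may or may not be mapped in a buffer."""

    def __init__(self, values, io=None, offset=None):
        self._values = list(values)
        self._io = io          # the "IO space"; None for an unmapped value
        self._offset = offset  # byte offset of element 0 in the IO space

    @classmethod
    def map(cls, io, offset, count):
        """Build a mapped array from `count` ints stored at `offset`."""
        values = struct.unpack_from("<%di" % count, io, offset)
        return cls(values, io, offset)

    @property
    def mapped(self):
        return self._io is not None

    def __len__(self):
        return len(self._values)

    def __getitem__(self, i):
        return self._values[i]

    def __setitem__(self, i, value):
        self._values[i] = value
        if self.mapped:  # side effect: also update the IO space
            struct.pack_into("<i", self._io, self._offset + 4 * i, value)


def sort_in_place(arr):
    """Works the same on mapped and unmapped arrays."""
    ordered = sorted(arr[j] for j in range(len(arr)))
    for i, value in enumerate(ordered):
        arr[i] = value
```

With `io = bytearray(struct.pack("<3i", 3, 1, 2))`, calling `sort_in_place(IntArray.map(io, 0, 3))` reorders the bytes in `io`, while calling it on `IntArray([3, 1, 2])` touches nothing but the in-memory values. The same function serves both cases, which is the transparency the talk describes.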
Functions support optional arguments. They support variable-length argument lists too, as an array of any. Also, I am a huge fan of Algol 68, and one of the things I like most about Algol 68 is that if a function takes no arguments you can use it in the same way as if it were a sort of variable, which I like a lot. And that was the language. Now, I wanted to tell you how it works internally, but unfortunately this is going to be super fast. So this is the architecture of the application. You have a command part, which is readline and all the boring things. Then you have a compiler, right? It actually compiles Poke into code for a virtual machine, which is the Poke virtual machine. And it is the virtual machine that, through its instructions, accesses the IO space, in this case the file. Here is the structure of the compiler. On the right you have the disassembly of the Poke virtual machine instructions, it is a stack machine, because I love stack machines, for the expression that you see there. Please ignore the prologue and the epilogue. Those are the different passes and phases of the compiler. It does constant folding and some optimizations. I mean, it's not a toy. It's actually a very cute compiler; I invite you to look at it in the source code. I made myself a macro assembler so as not to go crazy while writing the runtime and the code generator. Actually, I wrote an assembler in AWK so I can write the runtime of Poke, and also the code generator routines, in proper assembly like this. Then the IO subsystem basically abstracts what you are editing from a space of bytes into a space of IO objects, which can be integers, strings or whatever. The important detail here is that it doesn't have to be a file. It could be a process whose memory you access using ptrace or whatever. It could be a file system, because maybe you want to edit your ext2 or whatever, right? It doesn't matter.
Anything that can be addressed by bytes can be poked. Then, poke is extensible. Why? Because you can extend the application in Poke. So, for example, you saw, this is a syntax we can feel very proud of, but I don't have time to explain it to you. Anyway, the dump command I have been using is basically written as one Poke function, right? You will see that the arguments all have a default value. So, for example, you can say dump with a from offset, a size and so on; you can pass arguments like that. And to define a new command with those arguments, basically, you write a Poke function. What happened here? Yeah, there you go. A Poke function here, all right? So that's all you have to do to extend it. Now, pickles are Poke source files containing a collection of related goodies: type definitions, functions, whatnot. Like ELF. I have a test suite; it needs more tests, and it sort of works. This is reassuring, right? I hope it is. So what works? Most of what I have told you about today is working in poke. There is only one kind of IO device, which is files and things like that. But we need more commands, and there is also a lot to do before a first release can be made. Supporting unions, for example, which is work in progress. Support for sets, for enumerations, bitmasks and things like that. More control statements in the language; I have a for loop. But it is so cheap to add control statements that, well, I don't know. And then, after the first release, this is a list of big projects I want to do. But there is still a lot of work to do. So those are the links for the project if you want to contribute. I have a home page, a mailing list, and we also have an IRC channel on Freenode. And if you think that poke could be useful for you and you want to have fun, please see the HACKING file in the source tree, because it contains a lot of information that could be useful for you.
All right? So, done. In time, yes? Yes. I have a question. We have six minutes for questions. How do you handle padding and bit endianness? Endianness: at the moment you can basically set the endianness to little, big or host. I want to add a function that you can put in a struct constraint and that will change the endianness at run time while it is decoding. But it's not implemented yet. And regarding padding, there is no padding. I mean, in Poke, if you have an integer which is 7 bits, it's 7 bits. In that sense it is bit-oriented. It doesn't pad, it doesn't align. What you describe is what you poke. Ah, there you go. Yeah, yeah. OK. Did you take a look at Kaitai Struct already? Do you know this project? Kaitai is K-A-I-T-A-I. OK, no. It's an open source program as well, but it's meant to work completely the other way around. You define some binary structures, and it's meant to compile them into a reader, so a parser, in several languages. You could take their definitions and convert them to yours as well; we have a big repository of binary file definitions. OK, no, but I will take a look, because I want to translate all that to Poke. Yeah, very nice, yes. But it's oriented to writing encoders and decoders, right? Decoders only. Encoders, OK. Is it possible to invoke any of these things as a library, say if you needed to parse files and use the pickle definitions that are available with poke? It will be, yes. Actually, poke started as an editor; that's the main idea. But right now I really think that it could be a very nice foundation for writing prototypes or binary utilities. One of the things that I'm going to write soon is a proper diff and patch for structured binary data, right? Based on Poke descriptions, yes. More questions? Just to make sure that I got it right: would it be a tool where you can just define a binary schema once and then use it for writing and reading binary files?
So for all that you do with your files, you could use poke? From the Poke language in this case, yes. I mean, many people are also asking me: hey, can I have a Poke-to-C translator? Can I have poke generate a decoder, like this other tool is doing, or an encoder? Yeah, why not. We could have a pickle, written in Poke itself, that writes that out for you. Yeah, sure. That would be welcome, actually; it would be useful. We still have time for one question. OK, well, thank you. Ciao.