Hi, thanks for joining our presentation today. It is about a new feature that we have added to the GNU toolchain. The name is CTF, which stands for Compact Type Format, or Compact C Type Format, and it is a debugging information format, if you're familiar with those, like DWARF, stabs, and COFF. We have two presenters. I'm just doing the introduction, and Nick Alcock, who is the main engineer on this work, will explain all the details. This talk is a follow-up, as you see from the title, Progress Report. It follows a talk that I gave last year, in 2019, at OSS Europe in Lyon, about how we had started this work and what had been contributed upstream at that point. So this is a perfect opportunity to show you what else has been done to introduce this new feature in the GNU toolchain to support this compact type of debugging information. I will leave it to Nick to continue, and we'll see you for the Q&A at the end. Thank you very much. Nick, you can go.
Okay. So the contents are fairly obvious. I basically plan to dive straight into the file format and give an overview of it, because the file format is what makes this thing worth using. There is a library which can be used to access the file format, so you don't need to do all the work yourself. I'll go into that a bit and describe the new functionality we've added in the last year. By the time this presentation actually airs, I hope we will have everything in place for it to be completely usable by everybody. The last pieces are going in now.
What is this? It's a model of the C type system, or of a single scope of the C type system. If you think of a single translation unit, CTF can record the types of, for example, all global variables, or all global types appearing in a single translation unit. It does not have scopes. It is a single global scope of some C file.
It can also map anything in the ELF symbol table, any function or any global data object, to a type, and to all the types that relate to that type, and so on and so forth. It's generated by GCC: you pass one extra flag and it spits it out. The linker deduplicates it whenever it's present and emits a generally much smaller CTF section in the output. And GDB can use it, or once the patch is upstream it will be able to use it, to look up types and so on. objdump and readelf can dump it as well.
It is small. DWARF is known for being rather large. This is about 5% of the size, and sometimes smaller. It will get smaller as well, because we haven't actually focused on size reductions yet. All the types are deduplicated, so the output is often much smaller than the input. The spec is here, and there's a link to the API at the end of the slide. All these links are also in the references at the end.
Here's an example of sizes, using a few important programs. All the sizes here are the size of the CTF section alone. The largest single input .o size is the size of the CTF section in the .o file which had the largest CTF section in it. You can see that on the input we don't really bother to deduplicate very much. The priority is getting the compiler to output it fast. So 50 meg on the input turns into 50k on the output, which is not bad, and adds half a second to the link time, which is much less time than it would add if you were emitting DWARF, because DWARF is so voluminous. Emacs is an interesting case because it's got a whole bunch of types with the same name and different definitions scattered throughout itself. We can handle that. Now the compressed size is a bit bigger, but not much. The time is noticeably larger because it's got lots of translation units, and at the moment the linker is not [unclear in recording].
Usage: I've really already gone through this. You compile with -gt. It spits stuff out.
The patch to cause the compiler to emit CTF is not upstream yet. It needs to be reworked. It is being reworked. It will quite possibly be reworked by the time this presentation airs. Here's hoping. LD picks all of these CTF sections up, deduplicates them, and emits one in the output. In theory this should work for everything, not just ELF, but that requires a bit more work in LD. At the moment it works in ELF, and that's enough for now. I do want it to work with PE so that Windows programs could benefit, because why not, if it's not much work? GDB can use it, or again, the patch isn't upstream, but it will be upstream fairly soon. We don't really think we can upstream it until the compiler support is upstream. LD and GDB use the same library to do all the CTF work. There are only a few lines of code outside that library in LD, and a few hundred lines in GDB's case. The compiler doesn't use that library. I think the garbage collector in GCC gets in the way, or something like that. It was easier not to use it.
We might as well dive into the file format. I just need to drag this window out of the way. Sorry. It's described in the ctf.h header that binutils installs. It's pretty simple. It's a header and then a bunch of sections. It's an annoying name, because the CTF is itself a section inside an ELF object. We might rename CTF sections to something else simply to make this less confusing, but for now, sections they are. The format is changeable. We always guarantee that we can read old formats, but we might bump the file format version occasionally to increase compactness, or add support for new languages, or something like that, because this may not always be C only. The version is three at the moment. We can still read two and one. Version one is basically the same CTF that Solaris uses, but not quite. I do plan to add support for reading those. V4 is being planned.
All planning for V4 will be included in boxes like this one. I'm going to have hints about what that is throughout the talk.
There are flags in the header. There's only one flag at the moment. It says this is compressed with zlib. We compress big enough CTF sections with zlib. For small ones there isn't really any benefit. Virtually everything you'll ever see is compressed. In fact, the file format is optimised for best compression. We only add things if we think they will compress well.
What are the sections? We may as well go through these in order, ignoring the unused ones. The first two are the data object and function info sections. These, put together, have one entry for every object and function symbol in the ELF symbol table. Each entry is simply a type ID in the type section. It used to be much more complicated, but we simplified it recently. This means that, given a symbol number, you can immediately ask CTF: what is the type of this variable? What is the type signature of this function? You get it back, and then you can wander around the structures in the function or whatever.
I can probably go into this later, but before the linker runs, we don't know the order of the symbols, and we don't even know which symbols will exist in the output. We need to be able to communicate between the compiler and the linker. For that, we have an index section for each of the data object and function info sections. Rather than working in symbol table order, you use the index section to assign a name to each of these entries, and then you can just look up a symbol by name. In practice, you'll probably very rarely see this after linking has happened. The libctf API completely abstracts this away, so you never need to pay attention to it. But the option is there. There is also a variable info section.
This will probably go away in V4, being replaced with a data object section. It simply lets you say: I've got all these data objects which don't have symbols, and I want to know what types they are. Most people aren't going to want this. You need your own symbol resolver, because all you've got is a name and a type. What looks up the name? With no symbol table entry, you wouldn't have an address or anything. It is sometimes useful for things like kernels, which have their own symbol resolver. It's not emitted by default. You need to pass a special flag to the linker, --ctf-variables, to turn it on. But the option is there.
Then there's the type section, which records all the types in a long array. And then the string table. The string table is shared with the ELF string table. Any strings which exist in the ELF string table are eliminated from the one in the CTF file to save space. References to strings in the rest of the CTF file can refer to either the ELF string table or this one. We sort it, because that pushes compression efficiency up a bit. We tried clever tricks. They didn't actually save space after compression. Modern compressors are very clever.
So what does the type section look like? This is the most important section in the file, obviously. In the dictionary, I should say. It's just an array. The length of each array entry is variable. Each entry includes its own length, but does not include any kind of identifier for the type. When you open the file, you have to walk through the array and associate an offset with each array element, so you can tell which type is which. They refer to each other by these IDs. So the IDs are important, but they are not recorded in the file.
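To make the implicit type IDs concrete, here is a minimal sketch of the one-time walk over a variable-length type array. The record layout is a deliberately simplified assumption, not the real ctf.h structures: `rec_header_sketch`, `index_types`, and the rule that IDs start at 1 are all hypothetical names and choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified record header: NOT the real ctf.h layout.
   Each record is a 4-byte header giving the number of 4-byte payload
   words that follow it.  No type ID is stored anywhere. */
typedef struct {
  uint32_t payload_words;
} rec_header_sketch;

/* Walk the section once, recording the byte offset of every record.
   A record's position in offsets[] plus one is its implicit type ID
   (this sketch assumes ID 0 means "no type").
   Returns the number of records, or 0 if the section is empty,
   malformed, or holds more than max_ids records. */
static size_t index_types(const unsigned char *sec, size_t sec_len,
                          size_t *offsets, size_t max_ids)
{
  size_t pos = 0, n = 0;

  assert(sec != NULL && offsets != NULL);
  while (pos < sec_len) {
    rec_header_sketch h;
    size_t remaining = sec_len - pos;

    if (remaining < sizeof h || n == max_ids)
      return 0;
    memcpy(&h, sec + pos, sizeof h);
    if (h.payload_words > (remaining - sizeof h) / 4)
      return 0;                      /* record runs off the end */
    offsets[n++] = pos;              /* implicit type ID is n */
    pos += sizeof h + (size_t) h.payload_words * 4;
  }
  return n;
}
```

After the walk, looking up type ID `k` is just `sec + offsets[k - 1]`, so the file never has to spend bytes on the IDs themselves.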
The IDs are left out simply because, if you recorded the ID with every entry, every ID would be its own symbol to the compressor, and having entries be self-describing significantly increases the size of the file after compression. So we intentionally avoid it. This is a significant difference from DWARF, which goes to great lengths to self-describe everything.
The type section is very simple. Each entry looks roughly like this. It's got the name of the type, if there is one, or zero if there isn't; an info word, which describes what sort of thing the type is; and either the size of the type or some other type this type refers to. You can't have both at once: you use one or the other.
The info word is crucial. It's got three entries in it at the moment. This is probably going to change in V4. First, the kind: what sort of thing is it meant to be? Is it an integer, or a pointer, or a struct? These all have different representations inside CTF. Second, is it visible to users? Some types aren't. Sometimes you can't look up a type by name. It's very rare, but it can happen. If types have root turned off, you can have duplicates with the same name and they don't conflict. And vlen is the remaining length of this entry, the distance to the next array element. It's not the size of the type. It's the size of the description of the type. V4 will probably add a more compact representation for small types and for types that refer to types early in the type graph, like int, because there are often very many of those.
This is followed by the variable-length data. There might be none, as for pointers. There might be a little fixed-size structure: arrays just have a thing saying the array is this length and its elements are of this type. And then there are things that have an actually variable amount of data, like structures. For structs and unions, vlen is the number of members. For enums, vlen is the number of enumerators, that sort of thing.
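The packing of kind, root flag, and vlen into one info word can be sketched as below. The bit positions, the mask, and the kind numbers are assumptions chosen for illustration; the authoritative layout is whatever the installed ctf.h defines, and the talk notes it may change in V4.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed split of the 32-bit info word, for this sketch only:
   bits 31..26 = kind, bit 25 = root-visible flag, bits 24..0 = vlen. */
enum { KIND_SHIFT = 26, ROOT_SHIFT = 25 };
#define VLEN_MASK 0x01ffffffu

/* Illustrative kind numbers; treat them as hypothetical. */
enum kind_sketch { KIND_INTEGER = 1, KIND_POINTER = 3, KIND_STRUCT = 6 };

/* Build an info word from its three fields. */
static uint32_t info_pack(uint32_t kind, int is_root, uint32_t vlen)
{
  assert(kind < 64 && vlen <= VLEN_MASK);
  return (kind << KIND_SHIFT)
         | ((is_root ? 1u : 0u) << ROOT_SHIFT)
         | vlen;
}

/* Extract each field again. */
static uint32_t info_kind(uint32_t info)    { return info >> KIND_SHIFT; }
static int      info_is_root(uint32_t info) { return (int) ((info >> ROOT_SHIFT) & 1u); }
static uint32_t info_vlen(uint32_t info)    { return info & VLEN_MASK; }
```

Under these assumptions a root-visible struct with two members would carry `info_pack(KIND_STRUCT, 1, 2)`, and a reader recovers the member count with `info_vlen` before stepping over the member records.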
Huge structures actually get a different type of member record, because you can have structures with enormous offsets, spanning gigabytes or something. You don't want to waste space encoding enormous offsets in every structure, so we have a separate sort of record to encode structure members if the structure is vast. This is actually true of the type section as well. If the type size is more than 32 bits can record, we stick another couple of elements on the end to give a 64-bit value. There's no point wasting space in most types for that. So we overlaid them on each other. V4 will probably add another sort of member record for small structures, less than 255 bytes, because most structures are small, so we can save a bit of space. It will probably also have a place where you can record a prefix. For example, if you look at ctf_stype, all the structure members start with the same prefix. This is very common in C for a ridiculous historical reason, so we should probably exploit it. The trick about doing things this way is that if we have no constant prefix, we just don't fill in the constant prefix member, and the space taken by the compressor to record this is about zero bytes, so it compresses well. Compressing zeroes always works.
There are a few unusual sorts of types which don't actually appear in C, but do appear in CTF. There is a distinction between the storage format on disk and what users see when they want to see what a type system is like. This isn't hidden very well by the API at the moment. When you're creating things, you want to see them. When you're looking them up, you don't. So we will probably improve this in the future, compatibly. At the moment, we only have slices. You can say this slice of this integer is three bits wide and starts at bit seven. It's used for bit-fields and nothing else.
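As a small illustration of what "three bits wide, starting at bit seven" means, the sketch below models a slice as a base type plus a bit offset and width, and pulls the corresponding bits out of a storage word. The struct `slice_sketch`, its field names and widths, and the convention that bit 0 is the least significant bit are all assumptions for this example, not the on-disk CTF definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slice record: a window onto an underlying integer type. */
typedef struct {
  uint32_t base_type;   /* type ID of the underlying integer */
  uint16_t bit_offset;  /* first bit of the slice (bit 0 = least significant) */
  uint16_t bit_width;   /* number of bits in the slice */
} slice_sketch;

/* Return the bits of STORAGE that the slice describes, shifted down to
   bit 0.  The slice must fit inside a 64-bit word. */
static uint64_t slice_extract(const slice_sketch *s, uint64_t storage)
{
  uint64_t mask;

  assert(s != NULL);
  assert(s->bit_width >= 1 && s->bit_width <= 64);
  assert((unsigned) s->bit_offset + s->bit_width <= 64);

  mask = (s->bit_width == 64) ? ~UINT64_C(0)
                              : (UINT64_C(1) << s->bit_width) - 1;
  return (storage >> s->bit_offset) & mask;
}
```

A slice only describes a type; the extraction function is there to show which bits such a description selects when a debugger reads a bit-field's value.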
V4 may add delta structures, if they turn out to work, which would let you say: this structure looks like that structure over there, but it doesn't have these members, it's got these members instead. You can delete a few members from it and add a few. This would probably save space if you have translation units with a lot of similar structures in them. At the moment, I'm stymied by not being entirely sure how to figure out when to use deltas, but I'm going to try to add them and see if it works.
A couple of examples. This is using C99 designated initializer syntax. If you're not familiar with it, this is just a structure member name after a dot. This is roughly what an integer looks like. The name would be the offset of the string "int" in the strtab. The info would have a kind of integer, and it is root-visible, so you can look it up by name and get this type back, and it has a vlen of zero. The type is four bytes long. The variable-length data, it so happens, is also always four bytes long for an integer. That's got nothing to do with ctt_size. It simply says, in this case, that the integer is a signed integer. Offset and bits will almost always be zero. We're probably going to eliminate them in format V4.
A more complicated example: a structure with two members. Much as before, ctt_name would be the offset of "foo" in the string table or in the ELF string table. I think we're going to do the string tables later. The high bit of the string table offset is one if you want to look it up in the ELF string table and zero if you want to look it up in the CTF one. The vlen is a bit different for structures. It's the number of structure members. The size is the size of the structure. The difference here is quite obvious. The size is the size of the type. vlen describes the size of the description of the type in the CTF. In this case, the variable-length data is two instances of struct ctf_member_v2, each giving a name.
Each member also records the offset, in bits, not bytes (this provides another way to encode bit-fields, the recommended way in fact), and the type, which is a reference to a type ID, which identifies an entry in the type section array. So the types chain to each other and you can chase them down.
You can have more than one dict at once. A dict is a single gigantic C scope, and this is sometimes not enough. For debugger uses, you often want to pile as many types as you can together, so that users can cast from one thing to another without worrying about whether they're visible from the translation unit they're looking in at the moment. But not always. Sometimes you might have types of the same name but with completely different definitions. Emacs, given as an example earlier, has lots of these. A single CTF dict doesn't support these, but dicts can chain to other dicts as parents. It's a two-level tree, and child dicts can refer to types in their parents. There is an archive format, which groups lots of CTF dicts together under names. It's not exactly ideal, but this is, at the moment, what the linker emits into the CTF section if it finds ambiguous types and needs to emit more than one dict. In future, we will change this while keeping the same API, so that libctf users don't need to change anything. The problem with doing things this way is that the dicts all have their own string tables and their own symbol tables, and they're not deduplicated together. It works okay, but I don't like the archive format, speaking as the person who came up with it, and we're going to change this to something else, compatibly.
The easiest way to use all this stuff is via libctf. It's part of binutils. It's shipped whenever you've got binutils. You might need the binutils development package installed. The API is grouped into a few big chunks. You can create dictionaries and add types to them via the ctf_add functions. If you're familiar with the old Solaris libctf, this was a very bad API in Solaris.
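The two-level parent and child arrangement described above can be sketched as a lookup that sends some type IDs to the child dict and the rest to its parent. Everything here is hypothetical: `dict_sketch`, `lookup_type`, and the `CHILD_BASE` split point are illustrative stand-ins, and real libctf users would go through the library's own lookup functions instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory dict: type "records" are just names here. */
typedef struct dict_sketch {
  const struct dict_sketch *parent;  /* NULL if this dict is a parent */
  const char *const *type_names;     /* stand-in for real type records */
  size_t ntypes;
} dict_sketch;

/* Assumption for this sketch: IDs at or above CHILD_BASE live in the
   child dict; lower non-zero IDs live in the parent (shared) dict. */
#define CHILD_BASE 0x1000u

/* Resolve a type ID as seen from dict D.  Returns NULL if the ID does
   not name a type visible from D. */
static const char *lookup_type(const dict_sketch *d, uint32_t id)
{
  const dict_sketch *p;

  assert(d != NULL);
  if (id >= CHILD_BASE) {
    size_t idx = id - CHILD_BASE;
    if (d->parent == NULL)           /* a parent has no child-range types */
      return NULL;
    return idx < d->ntypes ? d->type_names[idx] : NULL;
  }
  if (id == 0)                       /* 0 means "no type" in this sketch */
    return NULL;
  p = d->parent ? d->parent : d;     /* low IDs always resolve in the parent */
  return (size_t) (id - 1) < p->ntypes ? p->type_names[id - 1] : NULL;
}
```

The point of the split is that a member of an ambiguous `struct foo` in a child dict can still reference `int` in the shared parent by a plain integer ID, without the child carrying its own copy.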
The Solaris API got exponentially slower the more you added. Now you can add millions of entries and it's just as fast as adding a few. There's ctf_link, which lets you take multiple dictionaries, add them into a bigger one, and then write the bigger one out. LD uses this to link dicts together. It deduplicates them as well, although you need binutils master for that. Many of these things only work on newly created dictionaries, which are writable. You can't modify dictionaries which have been written out once you read them in. But what you can do is query them with the ctf_type functions, which let you wander around, chase types to other types, and so on and so forth. There have been a lot of improvements in this area, which I'll go into later. You can open and close dictionaries, and you can open whole ELF objects and it will look up the CTF section for you. There are a few functions to do with CTF archives, and there are iterators that walk over all types in a dictionary, all members in a structure, all enumerators in an enumeration, that sort of thing.
As for the API in any released version of binutils, we don't plan to break it. There are going to be some renamings to make things clearer, but old code will still compile and old code will still work. Some of the names are terrible. Instead of ctf_arc_open_by_name, ctf_dict_open is a much better name: it opens a dictionary. But the old names will always work.
What have we done in the last year?
The linker has turned from a fairly terrible non-deduplicating linker into a linker that does deduplicate types, which is something DWARF never quite got working. It hashes type definitions to distinguish them from each other, and mixes in the types they refer to. Types with the same definition always end up with the same hash, no matter what translation unit they're in, and get emitted once. They also end up in a shared dictionary unless they're ambiguous, in which case they go into a child. Most types end up in the parent nearly all the time, and we can deal with cycles and so on and so forth. There is a link at the end of this talk to another talk which goes into great detail, more detail than you could possibly ever want.
We don't track what type comes from what translation unit, because that costs space and most types are visible in most translation units. In practice, debugging users don't care. They just want to be able to see the types and use them. We track them only if they conflict. If a type is ambiguous, it will end up in child dicts named after the translation unit, and then you'll only see it if you look in that translation unit.
There are actually a couple of alternative ways to distribute types. The common one is that all types which are not ambiguous go into a single shared repository. Here you see struct bar and int quux. int quux only appears in one translation unit, but it still gets moved into the shared dictionary. This is in fact a bad example, because it is a variable, and I should probably have used a bunch of structs because they are actually types, but imagine int quux were a type. Then there is an alternative where every type that appears in only one translation unit goes into its own dictionary. This is probably mostly not useful, but if you have a program in which
translation units frequently use massive types that they use nowhere else, this might save time when you're opening the dictionary, and avoid bringing massive numbers of useless types into scope when you're trying to look things up. If you're using this, you probably don't want to use the normal linking. There are some extra features I'm not going into which let you group translation units together, so the dicts are not one per TU, but this is obscure and I don't expect most people to use it. It's only for enormous projects, really.
Other new things. There has long been a way to iterate over things. Even the Solaris implementation had this. You could call something to iterate over structure members, and it would call a function for you with every member, one at a time. When I came to write the deduplicating linker, which used this a lot, it was incredibly clumsy, because a new function was called for every structure member. All of these needed new variables. You needed to pass arguments down with special structures if you wanted to share a variable. It was so clunky. So I came up with a new sort of iterator, modelled on Python generators and things like that, and frankly modelled on C for loops, which returns a new value on every call. So for structure members, you call it repeatedly and you get a bunch of structure member names and structure member offsets back. It's much, much easier to use. I have an example in theory, but I'm not sure how the screen sharing works if I switch windows, so that's all.
One of the uses for this is an error and warning stream. If you call ctf_errwarning_next when you get an error out of libctf, it will hand you back a stream of human-readable error messages, translated into an appropriate language if the translation is available. Not everything has human-readable errors and warnings yet. It's mostly there for the linker at the moment, because when there are problems you want to
know what symbols caused them and that sort of thing. Assertion failures also go there. We will never crash your program, unless possibly we're very unlucky and you run out of memory. We prefer to return the ECTF_INTERNAL error and spit the assertion failure into the error and warning stream. We don't want the linker to crash because of a problem in the debugging format.
Then there is new functionality which is so new that it hasn't been debugged or pushed yet, but hopefully will be debugged and pushed by the time this presentation goes live. The function info and data object sections mentioned earlier will get emitted. The format is changing, but I'm not bumping the version number, because while the compiler tried to emit the old format, it actually got it wrong, so no existing version of libctf could read it. That is convenient, because it means we can be certain that the old format is completely unused and we can change it freely without bumping the format version. There are a few new API functions to look up the types of symbols and to iterate over all symbols. You can say: given this CTF archive, tell me which dictionary inside the archive has the type of this symbol, and give me the type back. And there are functions to add symbols to dictionaries as well, because how else would we create them?
ctf_lookup_by_symbol has changed. If you're used to it in the Solaris world, you could only call it on data objects. Now you can call it on a function, and you get a function pointer type back. All public and visible functions in the program get added to the type section as CTF_K_FUNCTION types, because they are visible types in the program. Every function can be used as a function pointer, and so we should add it to the type section. This is very different from older implementations. It hasn't been debugged yet. It still crashes. It won't crash by the time it's pushed.
What are we doing in the future? There have been a number, I
mentioned earlier in the talk a lot of changes in format V4 for compactness improvements. There are other improvements coming. GNU C can encode some things that we can't represent in V3, mostly type and function attributes, but not entirely. We can't represent enumerations with more than three billion, four billion enumerands, for example. Not terribly likely, but they could happen, so you should be able to encode them.
We plan to improve compression. We have done some small trials of LZMA-compressing everything. My only question there is that LZMA is not a required part of binutils, unlike zlib. What happens if a user who builds binutils doesn't have LZMA? We would then have a CTF section they couldn't read, and I'm not entirely sure what you do in that situation. It has to be thought about.
We're going to add namespaces to CTF dictionaries, so that you can divide a dictionary into namespaces, iterate over the namespaces, and look things up in a specific namespace. This means we can drop CTF archives for their use in ELF files and just have one CTF dict which covers every translation unit, sharing the symbol table and sharing the string table. It would be a significant compactness improvement. It would still appear to users as the ctf_arc functions and so on, but the internal representation would change.
We want to add a backtrace section. This has been a goal from the start. It would let us describe where parameters are stored without needing an entire virtual machine to evaluate expressions and location lists, like DWARF does. The idea is that we hope to be able to represent the simple 99% of function calls while not needing enormously complicated readers. It would obviously refer to types by CTF type ID, pointing into the type section like everything else in CTF. We're still designing it. I'm pretty sure it can be done. It should again be more compact than DWARF, but most of the compactness improvements will probably come from using the CTF type section in the first place. And the final
thing we want to do is use it in more things. It can be used in GDB. We want to add support for it to pahole and the various other dwarves tools. But we can do more than that. If libraries gain a .ctf section, which is easy (you just compile them with -gt and this section appears in the output shared object), you could use this section to automatically generate header files. We can't yet generate C code from CTF, but it's perfectly doable. You need format V4 for this, because we need to be able to represent parameter names, but that's all.
We might be able to have ld or ld.so exploit the sections to detect ABI problems. This would need improvements to libctf so it didn't malloc so freely, but it's perfectly doable, and it seems like the sort of thing which would be useful, because we could spot not just that this symbol has changed version, but that this symbol refers to this type, deep, deep down inside a structure which is inside a structure member, which is different from the other one. Because of the deduplicator, this is actually really fast. You can rely on the fact that types have the same ID if they are the same type, and different IDs if they are different types. It's a comparison of one integer.
Once we have a backtrace section, we need to teach GDB how to use it, and probably Valgrind as well, and other things that want to print backtraces fast.
And as a completely blue-sky idea: there are already built-ins in GCC, typeof and so on, which query things about types in C programs. They're pretty restricted. You should be able to get more info about types in C programs, like you can in other modern languages that support introspection. Once you've got that, why can't it use CTF to ask about types in other translation units and in other libraries? It's a bit blue-sky. It requires fairly significant changes, but I don't see why it's impossible. There's a link to an LWN article with even madder ideas. I encourage people to come up with more ideas. What could you do if C is
completely introspectable and you are completely free to look up any type at any time?
Also, of course, one last thing we want to add is other languages. The first question we were asked is: what about C++? Oh god, it's a nightmare, but it seems possible. Probably most of the file format would have to change, and we would need a C++ version of the CTF type section. But why not? That's for the future. For now, this is what we've done. Any questions?