So, I'm Dmitry Levin, and today I'll be talking about set versions for package dependencies. This is a thing we implemented in our repositories and distributions in 2010 to address the problem of shared library updates.

So, what's the problem? Every time a shared library is updated — or, not just then: when a client of a shared library is updated, or even when a new client of a shared library is installed — there is a risk of incompatibility between the shared library and some of its clients. In repositories this happens when a shared library is updated. It's a much rarer case when an application in a repository is updated, because it's usually compatible with the shared libraries it used during the build; it can happen, but mostly it's shared library updates. In installations you can get these incompatibilities in more ways: when you're updating from a not very good package repository; when you're doing a selective shared library update, just updating the library and not everything that was built with it; when you're doing a selective update of a client; or when you're installing a client from somewhere else, not necessarily from the same repository. So there is a kind of unavoidable risk that an incompatibility will happen, and we want to control this, both at the repository level, where we would like to stop it, and in installations.

So what kinds of ABI incompatibilities are problematic? Spoiler alert: any incompatibility is really problematic, but some are more problematic, or at least you encounter them more often. I separated them into three groups, by how you can detect them.

First, the biggest group is those that can be detected at the ELF level: when ELF symbols are removed or added, or when the soname changes — that kind of incompatibility. By the way, not just the removal of symbols is problematic, but also addition, because when you're adding a new symbol to an already existing symbol version, or just adding an unversioned symbol, this breaks forward compatibility, which is important when you're installing third-party clients, for example, since they could be built with a newer library than the one you have. And you shouldn't really change the version of a symbol, but some projects are not aware of this and do it anyway; changing a version is almost as bad as removing the symbol.

Second, there are incompatibilities that can be tracked at the DWARF level, like changes of function signatures in an incompatible way, or of types of variables. What is an incompatible way? For example, the number of arguments of a function changes, or the size of a variable changes. It's not as easy as it sounds to define formally what an incompatible change is; I will talk about that a bit later.

And third, all other kinds of ABI incompatibilities that are not reflected at the ELF or DWARF level — when just the semantics change. For example, there used to be a memcpy implementation in GNU libc that on traditional architectures worked like memmove, and then it was changed because it wasn't fast enough. That behavior wasn't formally part of the ABI contract — memcpy was never documented to work like memmove — but anyway it was changed, and from the application point of view it was an incompatible change: there were a lot of applications relying on that undefined behavior. That kind of thing.
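Going back to the first, ELF-level group: here is a minimal sketch of the kind of check a tool can do, diffing the exported dynamic symbols of two builds of a library. This is my illustration rather than our actual tooling; the library file names are made up, and it assumes a GNU nm recent enough to support --with-symbol-versions.

```python
import subprocess

def exported_symbols(path):
    """Exported dynamic symbols of a shared object, as 'name' or 'name@@version'."""
    out = subprocess.run(
        ["nm", "-D", "--defined-only", "--with-symbol-versions", path],
        capture_output=True, text=True, check=True,
    ).stdout
    # each line looks like: "0000000000001234 T SSL_library_init@@OPENSSL_1.0"
    return {line.split()[2] for line in out.splitlines() if len(line.split()) >= 3}

old = exported_symbols("libssl.so.1.0.old")   # hypothetical file names
new = exported_symbols("libssl.so.1.0.new")
for sym in sorted(old - new):
    print("removed:", sym)   # breaks existing clients (backward compatibility)
for sym in sorted(new - old):
    print("added:  ", sym)   # breaks forward compatibility for third-party clients
```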
So why do all these problems arise? It's because shared libraries are too easy to produce: you just add an option to GCC, and if you're using autotools it's as simple as changing LIBRARIES to LTLIBRARIES and .a sources to .la sources. It's too easy to make them, but maintaining ABI stability is really hard: it requires intelligent design and quite some technical skill, and the entry threshold is really high. The classic paper on this is "How to Write Shared Libraries", and the part that describes maintaining APIs and ABIs is eight pages of technical text, so I would say the threshold is pretty high. That's why a lot of projects are not aware of all this; they do whatever they think is appropriate, and in most cases it's wrong.

Even experienced people can miss the problem. I have a lot of examples of ABI breaks, so I'll just show you one, from 2016, when there was a minor version update of the OpenSSL library that was expected to be just a security fix — one of those security fixes that have names nowadays. It was just a minor update that removed some SSLv2 code from the defaults, but it also removed some functions from libssl, and that was clearly missed by the maintainer, because he didn't expect that kind of thing in a minor update. In the case of OpenSSL he probably should have expected anything, but anyway, it was too easy to miss without proper tooling. We were very close to repeating the same mistake in our repository, and I'm showing you this example because set versions caught exactly these removed symbols. The next day the missing interfaces were resurrected, but elsewhere quite a few people were caught in the meantime, with no diagnostics: the application just doesn't start anymore.

So what can be done about it? The first line of defense is, of course, the classroom: we should educate our students in how to write shared libraries properly. But that's not going to happen with all students, I'm afraid. The second line of defense is our repositories: we should have tooling to detect all these ABI incompatibilities as early as possible. That's not really easy — I mean, it's hardly possible without human intervention, because while we have a lot of tools that compare ABIs of libraries, they have false positives: they report an incompatibility, and then a human eye can spend a few hours and tell that the versions are actually compatible. So it's easier said than done. But at least some simple things we can do: we can ensure that applications that use interfaces from libraries are actually linked with those libraries, and we can have tooling in repositories to check that no application in the repository is linked with two different versions of the same library. When the soname of a library changes and not all of its clients are rebuilt, there is a chance that in some client two versions of the same library will be loaded at once, with very bad consequences, due to the very powerful ELF symbol resolution algorithm.

But the repository defense line is not enough, and we have to implement probably the last line of defense in every installation: when the dependency solver works, it should at least ensure that every library that is required is provided by something, and that every library interface that is required is provided. I think nowadays all repositories implement this — both dependencies on libraries and on library interfaces. But the one thing that is probably implemented only in our repository is the check that every ELF symbol required from a shared library is actually provided by that shared library. This is called set versions.
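As a rough sketch of that two-versions repository check — again my illustration of the idea, not our implementation, and the binary path is hypothetical — one could walk an executable's dependency closure with ldd and flag any library whose stem appears under two different sonames:

```python
import re, subprocess
from collections import defaultdict

def sonames_in_closure(binary):
    """All sonames ldd resolves for a binary, i.e. its whole dependency closure."""
    out = subprocess.run(["ldd", binary], capture_output=True, text=True).stdout
    return re.findall(r"^\s*(\S+\.so\S*)", out, flags=re.M)

def soname_conflicts(binary):
    by_stem = defaultdict(set)
    for soname in sonames_in_closure(binary):
        by_stem[soname.split(".so", 1)[0]].add(soname)   # libssl.so.1.1 -> libssl
    return {stem: names for stem, names in by_stem.items() if len(names) > 1}

# e.g. {'libcrypto': {'libcrypto.so.1.0.0', 'libcrypto.so.1.1'}} would be a red flag
print(soname_conflicts("/usr/bin/some-app"))
```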
Other implementations could be imagined, though. The first naive approach would be to just put all the undefined symbols into the requires of packages, and to put all the symbols that could be used for resolving them into the provides. This is more or less similar to the behavior of the dynamic linker. Why do I say "more or less"? Because it's really impossible to reimplement the dynamic linker inside a dependency solver — it's really complex, and any attempt will be a simplification — but it's more or less how it works. The trouble is that these provides and requires would be prohibitively big, and resolving with them would be too slow to be acceptable.

A slightly different approach would be to associate these symbols with libraries. It's also naive, because symbol names can be very long: in C++ and other languages that use namespaces and so on, symbol names can be as big as you can imagine, and even longer. While it's somewhat faster than the first approach, because all the names are associated with particular libraries, it's still too big, because of the unlimited length of symbol names, and resolving is therefore still too slow to be acceptable.

So the next step is a probabilistic approach: store not the whole symbol names, which are too long, but just hash values of these symbols, and choose the hash function in a way that produces an acceptable false positive rate. What is a false positive here? When you hash strings into values, there is a chance that different strings will hash to the same value, so when you check whether a particular string is present in the set of hash values, a different string hashed to the same value may be found instead — that's a false positive. It's not that risky, because you can actually control the false positive rate by choosing a proper hash function. This way you can reduce the size of provides and requires to a level you can actually work with. But besides false positives, you lose the actual symbol names: when the solver detects that requires are not met by provides, there is no information about the actual symbol names — it just says they are not met.

What can you achieve in theory with a probabilistic approach? If you choose some false positive rate you can live with, there is an information-theoretic minimum, which is the binary logarithm of the inverse of that rate: for a rate of 2^-10 it is exactly 10 bits per string. That's the theoretical minimum; if you use something popular like a Bloom filter, you get an additional multiplier of about 1.44, so for the rate of 2^-10 we use, it's about 14 and a half bits per string. For set versions it's much better: the overhead is not a multiplier but an addition of about 1.5 bits, which is roughly 3 bits per string better than a classic Bloom filter. And the complexity is very good — O(size of requires plus size of provides) — and because these requires and provides don't depend on the length of the symbol names, they behave really well.

So how are these set versions implemented? It's a combination of a hash function, whose size is chosen based on the number of strings — the function and variable names. It's the Jenkins one-at-a-time 32-bit hash, downgraded to the size that is just enough to guarantee the false positive rate we want: for the case of 2^-10 it's 10 plus the binary logarithm of the size of the set.
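To make that concrete, here is a small sketch of the hashing step — the classic Jenkins one-at-a-time hash truncated to m = 10 + log2(n) bits for a set of n names. It's my reconstruction of the idea, not the optimized rpm code, and the symbol names are just examples.

```python
import math

def jenkins_one_at_a_time(data: bytes) -> int:
    """Classic Jenkins one-at-a-time hash, 32-bit arithmetic."""
    h = 0
    for byte in data:
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h

def hash_set(symbols, fp_bits=10):
    """Downgrade each 32-bit hash to m = fp_bits + ceil(log2 n) bits."""
    m = fp_bits + max(1, math.ceil(math.log2(len(symbols))))
    return m, {jenkins_one_at_a_time(s.encode()) >> (32 - m) for s in symbols}

m, hashes = hash_set(["SSL_library_init", "SSL_load_error_strings", "SSL_free"])
print(m, sorted(hashes))   # m = 12 here: 10 for the rate, 2 for the set size
```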
Then this set of hashes is sorted, delta-encoded, and compressed using Golomb-Rice encoding, which is natural for this kind of data, and finally it's converted into an alphanumeric representation using base62 — 62 is just 26 times 2 plus 10 — which is why these sets are human-readable. For decompression you do essentially the same in the reverse order: you decode the base62 strings and then decompress them using Golomb-Rice coding.

This is how it looks in our system in practice. I've just taken an arbitrary package, a small utility; it depends on three libraries, and its set-versioned dependencies look this way. Compared to the dependencies you're probably familiar with, you can see not just a soname but also a set, and after the colon you see the set version encoded in base62, in human-readable form. In the corresponding provides, as you can see, the sets are much longer than in the requires: it's often the case that clients use only a subset of the interface provided by their libraries.

How good is this compression? Remember, the theoretical minimum is 10 bits per symbol, and the minimum for set versions as implemented is 11.5. You can see by this red line that about half of these libraries have a bits-per-symbol ratio of about 12, and almost all of them have a ratio lower than 12.6. So it's more than the theoretical minimum, but still much less than you would get using Bloom filters. Also in this picture you can see that most libraries are actually not very big: half of them have fewer than 130 provided functions and variables.

Summarizing, what are the pros and cons of this approach? You get the guarantee that every ELF symbol required from a shared library is provided, and the check is performed not at run time but at package resolution time, at the beginning of the transaction, and its performance is quite good. At the same time, the check is still probabilistic — though you can control that and choose a rate that is acceptable for you — and it takes time: no matter how fast it is, it will be slower than no check at all, but that's probably unavoidable. The provides are still quite big for big libraries, and even for not very big libraries they look bulky. Most importantly, there is no error diagnostic that contains symbol names, because all symbol names are lost: we operate on hashes. And base62 is a funny thing: it produces a nice, compact ASCII representation, but not as compact as you would like — base85 would save about 8% of the final size.

If you are thinking about implementing all this stuff, you should be aware of the obstacles, which are code complexity and integration complexity. We use math, we use some not very common algorithms, and the implementation is heavily optimized for performance, so it would take quite a lot of effort and time to understand. Some people are really not ready to invest their time into complex things; for example, the rpm.org maintainer once admitted that this implementation was too clever for his taste and just refused even to discuss it. As for integration, the problem is that various projects tend to implement comparison of package versions and releases themselves, and they are not aware of this scheme, so either you would have to teach them to use the algorithms provided by the operating system, or make them carry this complex code themselves. So it's not that easy.
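To give a feel for the moving parts — and for why the optimized C implementation is nontrivial — here is a toy sketch of the encoding pipeline described earlier: sort the hashes, delta-encode, Golomb-Rice-code the deltas, and render the bit stream as base62 text. The real on-disk format differs; this only illustrates the technique.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 2*26+10 = 62

def rice_encode(hashes, k):
    """Delta-encode sorted distinct hashes; Rice-code each delta as a
    unary quotient (q ones, then a zero) plus a k-bit binary remainder."""
    bits, prev = [], 0
    for v in sorted(set(hashes)):
        delta, prev = v - prev, v
        q, r = delta >> k, delta & ((1 << k) - 1)
        bits += [1] * q + [0]
        bits += [(r >> i) & 1 for i in reversed(range(k))]
    return bits

def to_base62(bits):
    n = 1                       # sentinel bit so leading zero bits survive
    for b in bits:
        n = (n << 1) | b
    out = []
    while n:
        n, d = divmod(n, 62)
        out.append(ALPHABET[d])
    return "".join(reversed(out))

# with m = 10 + log2(n), gaps average about 2^10, so k = 10 is a sensible
# Rice parameter and the cost lands near the 11.5 bits per symbol quoted above
print(to_base62(rice_encode([1234, 5678, 90123], k=10)))
```

Decoding runs the same steps in reverse, matching the decompression order described above.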
This is more or less what we have since 2010. It's quite stable, but we have some ideas for the future; this is really work in progress. Replacing base62 with something more compact is quite easy. There are also some results on optimizing the selection of the Golomb-Rice parameter, if you know what that means — it's traditionally called M — it's actually a tuning of the Golomb-Rice coding, and it would be backwards compatible. We also have ideas for how to detect ELF symbol version changes, but it's quite complicated because of the resolution process: unversioned symbols can be resolved either to an unversioned symbol or to a symbol with any version, and from the other side, versioned symbols can also be resolved to unversioned symbols. So it's not that easy. We also have some ideas about detecting DWARF-level incompatibilities: using signatures instead of function names — not raw signatures, but somewhat reduced signatures that ignore insignificant differences. We're still thinking about what counts as insignificant: signed int versus unsigned int is mostly an insignificant change in practice when you're talking about ABI, and on some architectures the difference between long and long long is insignificant because they have the same size; they are not exactly the same type, and the compiler can detect the difference, but at the ABI level they don't differ. These are just ideas. I have some links at the end of these slides, and the slides are uploaded, so you can have a look at these papers and at the code some time later — probably not now. That's more or less all I would like to tell about this. If you have questions, please go ahead.

Q: Is the server or the network connection the bottleneck — is that why an index of full symbol names would be too large?

A: Yes, the index would be too large, and the check would be too slow, because it depends on the size of the whole thing. Comparing hashes is much faster than comparing arbitrary strings. And yes, the indexes would be too large.

Q: My approach would be to build the hashes at index time; that would basically mitigate this problem, because then I can compare hashes once I've calculated them for the index — it's only the job of the package manager — and then I only have the problem of the index size.

A: Practice shows that when you try to put all these function names into indexes, they grow too large — really too large. This is a 2010 project, so it was even more important eight or ten years ago. You can try this at home, I mean, put all the names into your indexes and try to live with it.

Q: We maintain OpenWrt, and we do something like this, but at the level where we pin versions: we specify the build versions of the package where we force a specific version, because we have a consistent repository — that's the OpenWrt way. But we also have a database that is just simple files with versioned symbols dumped out, and it's not that big: if I remember correctly, around four megabytes or something like that. Maybe in 2010 that was big, but now... how many libraries do you have?

A: In our repository there are more than 6000 libraries.

Moderator: Just one more question, probably — we are almost out of time.

Thank you for coming.