So there is some buzzword branding in the title: data-centric, machine learning, extreme compression... but it is not all bluff. This talk is really about ZIP archives and compression.

We begin with a small experiment. First we take a text and we compress it, so you get a bunch of random-looking bits, shown here as pixels. Then you concatenate text one and text two, and you get a somewhat larger bunch of bits. The third image is only text two, compressed alone. As you see, and as you may have guessed, the compressed part that follows text one is smaller than text two compressed alone, because the algorithm has learned something from text one. Of course text two is similar to text one; that is why it works, otherwise it would be the other way around.

You see it better with some animation. I swap between A and B, that is, between text one alone and text one with text two appended, and you see that this part is identical. You may spot a few differing bits, but they are part of a header, before the compressed stream; they indicate that this part will contain the compressed version of text two. The interesting thing is that it is much smaller than compressing text two alone, without prior knowledge. The conclusion is, as with sports and other things: compression is better with a little training.

As I see it, there are basically two ways of doing that (perhaps there are more). Either you compress text one and you save everything from the compression algorithm: the dictionary, the probability model, and the state of the finite state machine. You save all that, and this information can actually be a very large amount of data, much larger than the thing to be compressed. Or you use prefixed data: you do exactly as in the experiment, you feed the algorithm with text one, but you won't use that output live; it is only there to train the compression algorithm, and text two is what you actually want to compress.

The advantage of the lazy way, way two, is that you can reuse a normal compression algorithm and leverage it as is. So what I have done is a package that you can plug onto a compression algorithm and it does the job; you don't need a complicated API or complex data structures for saving the internal state and so on. There is one disadvantage, of course: the compression side takes longer, because you first have to compress what was text one in the experiment. That can be time-consuming, but sometimes you don't mind; often you are more interested in decompression.

Where can you use this trained compression? For instance, you anticipate having a large number of data sets that are all similar to each other to some extent. They are not the same, of course, but you can take a sample and use it as training data. Then you have thousands of different files that you don't know in advance, perhaps arriving over the internet, and you will always feed the trainer file first to the algorithm. As in the experiment, you will save storage: if you have millions of these "text two" files, it can be a lot of storage, or you can save transmission time. I kind of know an application on this side of the room.
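As a rough illustration, the experiment can be reproduced with a few lines of Python and zlib; this is not the speaker's setup, and the file names are placeholders for two similar texts.

```python
import zlib

# Placeholder files: text1 is the training text, text2 is a similar text.
text1 = open("text1.txt", "rb").read()
text2 = open("text2.txt", "rb").read()

alone = len(zlib.compress(text2, 9))
# Approximate incremental cost of text2 when it is appended to text1
# and both are compressed as one stream.
with_prefix = len(zlib.compress(text1 + text2, 9)) - len(zlib.compress(text1, 9))

print("text2 compressed alone:      ", alone, "bytes")
print("text2 compressed after text1:", with_prefix, "bytes (usually smaller)")
```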
And perhaps there are other advantages to be found, but this is what I have found while developing it.

Now, how it works, a bit graphically. Here you have the compression side and here the decompression side. First you feed your training data, text one, into the compression engine, and then your data. I need to say that it must be a streaming compression algorithm, nothing that reorders the stream, otherwise it won't work. But if it is streaming, like LZMA or Deflate, an LZ-based algorithm for instance, you will first get the compressed training stream. You discard it, you don't need it, but now the algorithm is trained: the probability model is trained and the dictionary is filled with nice words from text one. Then comes your data, and here you have a big wall, a hurdle, and you need to send the data over it. It is slow or expensive or whatever, it is over the internet, I don't know, but you want better compression for that, so your compressed data is shipped there.

What happens on the decompression side? First you feed in the compressed training stream. Of course this one was not shipped over the big hurdle; it was already known to the computer on the decompression side, so it is shipped only once in the life of that machine. So you put in the training stream first, and then the compressed data, which was of course unknown to the decompression side. As you can guess, you get the training data decompressed, you discard it, you don't need it, and then you get your data decompressed, as you wanted.

Here is the Ada specification of this trained compression. I didn't do any fancy object-oriented thing; it uses only generics and it is very simple: you provide a function that delivers bytes and a procedure that writes bytes. The difference from similar packages is that you have two inputs, the training input and the data input. On the decompression side you also have two inputs, one for the compressed training stream and one for the compressed data; you need two functions because one reads locally, from the local computer basically, and the second one receives the interesting data from the big antenna. Once you have provided these functions and procedures for input and output, you have Encode on the encoding side and it does the job, and you have Decode on the other side (a rough sketch of this pipeline, in another language, follows at the end of this part).

You were certainly impatient to see some results, so I begin with data where it is not too impressive. I have CSV files, things like economic data that you can download on the internet, so it is all public; it is also all in the SourceForge repository and the GitHub mirror if you want to test it yourself. With CSV files you have basically random data but a very restricted alphabet: digits, commas and ends of lines, so there is not much that the training can bring, not a lot at least. Then on the left side you have Excel files, the old binary format with plenty of metadata, and there you see it is much more efficient, because all this metadata is present in every Excel file, so the training brings a lot; perhaps only the last 5 or 8 percent is the real information squeezed out of these Excel files. In the middle I have compressed the encoder using the decoder as a training file; of course they are similar, because both contain the GNAT runtime and perhaps debugging symbols and so on, and you get a much better compression by training the algorithm. Here it is a Windows executable, so it could be different with a different format, but it is almost one quarter of what you would get with LZMA alone, which is already a very good algorithm.
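A minimal sketch of the training pipeline described above, using Python's zlib rather than the Ada package (the function names and the Z_SYNC_FLUSH trick are illustrative assumptions, not the library's API): the training text is pushed through the compressor first, its compressed form stays local on both sides, and only the compressed data crosses the hurdle.

```python
import zlib

# Known to BOTH sides in advance (the "text one" of the experiment).
TRAINING = b"shared sample: the quick brown fox jumps over the lazy dog. " * 20

def compressed_training_stream() -> bytes:
    # Deterministic for a given zlib version and settings, so each side can
    # regenerate it locally; it is never transmitted.
    co = zlib.compressobj(9)
    return co.compress(TRAINING) + co.flush(zlib.Z_SYNC_FLUSH)

def compress_side(data: bytes) -> bytes:
    co = zlib.compressobj(9)
    co.compress(TRAINING)                  # trains the model; output is discarded
    co.flush(zlib.Z_SYNC_FLUSH)            # align the stream on a byte boundary
    return co.compress(data) + co.flush()  # only this part is transmitted

def decompress_side(compressed_data: bytes) -> bytes:
    do = zlib.decompressobj()
    do.decompress(compressed_training_stream())  # decompressed training: discarded
    return do.decompress(compressed_data)

message = b"the data we actually want to send over the expensive link"
assert decompress_side(compress_side(message)) == message
```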
Now, the second part is more about the Zip-Ada library itself. So what is it? We are at the right place here: it is fully open source, and we are in the right dev room, because it is fully in Ada. You don't even need to think about interfacing, which is usually the painful part of mixing languages. And it is fully portable, as long as you have an Ada compiler supporting a certain number of integer types; that is all you need. For portability you have just one set of sources, no magic with conditionals, #ifdef and that kind of thing. In compression software especially you usually see kilometres of #ifdef, #define, #include and so on; there is none of that here.

More details about portability: again, there is no binding, it is standalone, and it uses Ada streams and exceptions, so you can monitor, control and analyse your whole program, including the compression library, with the same tool set. On the right-hand side is a list of platforms where Zip-Ada is in use or has been in use for some time. I didn't check all of these platforms personally, but I trust the people who send me mails about them; if you have another platform, please drop me a line, because the longer the list, the better.

The history of Zip-Ada: it will soon be 20 years old. At first it was a somewhat ugly translation from Pascal, decompression only, and it was not so nice. Over the years I had the good idea, or someone recommended it to me, to put it on an open-source platform, SourceForge, the number one at the time. So in 2007 I put it on SourceForge, and then the magic of open source began, because people started to use it. Soon after, in 2008, I got a nice contribution: a team in a company had done a big chunk of the Zip-Ada library and sent it to me by email. They implemented streams (before that it was files only) and they did the ZIP archive creation side, and they did it very orthogonally: you can have a ZIP archive in a stream, you can have everything in memory or on whatever medium you want, you don't need files at all. So you can have a kind of RAM disk, a static RAM disk, using a ZIP archive as a file system (a tiny illustration of the in-memory idea follows at the end of this part). Then I got other contributions, I added some decompression schemes, and later I added LZMA decompression. That was not so easy, but fortunately there was a well-written, simplified C++ reference decoder for LZMA, written by the author of LZMA, so I jumped on the wagon and translated it into Ada. Then I wanted to go a bit further and have strong compression of my own, so I began with Deflate in early 2016, and then, during the vacation of 2016, I did the LZMA compression. It was actually easier than anticipated, because you can mirror many things from the decoder and swap the bit input into an output. And finally, what I presented just before: the trained compression.
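The in-memory, stream-based archive idea mentioned above (a ZIP used as a small read-only file system, with no files on disk) can be illustrated with Python's zipfile module; this shows only the concept, not Zip-Ada's stream API.

```python
import io
import zipfile

# Build a complete ZIP archive entirely in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("config/settings.txt", "colour=blue\n")
    archive.writestr("docs/readme.txt", "Everything here lives in RAM.\n")

# Later, reopen it from the same buffer and read an entry, like a tiny file system.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    print(archive.read("config/settings.txt").decode())
```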
How does it compare to other methods? There are plenty of libraries; if you check the Squeeze Chart site, run by a compression expert, he has hundreds of different schemes and lists only those that actually work. The problem with compression is that the definition of a good compression is a bit shaky. Some people just want the best compression ratio, but if the data then takes too long to decompress, it is no longer interesting, because you have good-capacity networks anyway; so it is also important to have a good decompression time. Or sometimes not. If it takes really too long to compress the data, that can be a problem as well, or perhaps not, because sometimes you compress something once and it is downloaded thousands of times, so you may or may not care about compression time. Finally, and this is more about embedded systems, there is the question of memory footprint: you want to keep all the internal tables of the compression and decompression within certain constraints.

Here is a picture with two of these criteria: the compression ratio, the smaller the better, and, on a logarithmic scale, the compression time. As you can guess, the closer you go to the frontier of compression, the longer it takes. The Zip-Ada library is not the absolute best on this chart, but who knows, perhaps in a few years, with some effort.

A little bit about the internals, for Deflate and LZMA, the things I have implemented relatively recently. They are both two-stage, two-phase algorithms: first a string matcher, the LZ77 part, and then an entropy encoder that squeezes this pre-processed signal. The principle is the same for Deflate and LZMA, but Deflate uses Huffman trees while LZMA uses range encoding, a much more powerful entropy-coding scheme.

Now a comparison, only within the ZIP format, on a fairly standard benchmark data set, the Silesia corpus. Here you have 7-Zip, here Info-ZIP (essentially the same engine as zlib), bzip2, and again 7-Zip but with LZMA compression; in green are the results of Zip-Ada. So it is not too bad. One nice consequence of the two-phase structure is that you can pick and choose the string matcher, the LZ77 part, and see how it works with the entropy coding. Sometimes you get surprises: for instance, I picked for LZMA a string matcher that was meant for Deflate and it works pretty well, and for certain kinds of data I discovered it is better not to run the first phase at all and just send the plain bytes; that beats the string matcher.

Finally, another innovation, let's say, in the Zip-Ada library: a kind of meta-method. It leverages the fact that in a ZIP file each entry, each file in the archive, is compressed individually, so you can pick and choose the algorithm for each entry, depending on its length or the type of data, for instance. Especially the length: I have noticed that files shorter than roughly 9 KB compress poorly with LZMA. If you look at all the tables of the probability model, you understand why: it needs a long stream to train itself. So for an individual file shorter than about 9000 bytes you can just forget LZMA, take Deflate instead, and it is better (a small sketch of this pick-and-choose idea follows after the demo).

That is it for the presentation, and I will make a little demo, hopefully it is interesting. Here you see a ZIP file; it is a spoiler for the next presentation, but I just show you the contents with a user interface in the meantime. You see the contents a bit: I have put in some text, some XML files, a database, and we will see what happens with different schemes. I am just unpacking the whole thing with the command-line version, which is shipped as a demo of the library on SourceForge. And what can I show you: when something is decompressed there is also a hash-code check, so you can be sure that everything is fine.
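Here is the promised sketch of the per-entry pick-and-choose idea, using Python's zipfile rather than Zip-Ada; the roughly 9 KB threshold echoes the observation above, whereas the real pre-selection also looks at the data type and at many LZMA parameter sets.

```python
import os
import zipfile

SMALL_FILE_LIMIT = 9 * 1024   # below this, LZMA's model has too little data to train on

def add_with_preselection(archive: zipfile.ZipFile, path: str) -> None:
    # Each ZIP entry carries its own compression method, so we can decide per file.
    method = (zipfile.ZIP_DEFLATED
              if os.path.getsize(path) < SMALL_FILE_LIMIT
              else zipfile.ZIP_LZMA)
    archive.write(path, compress_type=method)

# Two throw-away sample files, one small and one large, to exercise both branches.
with open("small_note.txt", "w") as f:
    f.write("just a short note\n")
with open("big_table.csv", "w") as f:
    f.write("id,value\n" + "\n".join(f"{i},{i * i}" for i in range(20000)))

with zipfile.ZipFile("preselected.zip", "w") as z:
    for name in ("small_note.txt", "big_table.csv"):
        add_with_preselection(z, name)
```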
Of course you can use the compressed archive directly; there is no magic. For instance, I open this file and you see the charts from before. Now if I recompress, for instance with the fastest Deflate, basically the default scheme, you get a new archive, as expected. Now let's say I am paranoid (I am not): what is this Zip-Ada library really doing? There is a tool called comp_zip which compares the whole contents of two ZIP files, byte by byte, not only the hash codes. So it opens every entry of both archives and checks them: everything is fine. Now I am just showing the options: you have the different compression schemes and, what I have just shown, the latest one, the pre-selection method, the meta-method that picks and chooses. Let's see: for the PDF files, for instance, it chooses LZMA with certain parameters; there are 225 different ways of configuring LZMA, and there are different parameter sets for different data types. And as you see, the latest compression, with pre-selection, is hopefully smaller than the one with Deflate. And that is it for the demo. I have some further slides, but they are a bit technical, so I think I will stop the presentation here; I am sure there are plenty of questions.

Q: A technical question concerning the trainable part of your algorithm. Judging by the timeline it was added within the last year. I would suppose that nowadays people would use recurrent neural networks; that would be the first thing that springs to mind. Then you would ship the topology and the trained parameters, a set of these complementary generative neural networks, to reconstruct the data. Is that the approach, or is the approach simply feeding data into the same compressor?

A: It is very lazy: I just feed the training data into the compressor.

Q: So you just tune some of the parameters of the compressor. What is the pre-training in this case? All compressors begin in a neutral state, so you basically just train to shift the parameters?

A: Exactly, and it seems I am very lazy: instead of saving the internal state, I refeed the data. But probably there are smarter ways.

Q: You could use domain-specific data compression instead of generic data compression; it really depends on the domain, but you can get up to 99 percent. Since this is text-like, textual data we are talking about, that is why I suggested it.

A: Text was only for the experiment; you can put in binary data, you can put in anything, depending on its nature.

Q: But in the chart with the comparison you had Excel files.

A: Yes: I trained with an Excel file for compressing an Excel file, an executable was used for compressing an executable for the same machine, and so on. Of course you leverage that: as a human, when you choose the training data, you know, or you guess, that it will be similar.

Q: The best compressors build a model of the data, but since you are operating with standard algorithms, your models are fixed.

A: Yes, okay; but I had better let others ask as well. Thank you.

Q: Is the training only for LZMA?

A: I have tested it with LZMA because it is the most powerful scheme in the toolbox, but with something like bzip2, for instance, which is a block-sorting algorithm, it clearly won't work, because everything is sorted in huge blocks. The training only applies to compression schemes that are streaming-compliant, if you want: the output stream is in the same order as the input stream, except that it is squeezed, smaller. That was the ambition.
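Going back to the demo for a moment: the kind of check comp_zip performs, comparing two ZIP archives entry by entry and byte by byte rather than trusting hash codes, can be approximated like this with Python's zipfile (a stand-in for the idea, not the Ada tool itself; the archive names are placeholders).

```python
import zipfile

def archives_match(path_a: str, path_b: str) -> bool:
    """True when both archives hold the same entries with identical decompressed bytes."""
    with zipfile.ZipFile(path_a) as a, zipfile.ZipFile(path_b) as b:
        if sorted(a.namelist()) != sorted(b.namelist()):
            return False
        # Decompress and compare every entry, byte by byte.
        return all(a.read(name) == b.read(name) for name in a.namelist())

print(archives_match("archive_deflate.zip", "archive_preselection.zip"))
```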
Q: Although it is a stream... but then you would have the perfect compression. LZMA is adaptive, so as you go through whatever stream it is, it gets better and better; by the time you come to the end, it really knows this file very well. And then you throw away the data instead of transmitting it, or doing whatever you want with it. If you then started processing the same file again, you would... but you would be missing the information on the other side. There is this challenge, I forget the name, where basically you are given a huge file and you are supposed to produce the best possible compressor for it, but the trick is that you have to include the compressor itself in the size estimate.

A: I guess you want to refeed, to restart the compression here. I can tell you what will happen. The dictionary, which you have probably set as large as possible, perhaps large enough to contain the fully decoded data: the string matcher will notice that you have just fed in the same thing again, so it will send a single distance-length code, just the length of your uncompressed data together with the distance, and that's it, one code, or perhaps a few of them if the lengths are limited. But it won't help if you want to send the data to machine B: machine B doesn't know your string. Let me show you the scheme again... oh, I have two instances too many, okay. What you need to send is this "data" item in capitals.

Q: Is that "data" item a mixture of real data and some sort of dictionary, something like that?

A: Not at all: it is just the compressed version of the data, full stop. The first step here is that you feed the training data into the compressor, and then you put in...

Q: Yes, but you don't know the data in advance; if you did, you would need to...

A: Exactly: if the training data and the data were the same, you would not need to transmit anything, because the training data is already there, silently, on the other side.

Q: One of the reasons [that compressor] is so good at source code and web fonts is exactly this: it ships with two dictionaries, one for HTML, JavaScript and CSS I think, and one for web fonts. And it is hidden: there are a number of bytes used during the initialisation that are supposed to be set like this, and these are actually the two dictionaries; they are shipped with the decoder as well.

A: Impressive. So you have some kind of preset dictionary when you want to decompress XML data or other specific data; yes, they have specific dictionaries.

Q: You showed that you have worked on this personal project for about 20 years. Do you have a secret of longevity for your project, or a recommendation for other projects?

A: The recommendation is: go for it, because this is basically stainless, you can still use your code after years. It is almost boring: when a new version of a compiler comes out, you don't need to adapt to it, so instead of fixing compiler or compatibility issues you can start other projects or go ahead with this one.

Q: You didn't say it, but you have this on GitHub as well?

A: Yes, because I have noticed that fashion shifts. It was on the history slide... or at the end of the hidden slides... oh no, did I forget the GitHub link? That is terrible. But you will find it: search for "github Zip-Ada" and it should be okay.
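As a footnote to the preset-dictionary exchange above: zlib exposes exactly this mechanism, so the concept can be sketched in a few lines of Python (the dictionary content here is invented for the example, and both sides must agree on it in advance).

```python
import zlib

# A dictionary both sides ship in advance; in practice it would hold strings
# typical of the data (HTML tags, font table names, XML elements, ...).
PRESET = b'<?xml version="1.0"?><record><name></name><value></value></record>'

message = b'<record><name>espresso</name><value>42</value></record>'

co = zlib.compressobj(level=9, zdict=PRESET)
packed = co.compress(message) + co.flush()

do = zlib.decompressobj(zdict=PRESET)
assert do.decompress(packed) == message

print(len(packed), "bytes with the preset dictionary")
print(len(zlib.compress(message, 9)), "bytes without it")
```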