Okay, so basically I'm very interested in this topic of deep learning, and there are a couple of papers that came out recently. The main one is Deep Compression, which looks at compressing the weights within a neural network by pruning, trained quantization and Huffman coding. Looking at the arXiv website I found quite a lot of other papers building on the concepts presented here. This paper is from a recent ICLR conference (the International Conference on Learning Representations), and even more recently the same group presented an inference engine, so they are actually building their own deep learning box with an ASIC. There is also a study and a technical report, which came out very recently, looking at compressing those weights very, very aggressively; ultimately they target running things on mobile phones or even embedded systems.

A lot of the motivation behind these papers, or this direction of research, is really to reduce power consumption. Looking at applications, they mainly have in mind the vision system you would have in your car, tracking the road and tracking pedestrians. So they really emphasize reducing the size of the weights in order to fit things onto mobile systems. As for pruning and trained quantization, there is nothing really new in those topics: people who used to work on neural networks have known them for a long time. The contribution is the application of them to these deep learning systems.

For those who don't know about such systems, I can summarize them easily: most of the time you start with an image, if it's image recognition, then you go through a number of stages, different layers, with convolution and pooling operations which look at small parts of the image and pass their results on to the next stage, and ultimately you end up with a classification. I won't go into too much detail, but the reference papers are from LeCun's group. A lot of papers refer to some of these nets. Back in the day there was LeNet, the first successful application of neural networks, by LeCun. Afterwards came all the networks popularized by the field of computer vision: AlexNet, and VGG, which stands for the Visual Geometry Group, a group in Oxford in the UK.
They provide all these networks already trained, and they put all the weights on the internet, so you can just download them and have your own system. All this deep learning work is very open: a lot of the data and a lot of the tools are genuinely available for people to try. However, you tend to need some good hardware to really try things out. We will see those AlexNet and VGGNet networks again in what follows.

The papers are basically from Song Han and colleagues, and you can see that there is NVIDIA involvement, because a lot of this neural network work is done on GPUs. Later on, some of the papers compare CPU, GPU, FPGA and ASIC. Ultimately, everyone has access to a CPU; you can access a GPU by buying a card; you can buy an FPGA and program it; and the ASIC is the ultimate level, where they optimize the cores themselves and actually fabricate a chip. The third paper really develops an ASIC.

So the first paper introduces pruning, then quantizing, and there is also weight sharing, a technique to reduce things even further than quantizing the precision of the weights: you keep a small pool of weight values which are shared across connections. The Huffman coding is not really explained that much, but it's basically standard data compression: you build a dictionary, encode against it, and afterwards you expand. Then they run experiments with those networks and discuss all sorts of speed-up and compression ratios. So it's quite an interesting paper.

What you really have to understand is that all of these stages can be done offline: you can take a trained neural net model from the internet and process the weights of each layer, and they show in the paper that they can dramatically reduce the size of those weights. So there is a three-stage compression pipeline. To give you an idea, the pruning reduces the number of weights by 10x, quantization takes it to 27x-31x, and ultimately, after the final compression, they reach a rate of 49x. So it's a really dramatic reduction of the weights. The beauty of it is that when they test the compressed networks, they report no accuracy loss. For people who in the early days had these massive networks and were struggling to train and deploy them, now someone says: with these three methods we can shrink them, and there is effectively no loss. So you can see this will greatly facilitate implementation on mobile systems, or any system with much reduced resources.

On some of the parts, for example the weight sharing: they use a form of encoding where an index gives access to the shared weights; they explain all these techniques in the paper. For the quantization they use centroids: they take a bunch of weights and try to cluster them into a few levels (indexed three, two, one, zero), then they group the gradients the same way to update each cluster, and ultimately they end up with a small set of fine-tuned centroids.
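To make that weight-sharing step concrete, here is a minimal sketch of the clustering idea, assuming one layer's weights in a NumPy array and scikit-learn's k-means. Note that the paper additionally fine-tunes the shared centroids during retraining by summing gradients per cluster, which this sketch omits, and the function names are just illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(weights, n_clusters=16):
    """Quantize a layer's weights to n_clusters shared centroids (4 bits here)."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=1).fit(flat)
    indices = km.labels_.astype(np.uint8)    # per-weight cluster index (what gets stored)
    codebook = km.cluster_centers_.ravel()   # the shared centroid values
    return indices, codebook

def reconstruct(indices, codebook, shape):
    """Decode: look every index up in the codebook."""
    return codebook[indices].reshape(shape)

w = np.random.randn(256, 64).astype(np.float32)   # stand-in for real trained weights
idx, cb = share_weights(w)
w_hat = reconstruct(idx, cb, w.shape)
print("mean abs quantization error:", float(np.abs(w - w_hat).mean()))
```

Storing 4-bit indices plus a 16-entry codebook instead of 32-bit floats is where the roughly 8x saving on such a layer would come from.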
In a sense it's quite crude; it's not a very elegant treatment, let's say, but ultimately having some sort of clustering like this reduces the weights, and whatever exact method they use, it turns out to be good enough to use afterwards without any loss. From a signal processing point of view this stage is quite striking: from the original weights, all of those values, you end up with just a handful of centroids. It's really dramatic. They also look at different methods for initializing the centroids against the weight distribution, for example uniform or density-based initialization, trying to match where the weights actually lie.

So from a signal processing point of view, I find the paper is really just applying those three techniques, and at the end they get quite a massive reduction in the weights. It's not a major paper theoretically: people were doing neural network research long before; these people apply three known methods and dramatically reduce the size. However, from a performance point of view, they can reduce the weights from, let's say, a gigabyte down to tens of megabytes. A machine that would initially require a lot of memory to implement the neural network now requires only a very small memory footprint. Those compression rates are really significant; this one is the VGG result.

Does this compression rate also imply that the size of the network and the length of the computation are reduced, or is it just storage?

No: they train the neural network at full precision and reduce the weights afterwards. The question is whether, once the weights are quantized, they also spend less time computing the results. Yes, down the line, because some of the weights go from 32 bits to, say, six bits; if you implement that on an FPGA or an ASIC, a multiplication in five or six bits uses fewer resources. However, I don't know whether an implementation on a GPU can really exploit weights in a four-bit format; I'm not exactly sure, because usually when you work in CUDA you have fixed word lengths for your variables. So this really points towards FPGAs or ASICs, which can have a genuinely reduced word length for all your coefficients. Also, your dense matrix becomes a sparse one once you prune, and you can write your code accordingly.

So pruning is removing weights, quantizing reduces the representation of the weights, and the compression is ultimately about how you store and access those weights. If all your weights are Huffman-compressed, or transformed in some way, you still need some mechanism, some module, to decompress them and pick the right one. So the win in terms of space from the compression is there, but ultimately you pay a price in the decompression.
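Since the Huffman stage only gets a gloss here, a toy sketch of the idea on a stream of quantized weight indices may help: frequent indices get short codes, rare ones get long codes. This only shows the prefix-code construction; how the paper actually packs weights and sparse index offsets is more involved, and `huffman_code` is just an illustrative helper name.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code for a symbol stream: returns {symbol: bitstring}."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    tie = len(heap)                          # tiebreaker so tuples never compare dicts
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least frequent subtrees...
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))   # ...merge into one node
        tie += 1
    return heap[0][2]

indices = [0, 0, 0, 1, 0, 2, 0, 1, 3, 0]     # toy stream of quantized-weight indices
code = huffman_code(indices)
bits = sum(len(code[s]) for s in indices)
print(code, "->", bits, "bits vs", 2 * len(indices), "for fixed 2-bit symbols")
```

The skewed symbol frequencies are the whole point: after pruning and quantization the index distribution is heavily skewed, which is what lets Huffman coding buy the final part of the overall rate.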
However, the pruning is a real removal of weights, and the quantization is a real removal of bits used to represent those weights. Those compression-rate figures are pretty big: when something goes from half a gig down to around 11 megabytes, from a platform point of view you can really see the benefit.

The original networks were trained on PCs, so the weights would be 32-bit; you have a huge amount of room for precision. But down the line, the tasks for which those neural networks were trained do not require that much precision. So this can be seen as an achievement in terms of compression, but looked at from an application perspective, it means that these object recognition tasks did not require 32 bits in the first place. So it depends on how you look at it.

And the price of this in terms of hardware becomes much lower.

Yeah, but the full precision is used for the training. So I just wonder whether you could still train such neural networks with that reduced amount of precision. Look at it this way: you have the original data, you train, and you obtain a neural network; if you then prove that you do not need so much precision to actually perform, then ideally you could train a neural network directly with that reduced precision, because the loss is not there; they say no loss in accuracy.

That's the interesting part: if you imagine your brain could be compressed forty times, it would be something. The issue is this: if you have a neural net trained with that amount of data in the weights, and they can show that only a fraction of it is necessary to achieve the same recognition, it means that ultimately you could train a network with that reduced amount of weights and still get the same results; because here we take the very same neural net, and by pruning, quantizing and compressing it we still have the same performance.

His point is that every gradient step you make during training might be too small to be represented at that precision; the update might escape to zero if you keep chopping it.

In training, yes. But when you train your neural net you tend to just use what your machine gives you, which is 32 bits; do you ever say, let's put that at 16 bits and see how it goes?

I think NVIDIA is doing GPUs with 16-bit support now.

Yeah, so 16 bits is fine, apparently; that's basically what you see. So, these are examples in those papers where they look at the different networks and then at the different layers, and they see how much they actually gain by pruning and quantizing; the weight bits become six, four, and so on.
And at the end they end up with those overall rates of around 40x. From this you can infer that training neural networks in floating point already sounds generally wasteful.

The argument tries to say that you could train it with lesser precision, but not with the quantization applied after the fact; the clustering scheme they use, at six bits or whatever, comes afterwards, and the gradient problem doesn't exist there because the training itself was done at full precision.

Yes, and that's what I was saying: if you actually trained at lesser precision, Martin was saying that the lesser precision would interfere with the training, in the sense that either you cannot converge or you take a very, very long time to converge, in the mathematical sense.

The point of these papers is: starting from one trained neural network, how can you reduce its size, not in the training phase but in the application phase, and obtain something smaller that still has the same performance? These papers do not talk about the training; they talk about reducing the weights without a loss of accuracy when you use them. And down the line they emphasize that with a much smaller memory footprint for the weights, you don't need external RAM, you can use internal RAM, and it becomes much easier for mobile phone applications. So the argument is there from that point of view, not from a mathematical point of view or from a neural network theory point of view. For someone doing neural network research, this paper doesn't demonstrate that much; it just demonstrates that you can reduce. There are loads of papers on reducing weights, on initializing the weights well in the first place, and so on. These people are more application- and target-oriented.

Other networks have more layers, so they have, I would say, more room to reduce the weights, and down the line some of them get 35x and so on. The nice thing is that they really try to see the influence of the different reductions on one another, whether it's Huffman plus something, or pruning plus quantization. And since you have access to the tools and to some of those networks, it's not difficult to get the weights yourself, try pruning them, and see what happens. Some of the latest, more complicated networks have even more layers; one of them, with about 137 million weights, gets down to roughly 2% of its size, an improvement of around 49x. In certain domains, reducing the whole weight of a neural network by 49x is quite something. I wouldn't only call it an achievement; I would also say that maybe the representation was hyper-redundant in the first place.

That's the question: how far can you go with shrinking the representation? If you always allow 32 bits during training and never really go down to six bits, then maybe that's the only reason it works.
I guess this is the easy way: you have a huge network already trained, and you squeeze it. With training, you normally start with a rather big step size, which means you also start with relatively low accuracy, so at first you would not notice reduced precision; and then there are adaptive training schedules that decrease the step size, and they could perhaps also increase the precision as they go. So why not?

For this, they also look at the amount of accuracy loss versus the amount of compression, and it's really gradual: if you compress beyond a certain point, you get a gradual degradation. In some of these papers a curve like this doesn't carry much detail, so sometimes it's a bit thin. They compare against SVD, a quite standard, old method. So for me the way they present the data is not always entirely objective; but as a researcher I know that sometimes you want to present data in the way that favors you.

So basically the accuracy of the thing... they should drop the word "loss".

No, no, it doesn't gain accuracy: it's the accuracy they had in the first place, and the more you quantize or the more you reduce, the more of that accuracy you lose. It only goes backwards.

I might believe it could improve; it would be like a regularization effect.

There are some cases of that; my professor has been trying SVD and then tuning further, and then you get improvements.

But this curve here goes from the normal, maximum accuracy downwards. It's not the other way round: you don't improve the accuracy; they have a trained net which is very accurate, and they degrade from there.

Okay, I was wondering if it was the other way around.

Afterwards, there's a lot on how pruning doesn't hurt quantization; they try to see the effect of one method on another, looking at three bits, four bits, seven bits, eight bits. I'm quite surprised, even from a neural network point of view, how you still get good values with only six bits compared to 32; it's really cutting very harshly. I'm not sure they state it explicitly, but usually the baseline is floating point and afterwards you reduce the precision.

But then shouldn't the reference be a 32-bit integer?

Most of the time you train a neural network in floating point, and afterwards you reduce.

Do you really want to train in floating point, which is much more expensive?

It's only more expensive if you have hardware that can actually train in four bits. How do you train in four bits if your architecture, a CPU or a GPU, does not support four-bit arithmetic? You see, it goes the other way: the original is 32 bits because on a GPU or CPU those are the standard word lengths. You could program a six-bit training method on an FPGA, but it's probably easier to train at full precision and reduce later. This is a sort of natural workflow: you have a massive amount of data, you get a result, you chop, chop, chop, you check whether the result is still good, and if it is, you say: this is the amount I can reduce. Interesting, because remember that on your mobile phone and so on, it's going to be an ARM core.
So that will still be a CPU, in a sense.

We said that the claim is that without the GPUs and the supercomputing power of today, we would not be able to train deep learning neural networks. This shows that we could, if we knew the right precision in advance. Down here the only thing they state is accuracy; they never state the time it took them to reach it.

For that, you can time it yourself, because you have the tools: you can time how long it takes to train those networks at full precision. But note that you don't train at the reduced precision: you train at full precision and cut the parts you decide are not useful, and when you use the same network with those parts missing, you still get the same performance. So you conclude that the extra precision was never actually used: instead of the 32 bits you would expect to need, six bits is actually valid.

Technically, we are truncating some of the digits.

In a sense, yes: certain weights will have some truncation where others will not.

So how do you translate, how do you reduce a 32-bit space down to, say, eight bits?

You normalize the weights and afterwards you chop them. Normalize and truncate; otherwise you cannot meaningfully chop 32 bits down to 16. And when you scale them, normalized between zero and one, then a value like 0.5 maps onto one bit pattern, and so on. You would basically specify your range, maybe covering only part of the interval, so it can become non-uniform from a range point of view.

Afterwards it's the classic comparison: CPU versus GPU versus, I guess, the mobile GPU. All the numbers in green are GPU; the pruned version is multiplied by a large factor; everything is higher. This is what you expect from a research paper, where all your results always come out favorable somewhere. In the paper where they present a real hardware system it's the same: the numbers are nice.

Because, I mean, a 40 times improvement...

No, no, it's 40 times smaller; you reduce the size by 40 times. You could imagine that even at 10 times, you could speed up the whole process. I think of it like this: you had metal before, and now you have Kevlar; each protected as much as it could at its time, but then you look at the weight, and metal is heavier while Kevlar is lighter. So you say it's lighter, and also, because it's lighter, my cargo can go faster. I'm trying to make that sort of analogy: the compression itself is not about speed, but the neural network becomes faster, in a sense, if you implement the hardware at four-bit precision. My concern with these papers is, when they talk about GPUs, how do you actually implement those reduced widths on a GPU? The GPU does not offer those bit-width allocations for its integers.
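As a concrete version of the "normalize and chop" answer above, here is a minimal sketch of plain uniform quantization: map the weight range linearly onto 2^bits levels. This is not the paper's k-means codebook method, and the min/max range choice and function names are just illustrative.

```python
import numpy as np

def quantize_uniform(w, bits=6):
    """Map weights onto 2**bits integer levels between their min and max."""
    lo, hi = w.min(), w.max()                 # the chosen range matters a lot
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((w - lo) / scale).astype(np.uint8)   # the 6-bit integer codes
    return q, lo, scale

def dequantize(q, lo, scale):
    return q * scale + lo

w = np.random.randn(1000).astype(np.float32)
q, lo, scale = quantize_uniform(w)
print("max abs error with 6 bits:", float(np.abs(w - dequantize(q, lo, scale)).max()))
```

The range choice is exactly where the non-uniformity mentioned above comes in: clipping to a narrower range spends the six bits on the interval where most weights actually live.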
So somewhere there, there is a gray area: the GPU core is still very classic.

If I continue: in the last paper they show the index and the code word; they look at the code words and the weight indices, and what each needs in terms of percentage of the storage, because what gets encoded is not only the weights but also the indices; they are accessing tables to decompress things. For me this part is not the most interesting part of the paper.

Still, on the results: the real issue here is the top-1 error, in terms of the competitions organized for these neural networks. Even with that reduced amount of weights, they keep the same top-1 performance. So they greatly reduce the weights of the neural network, but they stay at the same top-1 error, meaning you can still recognize the same percentage of cases.

Another paper, from more or less the same team, is from the NIPS conference; compared to the previous venue, I would say NIPS is a higher-caliber conference. This paper is from the year before, and in the same vein it provides quite similar results. For me the interesting thing is that with NIPS you can actually read the reviews of the submission, so you can see what the top people in the field criticized about the paper; you can read it and see what the issues were. There are a couple of typos and so on, but this is the real research process: you get the reviewers' comments.

For which paper?

The NIPS one. I came across the reviews when I tried to download it.

How do they do the pruning, before the Huffman compression?

The pruning is done basically by removing certain connections, retraining, and seeing what happens: you threshold the weights, and the connections with the smallest weights are removed. It's a bit like feature selection: the bigger the weight, the more importance they give it. It's a classic approach.
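A minimal sketch of that magnitude pruning idea, assuming a dense weight matrix in NumPy; the 90% sparsity target is illustrative, and in the papers the pruning is iterative with retraining between rounds, which a single call like this does not capture.

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.9):
    """Return pruned weights and the binary mask of surviving connections."""
    threshold = np.quantile(np.abs(w), sparsity)   # cut everything below this magnitude
    mask = np.abs(w) >= threshold
    return w * mask, mask

w = np.random.randn(512, 512).astype(np.float32)
w_pruned, mask = prune_by_magnitude(w)
print("surviving fraction of weights:", float(mask.mean()))   # ~0.1 of the original

# During retraining one would also apply `mask` to the gradients,
# so that the pruned connections stay at zero.
```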
It's a classic For this one that just the paragraph is very small is just said that we're using a Huffman, etc They don't say much from my point of view when when when you Still still it reduced it further reduced the weights However The it's good to show, you know a huge reduction like this later on if you actually implement it in real You need to do the decoding the you know always decoding because everything is so These are not What maximum brain damage Those method are not I mean they're not Yeah, new that they are actually Before you prune this you actually have another method in front which is initialization on normal network Which has also is another field or so which is which is a Interesting part of this pruning process is that since they have to retrain it it makes the training much much longer than I know Because basically they train it the first time then they prune and train again Yes So I don't think they retrain I don't think they return the prune the connection and And then they test those those those train weight with the prune and I don't think they retrain I don't think they retrain Yeah, I don't think so because this is here that are reduction I don't think they're The iterative part is you prune You test with the weights and then you reprune Reprune the weight the remaining loss to kind of compensate for the fact That So the second paper, I mean the second paper is one of the same but the second paper and sees more into this I mean this is the The really show behind his energy. They said I you know using a GPU you actually consume a lot of energy And the fact that if you actually reduce the access to the DRAM Which costs that much amount of picot joule per etc Then ultimately by by reducing the weights You actually don't have access to the to the external RAM which means that you have internal RAM and it's faster it It makes sense in this it makes sense it makes sense from a from actually you size power Speed you know complexity you can play with all sorts of factors to actually Justified your your your methodology here And then this is the references from this mark or what which has a table of those stands for the VSI group These are It's important because they actually emphasize on the mobility the fact that they want to have it in the mobile Mobile things So they have the same music training before training before printing and then after basically some of those are removed and a street training pipeline Same Yeah, at that time actually it seems to be the same Netrice but she the compression rate seem to be smaller So I think this one only you doesn't use the I think this one only use Yeah, it's only the yeah only the pruning only the pruning maybe So actually maybe you know afterwards they basically add a third the this of man of man idea would The pruning and the quantizing is really sort of logic type of thing or easy to do the Without having any any more things to do at the decoding of the weight Let's say because you basically remove the you know the complete the accuracy of those bad the Huffman Encoding you need the decoder afterwards or some sort of the compressor thing. So these actually are less Yeah, yeah, yeah, I mean it's like you know over the years you improve your method and and you have more more So they use the same same neural network and end up with 16 here They're also looking at the amount of weights and also the computation. 
So they really end up with two metrics to assess how good they are. They also look per layer: there is a representation of the amount of weights that remain versus the amount they can prune, and you can see that the convolutional part is hardly pruned at all, while all the others are. So certain layers are more amenable to pruning than others; they don't really draw any conclusion from it, but down the line they report those numbers. Then come the same curves as before, with the accuracy loss; it's a somewhat repetitive exercise. And, as we said, they look at the centroids and try to cater for the specific range of the coefficients. Overall, I find that naive cut, SVD and the rest are the standard methods, the prior art; and theirs is actually not better than those on the accuracy side, but in terms of reduction it is higher. So as always, you gain on one side; you cannot win on all of them.

From a hardware point of view, there is also a paper from this group in which they design an efficient inference engine; this is really going into hardware. These papers have a lot in common; if you're doing research, remember that every six months, when you write a paper, you can reuse that same table, because basically nothing has changed; and if you put in that reference from a Stanford VLSI professor... I'm not sure if this is being recorded; I don't want to be too cynical. It's the same game here: the same neural nets, just faster, faster, faster. However, this time they really design an ASIC.

So here you go into the real hardware. When they look at the model they say: we are going to reduce things severely. They end up with a processing element, which they later define like this: there is a leading-non-zero detection node, and north, south, east, west connections; you end up with a sort of mesh. This becomes a real design. When you see the word ASIC, you realize there is money invested, because I think those companies do this sort of pre-analysis of what could be a future component that could be sold to whoever. These are pre-studies. At the moment, anybody can buy a GPU card and do all sorts of things; here you move past that barrier of the GPU, which is accessible to everybody, into the space of having your own ASIC. So if you have a paper with a die picture in it, you realize the work is well funded in the first place; and look at the technology, 45 nanometres here, which is quite a good process.

I understand that a place like Stanford has contracted services for this kind of fabrication.

I think the fabs probably compete to implement some of their stuff, for the publicity and the experience, and to poach those students to come and work for them. And down the line it's interesting: this is not a single ASIC that they produced; they have more than one project and more than one ASIC that they send out regularly.
For me, these are pilot studies in a sense; they are contracted for many pilots.

I understand that the problem there is to gather enough contracts to make the foundry interested.

Yeah, if you are in an environment where there is a lot of soft money for high-tech industry doing that kind of thing, which deep learning is; it's all this new technology. Tesla is using this sort of thing for some of their pedestrian detection, and they refer to it in the paper. You can see that there are a couple of billions available to spend on such designs; it depends on the scale of what you do with it.

So here you have the CPU, which is the reference; afterwards GPU dense and GPU compressed, where dense and compressed are the two regimes we discussed; mGPU, I think, is the mobile GPU, the Tegra; and here is their system, and of course their system multiplies everything, up to the 24,000x figure. Those numbers come from the amount of money and the amount of design effort that goes in. The CPU is the baseline; but all of these others were not designed to run neural networks in the first place, or rather, they can be adapted to neural networks, but they were not specifically designed for it, whereas this is a customized machine for that particular task.

Like the one Alex has downstairs, at 119,000.

You give them a phone call.

How do you reproduce such research, if you wanted to buy one?

You cannot, because this is their own design; you can reproduce the GPU numbers, but theirs you cannot really check. And this is not a big machine: the normal CPU, the GPU Titan, the Tegra, and then their chip. There is also theoretical time versus actual time, so this probably comes from a simulator; with that they can assess, or at least estimate, whether it's worth making the chip in the first place; if those numbers were not good, they wouldn't build it. Afterwards, because you have a very highly specialized machine, which consumes far less than the GPU and far, far less than the CPU, you can also explore scaling: you can scale, say, the width of your memory, and play with the precision and so on.

Down the line it's also about the speed-up: for a particular neural network you have one processing element, and then you scale, scale, scale, and you can see that everything scales linearly. So the more chips you throw at it, the better it gets.

If I may interrupt: this is exactly why you want a custom design, to produce graphs like this. But this targets one particular task; it cannot run anything else.

But I understand that what neural network inference is, basically, is multiplication, addition, and do it again; so it's a very simple vector engine, and this takes advantage of the parallelism.

And if you do any kind of addition and multiplication on matrices, you should be able to reuse it.

Yeah, but compare your CPU and your GPU: if you look at your GPU, it has many more primitives, in the sense of being able to do more things.

Possibly.
I mean, there are a lot of things you can do with matrix multiplication and addition, especially sparse matrices.

The point of such a paper is really to show that the higher the number of processing elements, and the more specialized those processors are, the better and better everything gets. These papers also look at the memory side: the DRAM and so on tends to be external RAM, whereas the SRAM tends to be the faster RAM. The static RAM is much, much faster, but the other is external; so here they integrate everything on the chip, basically everything.

And that's probably most of the chip surface. This one, DianNao, is the Chinese one, no? Is it Baidu, or one of the Chinese companies, that also tries to do some of this with an ASIC?

I read a couple of papers; I assume this is Baidu. I don't think it's anyone else.

I mean, it's not entirely fair to compare, but if you think about your processor, most of the surface of the chip is cache, not the cores.

That's exactly what I'm saying. All of these are not comparable from a task point of view: they were not designed for this. But from an accessibility point of view, you can buy this CPU, you can buy this GPU, you can buy an FPGA; up to there, it's accessible; this last one is not accessible to the normal person. But the energy efficiency numbers demonstrate that it is very efficient, and it leads to the conclusion that ultimately you will have an ASIC in your watch doing all sorts of pedestrian recognition and who knows what, based on this. These papers present things in a light that says this is viable for production later on; for me these are really pre-studies to demonstrate the case and push the message that deep learning will be used. One interesting thing here is the one-bit fixed-point entry; I'm not sure about that one bit. Elsewhere it's four bits, which I understand, but there might be something specific going on with the one bit; I haven't checked that paper yet.

The last paper is from basically the same people plus DeepScale; suddenly you have people from DeepScale among the authors. You can actually download this SqueezeNet from GitHub, and you really get half a meg of trained neural network weights. And this is still smaller than, as they put it, AlexNet, which was the original network winning the competition. These papers are quite similar. Among other things, they replace the 3x3 filters by 1x1 filters. A 1x1 filter, for me, is a multiplication; it's not a filter anymore. But this is the language of the research field; it's still called a filter. And they do some sensitivity analysis. I can go through the slides, but they are more or less the same; it's the same flavor.
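For the 1x1-filter point, here is a short PyTorch sketch of the Fire module that SqueezeNet is built from, as I understand it from the paper: a 1x1 "squeeze" layer cuts the channel count before a mixed 1x1/3x3 "expand" stage, which is where most of the parameter saving comes from. The channel sizes below are meant to match an early Fire layer, but treat them as illustrative.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)       # 1x1 bottleneck
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # concatenate the two expand branches along the channel axis
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

f = Fire(96, 16, 64)    # squeeze 96 channels down to 16 before expanding
print("parameters in one Fire module:", sum(p.numel() for p in f.parameters()))
```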
But this is the most recent paper they have, and they really squeeze it to half a meg. Half a meg becomes really interesting, because even on an FPGA you can really start to make it happen. They look at the different Fire layers and so on, and they reduce certain parts with those 1x1 convolutions. They look at the input image and its size, so many layers, the different types of operators, and in the end, from 1.2 million parameters they get to around 400K.

When they detail all these reductions, it's not clear to me: how much of this is automatic analysis, and how much is fine tuning without an overarching method or logic?

The application of these methods is not really formal. From a mathematical point of view, quantization is a mathematical concept, but the aim here is really just to shrink one very specific network. The idea is that these networks are produced for worldwide competitions, and the winners are the people who developed the best network; afterwards you stand on the shoulders of those people and say: that network you designed, I can reduce its size and still show that the accuracy, when using it in a real scenario, is the same. There is no loss in accuracy, but the size is greatly reduced.

And there is no acceptable loss? They don't accept even one percent?

No. And they don't improve on the accuracy of the network either. They reduce the weights up to the point where the accuracy stays the same, with no loss. If they reduced even more, they would start to lose accuracy relative to the original network, and then it's no good. So the achievement here is weight reduction.

I wish I could have some weight reduction methods myself. These are already trained?

Yes, they are trained on computers, and then they just compress them. They are trained on the data sets shown at the beginning, and you can get the networks either as a configuration file or directly as the weights. If you want to implement one in your own C code, you basically dump the weights in a certain format and you can run it on your machine. A lot of it runs on a GPU, because you want to accelerate things. But the nice thing about all this research is that it really is computer science in the sense that everything is shared and you can reproduce it; it is reproducible research, short of having the hardware in the first place.

So it's the same story here, from 4.8 down to 0.4 megabytes; these are numbers that are embedded-system-like, so ultimately this could run on an embedded system. And that is basically my presentation, or my tutorial, on this. The interesting thing for me is that if I take one of these, it fits into an FPGA; not comfortably, but it fits.
If I get the configuration of that neural network, and I know the precision of each of the layers and so on, I can in theory develop it on an FPGA much more easily, because those weights are minimized in size, but the accuracy compared to the full-blown network is the same. That is the important thing: you are not losing accuracy. So if you had a PC running with four gigs of weights, trying to detect a pedestrian on the road with a camera, the same neural network with the reduced weights, on an embedded system with a camera, should achieve the same accuracy. It's a nice example. And it could be efficient in terms of power, in terms of processing and so on, because as soon as you prune, you reduce the number of multiplications. When you reduce the weight precision, though, you need customized hardware to benefit from it: if you have a 32-bit multiplier on your hardware, then whether your coefficients are six bits or seven bits, those registers stay just as big. Some of the FPGA cards I have at work have thousands of DSP slices, blocks that can do a multiply in one clock; with architectures like that it becomes logical to implement some of this, everything in parallel in a sense.

So ultimately, let me hand over to Martin, if you give me five or ten minutes.

So that was the theory of the papers; not by me, he was only reporting. And this is the practitioner; this is the real thing. (No, this is not my laptop, and I have nothing to show, so I'll just wave my hands around.)

So I saw these papers and I thought they were awesome, back when they came out, and I thought: well, another big glob of data that we have when doing this neural network stuff, in natural language, is the word embedding. I'm assuming people are familiar with what a word embedding is, because it comes up a lot. Basically, you run across a huge corpus of text and look at which words come next to each other, and by doing this repeatedly and refining a model, you can get something that gives you word similarity: you can say that apple is near orange, less near turnip, and quite a long way from concrete. There is a whole variety of word similarity tasks you can do. You can also do something called word analogy, where you do man is to woman as king is to who?, and by looking at the geometry of the space you can find out it's queen.

For one of these things, a typical vocabulary could be 200,000 words, in a 300-dimensional space, and you'd store all of it at 32 bits, because that's what everyone does; so you've got a block of half a gig on disk which you load in at the beginning. So when I saw this I thought: that could be compressed, surely. And what I did boils down to the centroid thing: we can compress the 32 bits down to three bits. So there's a 10x factor in there, and you get the same basic scores; when you uncompress it, you get the same metrics you began with. In a way that's unsurprising. So I submitted that to ICLR; except that there was an additional step.
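A minimal sketch of that centroid compression on an embedding matrix, assuming scikit-learn's k-means over the scalar values: eight centroids means three bits per value instead of 32. The vocabulary here is kept small for speed, the data is random, and the subsampled fit is just a practical shortcut, so treat all of it as illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

vocab, dim = 10_000, 300          # the talk quotes 200k words; smaller here for speed
emb = np.random.randn(vocab, dim).astype(np.float32)

# Fit 8 centroids (3 bits) on a subsample of all scalar values,
# then code every value by its nearest centroid.
sample = np.random.choice(emb.ravel(), 50_000, replace=False).reshape(-1, 1)
km = KMeans(n_clusters=8, n_init=1).fit(sample)
codes = km.predict(emb.reshape(-1, 1)).reshape(vocab, dim).astype(np.uint8)
centroids = km.cluster_centers_.ravel()

emb_hat = centroids[codes]        # the "uncompressed" embeddings used for evaluation
print("32 bits -> 3 bits per value, reconstruction MSE:",
      float(((emb - emb_hat) ** 2).mean()))
```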
What this is: by doing this kind of Lloyd-style centroid compression, you're really doing a PNG. You're taking an image and compressing it like crazy, then you expand it and you've got the same thing back. But if you think about JPEGs, you're getting compression because you understand the visual processing. PNG is a purely mathematical thing; JPEGs are slightly lossy, but they work because they understand how your visual cortex interprets the color map and so on.

So what I really wanted to do, and didn't get done in time for ICLR (I've now submitted it to a Japanese conference), is this. Since I can now take these 300 reals down to three bits each, that's 900 bits; so if I treat the 900 bits as a budget of how much data I can have for each vector, what else could I store in 900 bits? One clear thing you want to do with words, and this is a psychological observation, is that when you have a word, people will associate 10 or 20 or 50 different positive aspects with it. Take a dog: it will make noises, it will be a pet, it will be furry, it will be obedient, it will be friendly, a bunch of different things, it will be canine. But you don't say: well, a dog does not have wheels; I mean, that's not the one thing I know about dogs. So people remember positive, sparse aspects of words. Wouldn't it be better, rather than this dense vector of 300 reals, to have positive sparse impressions?

If you think about it: suppose I have 1024 different positive aspects, which takes 10 bits to address, and I say I can do with three bits of resolution on each of them; that's 13 bits per element, and I've got 900 to spend, so I can have roughly 60 sparse elements within this 1024-long string. Looked at the other way: suppose I had a string which is all zeros apart from 60 ones; I can encode it by storing the address and a weight for each active element; and depending on the order in which I store them, I can store the first one at high resolution and then step the stored amount down, because in sorted order each one is at most the previous one. So you can compress this thing into a sparse code.

And the nice thing about this is: yes, it does work. Basically you set up a sparse matrix and a big dictionary matrix that multiplies it up to give the original embedding matrix, and you gradually optimize your sparse codes and your dictionary so that the product keeps matching the target; it's quite easy to measure that this gives you the same results. But the interesting question is what happens if you then throw away the dictionary, so you no longer have any of that dense data; you just have the sparse stuff. What have we got? The interesting thing is that this also works: with all of the dense data thrown away, the sparse codes alone still know that apple and orange are similar; the word similarity scores stay the same.

And there is a nice by-product: the sparse vectors actually mean something. In the dense array, each column doesn't really mean anything; it's some direction in a really weird 300-dimensional space; who knows what the directions mean; the space is essentially symmetrical. With the sparse code, because everything is zero apart from a few entries, when you look at which words respond to one particular direction, you can see it: oh, this one is about vehicles.
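Here is a small sketch of that budget arithmetic: packing roughly 60 active elements out of 1024 into 900 bits as (address, level) pairs. The encoding below simplifies the scheme described above, using a single 3-bit level per element instead of the ordered, stepped-down resolution, and `encode_sparse` is a hypothetical helper, not the paper's code.

```python
import numpy as np

N_ASPECTS, K, ADDR_BITS, WEIGHT_BITS = 1024, 60, 10, 3

def encode_sparse(v):
    """Keep the K largest positive entries as (10-bit index, 3-bit level) pairs."""
    idx = np.argsort(v)[-K:]                              # top-K indices
    levels = np.clip(np.round(v[idx] / v[idx].max() * (2**WEIGHT_BITS - 1)),
                     0, 2**WEIGHT_BITS - 1).astype(np.uint8)
    return idx.astype(np.uint16), levels

v = np.maximum(np.random.randn(N_ASPECTS), 0)             # toy positive-sparse vector
idx, levels = encode_sparse(v)
print("budget used:", K * (ADDR_BITS + WEIGHT_BITS), "bits per word vector")
```

At 60 x (10 + 3) = 780 bits this sits inside the 900-bit budget, which leaves some room for the ordering tricks mentioned above.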
This one's about age, this one's about irresponsibility. If you look at the word motorbike, you'll see that it's a combination of vehicles and age and lawbreaking and racing. And this comes naturally out of just reading, in my case, Wikipedia, again and again and again; there's no labeling going on, but you can pick out these nice word clusters.

This is "to be continued": the one thing it doesn't do, or only does badly, is the word analogy task, the man : woman :: king : queen thing. The way it's done in a big vector space is you say queen is king minus man plus woman, basically; it's vector addition, geometry rather than anything else. And that geometry is all messed up when you take these positive sparse vectors, because the minus sign no longer works. So what I'm now going to be thinking about is that maybe it's more like Boolean operations, or set operations, that you have to consider. That's my next salami slice of this paper: how to make the geometry part work.

But then there's another possibility. Given that I can extract all of these really juicy directions, maybe there are 4,000 directions; that's only 12 bits. There are some interesting techniques for sparsification which have been developed and GPU-ized and so on. So maybe we could use those directions as the kernel, the basis, for the next iteration through Wikipedia, to generate the next set of features. Maybe you could produce WordNet from nothing. Anyway, that's the way I'm going.

There are quite a few applications where having just the positive representation is enough.

Yeah. I thought it was interesting, because you might think you're already looking at the top positive markers, or the top positive and negative ones; but for the dense embeddings it's all positive and all negative anyway, because you can rotate the whole space in any of the 300 dimensions and it's still the same thing.

So these new embeddings that you have, they're not binary, right?

No. Well, I haven't tried much, but if I try binarizing it, it gets worse. So there is some value in having the slightly different scales. But I haven't really worked much on that; I was really working within a bit budget: could I, with the same bit budget, give myself something which told me more information? It's like a minimum description length idea, apparently, yes.

Where did you submit it?

Of course they'll probably say no; it's one called ICONIP, in Japan.

When do you plan to have an answer, so you can refer to the paper itself?

Wouldn't that be presumptuous? But if it iterates again, it'll probably be at ICLR next year. That's the idea, because I tried ICLR and they were quite enthusiastic about the idea of compressing these embeddings, but I had this bigger vision of learning something from the sparsification, which they probably thought could never work.

Could you learn the sparse embeddings directly?

Well, I mean, you could have the sparse codes and then train against your objective. But note that when you take the original matrix of word incidence, it's already sparse; it's just vast and sparse.
So this is folding it: in a way, the dense embedding is an interesting folding, in that the data can be folded down so much. But here I can take any embedding, compress it, and get these words, these clusters, coming out, which is really interesting. Anyway, may I have questions?

One: what is the ratio of compression, in terms of the sparse matrix?

Okay. So if I have 1024 aspects, I can afford about 60 active ones in order to stay within the 900 bits; that's roughly 6%. If it's 4096, then I'm down to about 1.2% sparse, something like that. So these things are very sparse, and this is one of the reasons why the similarity becomes difficult: if you've got two things which are somewhat different, they can end up with no overlap whatsoever. So I was somewhat reticent about pushing the similarity results too hard; maybe it needs to go through some kind of fuzzing process before I can do proper similarity. But I like the idea of it being sparse.

And the second question I have is: what is your sparsification method?

Okay. So this is an extension of a NIPS paper from last year. What people typically do to make things sparse is calculate all the coefficients and then try to drive things to zero, or they impose an L1 constraint or something, which tends to do that. But the problem is that you can't really control how sparse it ends up after you do that; it may be that you end up, say, 25% sparse, which completely blows my budget constraint. The neat way to do it is related to dropout: dropout imposes random zeros across your connections, and it works pretty well. What this method does is essentially say: I only want the top k percent of my connections to be live; so I pick the top k and zero out everything else, and the weights then compete to stay in.

Now, an exact top k percent is not so easy, because you don't know the scale in advance; this is one of the places where the GPU does not cooperate, because the actual algorithm wants to do a sort. I think that's why they didn't bother continuing with that approach: you need a sort to do a proper k percent. But if you're willing to accept "about k percent", particularly since I'm doing this a hundred thousand times, about k percent is good enough, because on average some will be higher and some lower. You can just repeatedly guess a level, and then it becomes a rate you can binary search on, and about five binary search steps give me good enough.

So there's a paper on this, like selective dropout based on saturation? Is that what you're saying?

Yes, it's called winner-take-all autoencoders. But they could only do it on a CPU, so I did the GPU version.

I guess if you used the median to get the top k percent, you could get pretty close, right?

Sorry, I'll digress: one of the interesting things is the distribution of the weights going into this.
You'd say it's going to be normal, right? So if I've got that, I can pick the median and I can pick the extremes, and then I can roughly tell where my sparsity threshold is supposed to be, because the distribution is going to be normal. What happens is that, because these things are competing to get into the bucket, the ones just below the cut are competing to get in, to be considered at all, otherwise they're zero. And what happens is that the distribution changes completely, so your assumption about the extrema is completely wrong by about the third iteration, and your whole GPU experiment just dies, because none of these assumptions hold anymore. So binary search ended up being the way to do it. These are the kinds of highly technical issues involved.
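To close, here is a minimal sketch of that threshold binary search for an approximate top-k% winner-take-all mask; `topk_mask_approx` is a hypothetical name, and the five iterations and 5% target mirror the numbers mentioned in the talk. Nothing here assumes a normal distribution, which is exactly why it survives the distribution shift described above; it uses only comparisons and means, the kind of operations a GPU is happy with.

```python
import numpy as np

def topk_mask_approx(x, k_frac=0.05, iters=5):
    """Binary-search a threshold so roughly k_frac of |x| survives (no sort)."""
    a = np.abs(x)
    lo, hi = a.min(), a.max()
    for _ in range(iters):
        t = (lo + hi) / 2
        frac = (a >= t).mean()        # fraction currently surviving the cut
        if frac > k_frac:
            lo = t                    # too many survivors: raise the threshold
        else:
            hi = t                    # too few survivors: lower the threshold
    return a >= (lo + hi) / 2

x = np.random.randn(100_000)
mask = topk_mask_approx(x)
print("target 5%, got:", float(mask.mean()))
```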