All right, thank you all for coming, and thank you to the organizers for inviting me. My name is Sumod; I lead a group doing computer vision and machine learning R&D at Soliton Technologies, based out of Bangalore. So, jumping right in. I have this habit of acknowledging certain things, places, or people, and this time I want to acknowledge the libraries that have helped me over the years: at IIT Delhi, Clemson, Fremont, and a bunch of other places, I guess. Taxpayer money doing a good job there, I hope.

First of all, what is new here? Why did this revolution happen in the first place? Neural networks and backpropagation have been around since the 1980s, and people tried a lot of things in the interim, but we did not get the stellar results we see now. So the first question is: what happened in the last five years or so that lets us get such results?

Before that, the big picture of the talk itself. This is more of a motivational talk, in the sense that people, myself included, sometimes tend to use neural networks or CNNs as black boxes. Sometimes that is okay; for certain kinds of problems you can use them as black boxes and get away with it. But in other cases that might not work, and you want to understand why the network is doing what it is doing, because if you know why, you know how to adjust it to make it do what you want, as opposed to it doing something random for you. So the big picture is this: we want to understand some of the whys, look at works that have tried to answer those whys, and also ask how we could make these networks more interpretable. If we ask the whys, the networks we train will be far more amenable to many purposes, and they can possibly take us to newer places than we thought we could reach.
Now, a second point I would like to add up front: I am not a mathematician by training, so I envy people like Tim and others who are much more knowledgeable in mathematics. If I say things that are not correct, please do correct me; I am definitely going into areas that are not my strength. With that disclaimer, I will start.

So, convolutional networks. Why did they happen in the first place, and why did they make a difference? What are these networks? The gist is that from the 1950s onwards we had some idea of how brains work. For example, we learned that certain neurons in our visual system are essentially edge detectors. What is an edge? In an image, a sharp variation in intensity or color, say from dark to white, is an edge. Neuroscientists doing experiments on monkeys (or chimpanzees, I forget) and on cats learned that some of these neurons act as edge detectors. We also knew that there are billions of neurons in our brain, that they all connect up, and over time we learned that they seem to have a hierarchy to them as well. And we knew that they connect through what are called synapses, connections between the layers.

Now, the networks we then built are only loosely biologically plausible, meaning they are not exactly replicating biology. The way neurons work in the brain is extremely complicated and we still know very little about it; there are models such as spiking neuron models, which are very complicated, and we still do not know how to train them well. So people said: let us forget the complicated models and simplify as much as we can, so that we can try to reverse-engineer the brain. The simplification was: you have neurons, you have connections, and there is a hierarchy of some sort. Those were the initial ideas people worked with.

With that, in the 80s, they got relatively good results. What I mean is: you might think the first autonomous cars happened only now, but that is not quite true. There were restricted versions, full-size cars, but working under very constrained conditions. CMU had a system called ALVINN that could drive around the CMU campus, but it could not handle new roads or many other situations. So these networks could do certain things, but what we learned was that beyond about two or three hidden layers, we could not train them. That was the big problem people had in the 80s, and we did not know a good way of solving it.
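As a quick aside, here is a minimal sketch of the edge-detector idea from the neuroscience story above: a convolution with a simple horizontal-difference kernel responds strongly exactly where the intensity changes. The image and kernel here are made up purely for illustration.

```python
import numpy as np

# Tiny grayscale "image": dark region on the left, bright on the right.
image = np.zeros((5, 8))
image[:, 4:] = 1.0

# A simple horizontal-difference kernel acts as a vertical-edge detector.
kernel = np.array([[-1.0, 1.0]])

def conv2d_valid(img, k):
    """Plain 'valid' 2-D correlation, just to show the mechanics."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

response = conv2d_valid(image, kernel)
print(response)  # non-zero only at the dark-to-bright transition (the edge)
```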
Now, what happened in the last few years? Here is my perspective on why we have been able to solve some of these problems. Let me give the big picture first. Most of you probably know the benchmark datasets, things like the PASCAL Visual Object Classes challenge (PASCAL VOC) and, later, ImageNet, with a thousand-plus categories of objects. Accuracy there is usually measured as "top-5": you are given an image of an object, and you are counted correct if the true class appears among your five highest-ranked predictions. So if I give you an image of a cat and "cat" is somewhere in your first five predictions, you got it right. That top-5 accuracy was quite low around 2010 or 2011. Then, around 2012, the AlexNet paper came out with substantial improvements, and over time we have reached roughly 90 to 95 percent top-5 accuracy on these datasets. That is unheard of in computer vision; you never see such huge leaps within a couple of years.

So what changed? One thing, which I am calling specialization of layers. Initially you just looked at generic neurons: each has some activation function, sigmoidal for example, a bunch of inputs, and the output after the activation goes on to the next neuron, and so on. What people figured out is that if you specialize these layers, and some of the names you will recall are convolution layers, pooling layers, and so on, some of which again have some biological plausibility, you get better results. For example, a stack of convolutions together with pooling was found, experimentally, to give good results. There were also a few things in between that did not stick, like layer-wise pre-training, which people do not use so much nowadays; a few communities such as NLP still use it, but in vision it is less common. So specialization of layers is one thing.

The second is the activation function called ReLU, which I have right here. Before this, the activations we used were sigmoidal ones, the blue curve you see here. The reasoning behind the sigmoid was that you wanted the activation function to be nonlinear so the network could be more expressive, which is the term we use. If you just stack up linear layers, all they can learn is a linear combination, nothing fancy, essentially a hierarchical linear separator, an SVM of sorts. That is why you wanted nonlinearity, and the sigmoid has some nice properties, so people used it in the 80s and 90s. The new thing that came up is this ReLU, which is the red curve here.
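As a quick side note, here is a minimal sketch of the top-5 metric described a moment ago; the class scores are made up for illustration.

```python
import numpy as np

def top5_correct(scores, true_label):
    """Return True if true_label is among the 5 highest-scoring classes."""
    top5 = np.argsort(scores)[::-1][:5]
    return true_label in top5

# Made-up scores over 10 classes for one image; suppose class 3 is "cat".
scores = np.array([0.01, 0.05, 0.02, 0.20, 0.30, 0.15, 0.10, 0.07, 0.06, 0.04])
print(top5_correct(scores, true_label=3))  # True: class 3 is in the top 5
```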
Now, sure, what is the question? ... Ah, interesting. So Tim is pointing out that this is essentially the perceptron, going back to the 60s or even the 50s. Yes. And then there was the well-known critique by Minsky and Papert around 1969 arguing that perceptrons cannot do certain things, and that was the accepted view at the time.

So what happened afterwards is that you have this ReLU-based activation, which has a single point of nonlinearity; on one side it is constant at zero, and on the other side it is just a linear function. And what people found is that it actually gives good performance. We think there are two reasons. One: with sigmoids, when you stack up too many layers, the gradients you are propagating either vanish or explode; they either shrink all the way to zero or blow up, and after that no learning happens. What you want is for the network to have room to learn even when you add more layers, and the ReLU helps there because it behaves linearly beyond the threshold. Two: ReLU gradients are very inexpensive to compute; you essentially just check the sign and store that, so it is very cheap. Put those two things together and you can have a larger number of layers that still learn: the gradients do not explode or vanish as badly, and because the operation is so cheap, the computational cost of stacking more layers is not too high either.

The third difference is the data representation. Until that point, in most machine learning algorithms, and particularly in computer vision, you would take an m-by-n image and flatten it into a vector, one row at a time, stretched out. When you do that, the spatial information is lost: the fact that one pixel is close to another pixel on the next row, for example, is thrown away. What these networks did instead was to use a representation that keeps all of that together, what is called a tensor representation. Tensors here are just higher-dimensional generalizations: matrices are two-dimensional, and if you extend them to more dimensions, that is essentially what a tensor is. Tensors have their own mathematics, what you can do with them, and a whole bunch of other things, but when I say tensor representation here, all I mean is that the data is represented as tensors; these networks are not really using the mathematics of tensors. There are a few papers on using the actual mathematics, but that work is still a little nascent, though there are a few interesting results.
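Stepping back to the vanishing-gradient point for a second, here is a minimal numerical sketch of what happens to a backpropagated gradient through many saturating sigmoid units versus many ReLU units. It deliberately ignores the weight matrices, which also matter, and the pre-activation value is made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth = 30
x = 2.0          # a pre-activation value in the saturating region of the sigmoid
grad_sigmoid = 1.0
grad_relu = 1.0

for _ in range(depth):
    # Multiply in the local derivative of each activation, as backprop would.
    grad_sigmoid *= sigmoid(x) * (1.0 - sigmoid(x))  # at most 0.25, shrinks fast
    grad_relu *= 1.0 if x > 0 else 0.0               # exactly 1 for active units

print(grad_sigmoid)  # vanishingly small after 30 layers
print(grad_relu)     # still 1.0
```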
The fourth thing is GPGPUs: having much more powerful hardware that is very good at exactly the kind of numerical computation we need definitely helped. And finally, the availability of libraries with a GPGPU back end and a tensor-based representation up front helped as well. All of this together gave us much better performance.

So the next question is: why do we need theory at all? If you are building a house, putting up a bunch of bricks, you probably do not need a lot of theory. But if you are building a skyscraper, you definitely need theory, because only then do you understand the structure of the problem, where you can extend things and where you cannot. One important caveat: if you are just starting out in neural networks, or any area really, I would say take a hacker approach. By hacker I mean something slightly different from the usual sense: someone who experiments with things to understand what is happening and to keep their motivation going. You start as an amateur, try things out, and then start mastering them. If you are starting out, one path would be to first learn linear algebra well, then statistics, then machine learning, and so on; that is one progression. The other is to take one problem and try to solve it with whatever techniques you find, even as black boxes, and then keep teasing it apart, saying "okay, I do not understand this part, how does it work?", digging deeper as you go. If you are tackling this for the first time, I suggest the latter, so that your motivation is not destroyed immediately. If you are one of those few people who are happy learning systematically from the basics, wonderful, take the former approach. This talk is about the part of the journey from journeyman to master, so to speak.

The other point is about experiments. One thing we commonly hear nowadays, in blogs and from people doing Kaggle and so on, is: run a large number of experiments, try lots of different parameters, see which model works best, and pick that one. That approach does work; random search is genuinely powerful.
Here is just one example of that. The gist is: say I want to find the peak of the green curve and of the red-and-yellow curve shown here; I do not know where those peaks are beforehand, I can only sample. If I sample on a regular grid, I will probably land on points like these, which are not very interesting. Whereas if I sample randomly, I have a somewhat better chance of hitting some of the interesting peaks. There are some quantization effects here that I will not go into, but the big picture is that random search really does have power. At the same time, you want to apply some care along with your random search, and if you know the theory and you have prior knowledge, you can connect all of it together and do a lot more than you can with a purely random search. There is a nice quote to the effect that people who actually work hands-on on something have a much better intuition of where it all might lead than people who are watching from outside.

Now, coming to deep learning itself: why does theory matter here? I will talk largely from an image perspective, but much of this applies to non-image data as well, speech or natural language processing and so on. The first point is that the original data lives in a very high-dimensional space. A thousand-by-thousand-pixel image is about a one-million-dimensional space in which we are looking for patterns, and the ten thousand samples you might have are very, very tiny in that million-dimensional space. So, first and foremost, the amount of data needed for straightforward interpolation is far higher than what you actually have.

The second point comes from a very interesting paper from Google by Christian Szegedy and colleagues. What they did was take images and add minor perturbations to them. The algorithm by default is working well, giving you, say, 95 percent top-5 accuracy, but after the perturbation, even though the image looks almost exactly like the original (do you all agree?), the algorithm thinks it is an ostrich, with very high confidence. And that is true for all of these images: with these minor perturbations, images that look unchanged to you are classified as ostriches with very high confidence. These were adversarial examples: you look at the algorithm and work out which values you need to twiddle in order to make it give high confidence for "ostrich". This tells you something important about the algorithm: it can sometimes give you answers that are completely nonsensical, and you have to be careful about that. That is another major thing we would like to address.
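Here is a minimal sketch of the intuition behind that, not the actual construction from the Szegedy et al. paper: in high dimensions, a tiny, carefully aligned change to every coordinate can move a linear score a lot. Everything below (the dimension, the weights, the epsilon) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up linear "classifier" in a high-dimensional space: score = w . x
d = 100_000                      # think of a flattened high-resolution image
w = rng.normal(size=d) / np.sqrt(d)
x = rng.normal(size=d)           # the original "image"

eps = 0.01                       # tiny per-pixel change, invisible to the eye
x_adv = x + eps * np.sign(w)     # nudge every pixel slightly in w's direction

print(np.max(np.abs(x_adv - x)))  # 0.01: each pixel barely moved
print(w @ x, w @ x_adv)           # but the score shifts by eps * ||w||_1,
                                  # which grows with the dimension
```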
Another issue is the explosion of choices when you are training: the parameters, the models, and the architectures you need to try. If you just look at arXiv, the number of ideas and papers that have come up in the last two or three years is phenomenal. How do you choose which architecture to try? By this afternoon there may already be a new architecture published on arXiv. The next question: did your algorithm actually converge, or did it get stuck at a saddle point? Then there is data augmentation: what kind of data do you augment with? What I mean by augmentation is this: say I am doing handwritten character recognition and I have a handwritten "5". One of the things you do to make the algorithm perform well is augment the data: you stretch the examples, you shear them, you apply different kinds of transformations under which you would still consider the result the same class. By adding a lot more data this way, you hope the algorithm will do better. But which augmentations should you do? There is a very large number of possibilities. The same goes for the number of layers, the kinds of activation functions, the optimization algorithms; the choices are far too many. And in some cases, here is an example I took from the TensorFlow Playground: on the same data I trained two different classifiers, and their errors, as you can see, are slightly different. In one case there is a connection here, while in the other there is a connection to the blue unit there, and I might get very different behaviour from these two models even though they have almost the same training and testing error. In this particular case we can visualize it, but in real cases we usually cannot, so how do I know which situation I am in? All of these complexities are there when you are training a model, and there are even more of them; there are far more combinations than we have time for in a real-world application.
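Going back to the augmentation question above, here is a minimal sketch of one such transformation, a horizontal shear applied to a small image; the "digit" here is just a crude stroke standing in for a handwritten "5".

```python
import numpy as np

def shear_x(img, slope):
    """Shear horizontally: shift row r by round(slope * r) pixels.
    np.roll wraps around; that is fine for small shears of a centred stroke."""
    out = np.zeros_like(img)
    for r in range(img.shape[0]):
        out[r] = np.roll(img[r], int(round(slope * r)))
    return out

digit = np.zeros((8, 8))
digit[1:7, 3:5] = 1.0            # a crude vertical stroke standing in for a "5"

# Four sheared variants, all of which we would still label as the same class.
augmented = [shear_x(digit, s) for s in (-0.3, -0.15, 0.15, 0.3)]
```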
Now, on Kaggle and other competition forums, one thing is that your metrics do not change; you measure accuracy in one fixed way. But if you are solving a business problem, your metrics keep changing, because the person who asked you to solve the problem has some idea of what they want, and as you go forward you realize the problem has shifted slightly and you want to measure it in a slightly different way. For example, say I am going from here to Koramangala. I might start by heading south, but as I get near Koramangala, maybe I have to go north, or take a bunch of different roads, to actually reach the place. In the same way, the initial metric that tells you to go in one direction may not hold up once you are much further down the road. Another important thing is that not all errors are the same, or equally sensitive: when you are solving a business problem you do not care about every error in the same way; some kinds of errors have much bigger ramifications, so how do you handle that? And finally, you want some determinism in the system you are training. If the problem changes slightly, do I have to redo the whole experiment? I spent six months training the model; do I now want to spend six more months because the problem is slightly different? All of these questions are open.

So what we are trying to figure out is how we can understand these black boxes better, and I will showcase two or three results from the academic community that try to take this further. Some of the questions are: what properties are learned by these linear operators? Why do we need hierarchy in the first place? Why convolutions and pooling? Why introduce nonlinearities? Why and how does learning work? What is the role of capacity? And can we have more interpretable networks? I initially thought I might answer all of them, but given the time I will move a little faster than expected, and I will be available outside afterwards for discussions.

Jumping in: first, the role of hierarchy. Let me ask this question: why do we need hierarchy at all? It is three o'clock in the afternoon, so maybe an audience answer will help. [Audience: "The same reason an army needs a hierarchy."] That is a wonderful answer. What we really want to capture is the interactions between different elements. If I have d elements and I have to look at all of their pairwise interactions, that is roughly d-squared interactions, and in our case we have a very large number of variables. One thing you can do, and this is the idea of multi-scale, used a lot in physics and also a lot in machine learning, is this: for things that are close by, you care about their individual interactions in detail, but for things that are far away, you can group them and consider only the interactions between groups, rather than between this individual element and that one. You look at local information in detail and global information only coarsely. That reduces the number of interactions you have to deal with, and that is one perspective on why we use multi-scale representations.
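As a rough back-of-the-envelope count of the saving, my own illustration rather than something from the slides: suppose we keep all interactions within small groups and only group-to-group interactions across them.

```latex
% All pairwise interactions among d elements:
\binom{d}{2} = \frac{d(d-1)}{2} = O(d^2)

% Split the d elements into d/b groups of size b, keep interactions within
% each group plus interactions between the group summaries:
\underbrace{\frac{d}{b}\binom{b}{2}}_{\text{within groups}}
+ \underbrace{\binom{d/b}{2}}_{\text{between groups}}
\;\approx\; \frac{db}{2} + \frac{d^2}{2b^2}

% Already far fewer than d^2 for moderate b, and repeating the grouping
% recursively (a hierarchy of scales) pushes the count down further still.
```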
Multi-scale, in turn, leads to hierarchy-like structures, and hierarchy gives us a smaller number of interactions. Hierarchy has been used a lot in machine learning: one line is CNN-like architectures, whose roots go back decades, and the equivalent from the probabilistic side is the Bayesian-network style of models; there are a few others, like the newer scattering networks, which I will come to.

Here is another interesting result. Think about what we do with an SVM, the more successful algorithm of yesteryear. You have an input space, and you want to transform the space with some feature map phi such that a linear separator can then separate the classes. So there is a transformation, and after the transformation the separator is just linear; originally the boundary was nonlinear, after the transformation it is linear. The question is: can we learn phi? People have put a lot of effort over the years into understanding how to learn this phi. The further question is: can we use this idea together with hierarchy, put the two together? That has actually been done recently by two groups. One is Anselmi et al., where essentially you have kernels followed by a nonlinearity and a group averaging; you learn these kernels, and they are essentially the same kind of kernels you learn in the SVM setting. There is a similar line of work done with reproducing kernels, and if you would like to, you can check those out.

Now, this is maybe one of the important slides. When you are looking at data, the question you are really asking is: what counts as "the same"? When I say two things are from the same class, what is sameness? Look at the bird here: first I am just doing a translation, shifting it (assume the background is all zeros, or all ones, it does not matter). Next is a transformation where you translate and rotate. Another kind is an affine transformation, where you stretch the data in certain directions; you can see it becomes a parallelogram. And finally there is what is called a projective transformation. Now, the networks we currently train, a simple convolutional network with just convolution and pooling layers, only give you translation invariance; they can only learn that the image has been shifted. For anything else, you have to augment the data with all the other combinations in order for the network to learn them.
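Going back to the feature map phi mentioned a moment ago, here is a minimal sketch with a hand-picked phi, the classic circle example; in the SVM world phi is fixed or implicit, and the question in the works above is whether a hierarchy can learn it instead.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two classes that are NOT linearly separable in the input space:
# class 0 lives inside the unit circle, class 1 outside it.
n = 200
X = rng.uniform(-2, 2, size=(n, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

def phi(X):
    """A fixed feature map: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.stack([X[:, 0] ** 2, X[:, 1] ** 2,
                     np.sqrt(2) * X[:, 0] * X[:, 1]], axis=1)

Z = phi(X)
# In phi-space the classes ARE linearly separable: the plane z1 + z2 = 1
# (i.e. x1^2 + x2^2 = 1) splits them perfectly.
w, b = np.array([1.0, 1.0, 0.0]), -1.0
pred = (Z @ w + b > 0).astype(int)
print((pred == y).mean())  # 1.0
```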
So one of the things people have tried is: can we learn this affine transformation directly? This is just the affine equation, which I will skip in the interest of time. And that is essentially what happens in this paper from the Google DeepMind team, called Spatial Transformer Networks. In a standard network, whatever translation invariance you get is spread across all the pooling layers, max pooling or average pooling, through the whole hierarchy. Here, instead, you insert one module, and that module learns the affine transformation directly. The gist is that the module has a localization network, followed by what is called a grid generator, and finally a sampler. As you train the network, it automatically learns the affine transformation during the training phase, and when you give it a new input, it learns how to undo the transformation, so that downstream it is as though there had been no transformation at all. If I give it a "5" that has been affine-transformed, stretched out, then after this module you actually get a proper "5" without those effects. That is one of the ideas, and it also has applications in natural language processing. Let me keep moving.

All of this is really the study of symmetry. We are asking which variations are important and which are unimportant; the unimportant variations are symmetries, and you can remove them by using modules that understand these properties. By having such modules, you reduce the sample complexity, and you can generalize many of the algorithms you already have: instead of networks that only handle translation, you can handle affine transformations or even more sophisticated ones. The area of mathematics that studies symmetry is called group theory, and it talks about the various kinds of symmetries, which classes of symmetries you want to consider, what invariances each of them gives you, and so on.
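To make the grid-generator idea a bit more concrete, here is a minimal numpy sketch of applying a 2x3 affine matrix theta to a regular sampling grid, roughly the sampling step of a spatial transformer. The theta values are made up; a real module predicts them with its localization network and uses differentiable bilinear sampling rather than nearest neighbour.

```python
import numpy as np

def affine_warp(img, theta):
    """Resample img on a grid produced by the 2x3 affine matrix theta."""
    h, w = img.shape
    # Regular output grid in normalized coordinates [-1, 1] x [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # shape (3, h*w)
    # The grid generator: theta maps output coordinates to input coordinates.
    src = theta @ grid                                          # shape (2, h*w)
    # The sampler (nearest neighbour here, bilinear in the real thing).
    sx = np.clip(((src[0] + 1) / 2 * (w - 1)).round().astype(int), 0, w - 1)
    sy = np.clip(((src[1] + 1) / 2 * (h - 1)).round().astype(int), 0, h - 1)
    return img[sy, sx].reshape(h, w)

digit = np.zeros((8, 8))
digit[1:7, 3:5] = 1.0                       # a crude vertical stroke
theta = np.array([[1.2, 0.3, 0.0],          # made-up scale + shear, no shift
                  [0.0, 1.0, 0.0]])
print(affine_warp(digit, theta))
```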
There is some interesting work on that front. Suresh Venkatasubramanian and colleagues looked at unsupervised pre-training of neural networks and found that what gets learned is a close approximation to what they call shadow groups, and there is another line of work by Pedro Domingos. I will keep moving.

Another interesting paper is the invariant scattering transform by Stéphane Mallat and colleagues at École Polytechnique. Instead of the neural networks we have been discussing, they decompose the signal into wavelets, and they build a hierarchy of wavelets with certain nonlinear operations in between. Why do you need that? The idea is that for translation invariance you want to do some averaging on top, possibly a global averaging, but the moment you average globally you may lose some of the important detail in the wavelet coefficients themselves. So you keep both the averaged data and the actual coefficients, and put together they give you a network; that is what they call a scattering network. Essentially you get a hierarchical network that is computing wavelets, in some sense. The properties they get are, one, invariance to translation, and two, stability to deformations: if you have a character and you make local changes in a small neighbourhood, stretching it a little in a small region and so on, this network captures that much better. They applied it to the MNIST dataset and got very close to state-of-the-art results, and the networks are much nicer in the sense that you can actually explain some of what is happening, although, again, not very easily.
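Here is a toy, one-dimensional sketch of the filter-then-modulus-then-average structure just described, using a crude Haar-style high-pass filter; it is only meant to show the shape of the computation, not Mallat's actual wavelet construction.

```python
import numpy as np

def toy_scattering(x):
    """Zeroth- and first-order 'scattering-like' features of a 1-D signal."""
    low = np.ones(4) / 4.0            # crude low-pass (averaging) filter
    high = np.array([1.0, -1.0])      # crude Haar-style high-pass filter

    s0 = np.convolve(x, low, mode="valid").mean()    # averaged signal
    u1 = np.abs(np.convolve(x, high, mode="valid"))  # filter + modulus nonlinearity
    s1 = np.convolve(u1, low, mode="valid").mean()   # then average again
    return s0, s1

x = np.sin(np.linspace(0, 4 * np.pi, 64))
x_shifted = np.roll(x, 5)
print(toy_scattering(x))          # features barely change under a small shift...
print(toy_scattering(x_shifted))  # ...which is the translation-invariance idea
```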
The last question: how can you make these networks more interpretable? One other thing you can do, which has also been done since the 80s, is to use graph algorithms. The idea is that you have a deep learning model, a CNN, and on top of it you put a graph algorithm, and you learn using this graph together with the neural network. The graphs, in some sense, store higher-level information, and you can pick and choose how they combine that information. For example, here is a case where you want to segment and recognize characters. You have a segmenter, then a recognition engine, then an interpretation graph, and these together produce the final result. The idea is that you propose a lot of different candidate segmentations, and for each candidate segment you try to recognize whether it actually is a character or not; then, looking across all the candidate segments, you try to figure out that there are, say, probably only three characters. Let me step back for a second: you have three characters, you want to segment them and recognize them together, and the question is how to combine those two steps. The segmentation part is a graph algorithm; you fuse that into the recognition engine, feed that into the recognition algorithm, and then there is a further graph on top of that. So that is another idea that has been suggested.

And finally, deformable parts-based models. These are models where you want to encode the fact that objects are made of parts: a human, for example, is made up of parts, and those parts can undergo transformations; I can stretch my arm, and so on. You want to encode some of that physics into the model, so you represent the parts as a graph and give edge weights that say how each part is allowed to deform. There is some interesting work showing that deformable parts models can essentially be reformulated as CNNs, after which it becomes much easier to understand what is actually happening in some of them.

So, the big picture. First, theory matters, especially when you want to solve real-world problems where interpretability is important. We spoke about transformations and invariance: pure convolution and average pooling alone only give you translation invariance, and that is often not enough, because otherwise you have to do a lot of data augmentation. We spoke about spatial transformer networks, which can learn affine properties, and about scattering networks, which give you translation and deformation invariance. And finally we saw how graph algorithms and deformable parts models can help you fuse higher-level information into these networks. I will stop here and take questions.

[Audience] You mentioned the invariance of some operators, for instance rotation invariance, symmetric in all directions. Is there a use here for a collection, a higher-dimensional object, being covariant? Take the four wavelet filters you showed: individually they are not rotation invariant, but the collection of all of them is, if you rotate the filters along with the image, so to speak; you have a rotation-covariant object. A lot of singularity theory and group theory applied to differential geometry uses that idea of covariance as well as invariance, and I wonder if there is a use for it here.

Okay, I am not aware of one; I will think about it offline, that is interesting. Any other questions?

[Audience] Hi, my name is Sudarshan; I am a novice at neural network stuff. I wanted to ask two basic questions. One is how you specialize the layers, and the second is that somewhere in your presentation I think you said something about
random search, some kind of smart random search. What is that? Is it something like annealing, where you are being smart about being random? Tell me about those two things.

Sure. The first question was how you specialize the layers. The layers I spoke about here are specialized in the sense that each does a particular kind of computation. But if you are asking how the layers themselves come to specialize, meaning how a unit ends up detecting one particular edge, for example, is that what you mean? [Audience: it learns that by itself.] Right, it learns it by itself, and that is exactly why this is all so interesting. The problem is that we also want what it learns to be interpretable: I want to be able to say "this is the kind of pattern this unit has learned", so that I can then disable it or enable it, or whatever. That is something we cannot do well; it is very hard to interpret them, and that is a problem. So specialization does happen; the questions are whether it gets learned automatically, and whether the automatically learned representations are interpretable, and that second one is a separate question.

The second part of your question: when you have these networks, there is a large number of parameters you are tuning over. One idea people have tried is a kind of coarse-to-fine search: you take a region of interest, you randomly sample within it, and once you have localized a smaller promising region, you randomly sample again within that narrower region, a hierarchical random search in some sense. That is one of the things people have been trying. There are other techniques as well; AutoML is one framework for this, where probabilistic models are used to figure out good parameter settings, so the framework automatically learns what the right parameters are.

[Moderator] I am sorry, we are going to have to stop the session; please take this discussion offline.
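For completeness, here is a minimal sketch of the coarse-to-fine random search described in that last answer, on a made-up one-dimensional objective; a real hyperparameter search would evaluate a trained model instead of this stand-in function.

```python
import numpy as np

rng = np.random.default_rng(7)

def f(x):
    """Stand-in objective: pretend this is validation accuracy vs. a hyperparameter."""
    return np.exp(-(x - 0.37) ** 2 / 0.001)   # sharp peak at x = 0.37

def coarse_to_fine_search(lo, hi, rounds=3, samples_per_round=20):
    best_x = None
    for _ in range(rounds):
        xs = rng.uniform(lo, hi, size=samples_per_round)   # random, not a grid
        best_x = xs[np.argmax(f(xs))]
        width = (hi - lo) / 4                              # shrink the search region
        lo, hi = best_x - width / 2, best_x + width / 2
    return best_x

print(coarse_to_fine_search(0.0, 1.0))   # should land near 0.37
```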