Welcome back. I know this is the session right after lunch, so it will be difficult to keep you all awake, but let us try; that is why we picked a fairly jazzy topic, and hopefully it will hold your attention. We have been looking at capsule networks for some time, trying to see whether we can get a scalable implementation of them and also make them applicable to text analytics, and that is what we will discuss in this session. This work was done when I was at Publicis Sapient, and I have since moved on to Walmart Labs. I have spoken at a number of conferences, but this is my first time at ODSC, and Abhishek is our star data scientist who has been doing a lot of good work. So, let us move on. Text analytics is applicable in a number of contexts: in social media, for example, you need to organize trends, look at how campaigns have performed, or analyze a lot of unstructured data. You also have to deal with different forms of data, and a number of learning problems come out of that, whether it is segmenting the audience, categorizing text, tagging, and so on. The approaches I am going to talk about cover a lot of these problems. The idea is to give a quick view of the history of NLP: where it came from and what approaches have been tried. Broadly there are three layers, or curves, as we will see. The syntactic layer deals with text as just symbols, nothing beyond symbols, trying to see how they are well formed, and handles simpler problems like POS tagging or text chunking. The semantic layer is where you start understanding the meaning of the text and looking at it from that angle; somewhat more sophisticated problems are solved in this layer, such as named entity recognition. The pragmatic layer is probably the most sophisticated of them all: you look at contextual information and do very complex tasks with it. We will come back to this in a second, but just look at the different approaches that have been tried. Starting from simple rule-based approaches, people have tried very complex things: OWL has been tried, RDF-based approaches too. The fundamental shift came when machine learning and deep learning started getting applied to text analytics; that is when the field really got to the next level and people started getting real breakthroughs. So essentially you have to jump the curves to get the real breakthroughs, and that is what this picture depicts.
We did see the different layers; here the same idea is put in the form of curves, and the key thing is that the tasks you do in each curve are quite different. Broadly, the syntactic curve is when you look at text as only a bag of characters, or a bag of words, and nothing beyond that. In the semantic curve you start looking at it more as a bag of concepts, with some semantic understanding of the text. And in the pragmatics curve the whole representation changes: it is no longer a bag of words but narrative-based representations and things like that. We are not there yet; the pragmatics curve is more of a futuristic thing, and in my view we are somewhere in the middle of the semantics curve. There have been a lot of advancements in the recent past, especially with Siri and all of Google's work, but we are probably still only a little way into the semantics curve, not much beyond that. To explain the same picture a little more: within the semantic curve there are two broad forms. Endogenous NLP looks only at the internal structure of the text and tries to understand it; when you start bringing in external knowledge, that becomes taxonomic NLP, and you start building taxonomies and the like out of the text. The noetic approaches are the connectionist systems, which are your neural-net-based systems. And in the pragmatic curve you have computational models of narratives, all based on narratives. Without much ado, let us step into the topic of the day. When machine learning started getting applied to NLP tasks, people had to do a lot of handcrafted feature engineering, including task-specific features. They were able to extract a lot of value out of the text and do tasks like categorization, or any of the tasks we discussed, fairly well; but it requires a lot of handcrafted features, which is not easy. Consequently, people started applying deep learning, the neural-net-based approaches, where you avoid too many task-specific features, you do not need to handcraft many features or do much preprocessing, and you may achieve better generalization. That last claim has to be taken with a pinch of salt, because we know that in certain cases even deep learning does not generalize well. And of course some of the pre-trained models tend to act as feature extractors as well.
Now consider a review like this one, simple for us, complicated for a machine. It is not easy to parse: to us it is very apparent what is being said, but the machine struggles, especially if you start with the traditional bag-of-words approach and just look at the words with the usual TF-IDF counts; then you might not interpret the review correctly. So obviously you need more sophisticated ways of looking at it. This is where word embeddings come in, and a little bit of context is taken into account: not just each word in isolation but the context in which it occurs, which gets you a better embedding of the words. These are your standard models like GloVe or fastText, and now of course you have BERT and OpenAI's GPT and so on, but all of them are in some sense trying to get you an encoding of the text so that your machine learning can kick in and work with the numbers. Still, even with embeddings, you are treating the text as a bag of words, so how do we move beyond that? That is where some of the deep learning, neural-net-based approaches come in. In a CNN, for instance, you have an embedding for every word, but when you apply the filtering operation you look at words together, the n-grams: you use a varying number of filter widths, one or two words together, then three, four, and so on, each width representing an n-gram size. That gives you the feature map in the middle layer, a broad set of features, and then the max pooling layer looks at the signal strengths, figures out which are maximal, and keeps only the signals that most influence the result. For a sentiment analysis task, for example, it will home in on the set of n-grams that are very influential and largely determine the whole sentiment, and a CNN does that very effectively. Other variations have been tried as well: the same idea works not only at the word level but at the character level, where you look at characters rather than words and take a similar approach, again with convolutions, max pooling, and fully connected layers.
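To make that concrete, here is a minimal sketch of such a multi-filter-width text CNN, assuming TensorFlow 2.x with Keras; the vocabulary size, sequence length, filter counts, and the binary sentiment head are illustrative choices, not the talk's exact configuration.

```python
# A minimal sketch of a multi-filter-width text CNN (Kim-style), assuming
# TensorFlow 2.x / Keras. All sizes below are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 20000, 100, 128

inputs = layers.Input(shape=(SEQ_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# One convolution per filter width: widths 3, 4, 5 play the role of n-grams.
pooled = []
for width in (3, 4, 5):
    conv = layers.Conv1D(filters=100, kernel_size=width, activation="relu")(x)
    # Global max pooling keeps only the strongest n-gram signal per filter,
    # which is exactly the "influential few terms" behavior described above.
    pooled.append(layers.GlobalMaxPooling1D()(conv))

merged = layers.Concatenate()(pooled)
outputs = layers.Dense(1, activation="sigmoid")(merged)  # e.g. sentiment

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```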
As opposed to this, one of the key issues the LSTMs tackle is word order: you can say "cats climb trees", but you do not say "trees climb cats". If the order is interchanged, the traditional approaches might not notice, whereas the LSTMs can address that kind of challenge, with several variations on the theme, because they treat the text as a sequence of words and can understand semantically what it indicates before performing the classification or whatever task you need to do. At a broad level, people have found that for a lot of tasks both CNNs and LSTMs perform fairly well, with CNNs tending to outperform LSTMs in most contexts, though everything depends on the kind of sentences you are given. Primarily, when a few words determine, say, the classification, that is when the CNNs perform very well and probably even outperform the LSTMs; but when there are long-range dependencies, as in the very complicated sentence we saw, it is possible that the LSTMs would outperform CNNs in those specific cases.

If you look at the fundamental approach of neural nets in general, you have a whole lot of complexity to deal with, especially once you start talking about perturbations and the different kinds of transformations you need to handle. Then there are only two possibilities: either you use kernels of very large dimensions, and consequently the number of parameters to learn increases enormously, or you say no, instead of that we will increase the data and train on larger and larger datasets. The other issue is that max pooling loses information. Max pooling looks at the maximum within a set of cells and passes on just that value, which is great when, as I said, a few influential terms drive the whole result; in those cases the max pooling layer serves you well and you get great performance. But it loses a lot of spatial information; Hinton has remarked, roughly, that the fact that it works at all is remarkable, but it is not a great idea. The underlying challenge is this: for simpler tasks such as classification you are fine with invariance, and convolutional networks are invariant in the sense that when the input changes, the output does not change; but small perturbations then become difficult to handle. What you should strive for is equivariance, where if the input changes slightly, the output changes correspondingly, so that you actually model the change in the input and can account for it. That is the essential difference between what we have been seeing all along, the CNNs and the other forms of deep learning networks, and capsules: deep networks are typically invariant, whereas capsules are equivariant. We have seen in a number of cases that convolutional networks are very sensitive to image perturbations: rotate the image slightly or add some noise, and the CNN's performance just goes for a toss. What we need instead is equivariance, and consequently we move into the world of capsules.

Just to summarize the different approaches people have tried and where they perform well: word embeddings still treat the text as a bag of words, though you get reasonable context in the sense that you can figure out where words occur, in what context, and what they mean. CNNs move away from that, into the semantics curve, but the challenge there is that long-range dependencies are difficult to handle. LSTMs handle the long-range dependencies, but they become computationally very intensive, and there is still a spatial-invariance problem that creates difficulties. As we will see, this is precisely the point where capsule networks come in: the spatially sensitive approaches tend to be exponentially inefficient, while the spatially insensitive approaches are efficient but cannot encode rich structure. That is the bind we are in, and that is the problem capsule networks set out to solve. We have a lot of material to cover, so let us come back to your question once we finish; I know it is important, but let Abhishek talk about the capsules. Thanks, Abhishek.

So, we talked about the background, the limitations of the existing approaches, especially the typical CNN: you have your bunch of filters, you create your feature map, and then you apply your pooling layer. The challenge with the pooling layer is that you are losing the spatial relationships; if you are doing max pooling, you are simply picking out one value. What the capsule network changes is that you move from the world of scalars to the world of vectors. Instead of a pooling operation that reduces a region to a scalar, we compute a vector from that region, where the different values in the vector represent different things, so we preserve multiple properties. If you are dealing with images, the vector can represent, say, color, stroke thickness, and other attributes; if you are working in the world of text, the capsule can encode morphology, semantics, and other aspects. The idea is that you are not losing information by just taking the max value; you are representing the entire feature map as a set of values, and this set of values encodes all of those properties. Another unique and important property of capsules is that the vector does double duty: its components encode the properties, also known as instantiation parameters, while the length of the vector encodes a probability, for example whether the sentiment is negative or whether the input belongs to a certain class. This slide is a good summary of what a capsule is: it is nothing but a vector which encodes multiple properties, which we call the pose, and the length of the vector defines the probability of a certain class. That is it.
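As a toy illustration of that summary, a small sketch assuming NumPy, with made-up numbers:

```python
import numpy as np

# Hypothetical 4-dimensional capsule: each entry is one instantiation
# parameter (for text these might capture aspects of morphology or
# semantics; for images, things like stroke thickness or color).
capsule = np.array([0.3, -0.2, 0.5, 0.1])

# The length (norm) of the vector is read as the probability that the
# entity this capsule represents is present; the squashing function
# described next keeps it in [0, 1].
print(np.linalg.norm(capsule))  # ~0.62
```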
Now, in a typical neural network you have your different layers, where the neurons are stacked and connected from one layer to the next. The same thing happens in a capsule network, but instead of single values you now have capsules, and every capsule is a vector. So how does the forward computation happen? Recall how it works in a regular network: you have a bunch of scalar values, you apply weights, take a weighted sum, and apply a nonlinearity, which could be a ReLU or something similar. The approach here is slightly modified, because instead of a single value you now have a k-dimensional vector, where k is simply how many instantiation parameters you would like to use to encode the properties. First we perform an operation that we do not do in a regular network, an affine transformation, which is nothing but a multiplication of the input by a certain matrix; so now it is a matrix computation. Then we multiply that result by a coefficient, which in the ordinary case would be a weight; here it is called a coupling coefficient, and it is very crucial in the world of capsule networks. Then you take a weighted sum, just as before. And where you would normally apply a ReLU or one of the usual nonlinearities, what we use here is a rather novel squashing function. It is actually a very intelligent approach: since you are now dealing with vectors, it takes the vector and, without changing its direction, alters its length to bring it between 0 and 1. The benefit is that you are always dealing with probabilities, which we know must sit on a 0-to-1 scale, so towards the final output you directly get a value squashed between 0 and 1. Then you have to run a training process. For any neural network we know how the forward computation happens and that you then use your typical backpropagation to train it. In a capsule network the transformation matrices are still trained that way, but the coupling coefficients are not learned by backpropagation; they are set by another technique, called dynamic routing, in which we adjust these coupling coefficients. Let us take a quick snapshot of how that happens. Assume you have lower-level capsules in the lower layer and, say, two capsules in the higher layer. You run the forward computation just as we described, the affine transformation, and for all the lower-level capsules you compute their individual final predictions, the prediction vectors. Then, if more of the lower-level capsules make similar predictions, you give those connections more weight. A quick example: say you are building facial recognition. The lower-level capsules are, let us say, the nose, the eyes, and so on, and the higher-level capsule is the face itself. If more of the lower-level capsules agree that the eyes are in a certain position and the nose is in a certain position, then it is actually a face; otherwise it is not. So you can think of the higher-level capsule as the face itself and the lower-level capsules as these lower-level constructs.
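Here is a minimal NumPy sketch of that routing loop, following the dynamic routing-by-agreement of Sabour et al. (2017); the shapes, the iteration count, and the toy data are illustrative.

```python
# A minimal sketch of routing-by-agreement, assuming NumPy.
import numpy as np

def squash(s, eps=1e-9):
    """Shrink a vector's length into [0, 1) without changing its direction."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def softmax(b, axis=-1):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, n_iters=3):
    """u_hat: (n_lower, n_upper, dim) prediction vectors, i.e. each lower
    capsule's affine-transformed guess about each upper capsule."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))            # routing logits, start uniform
    for _ in range(n_iters):
        c = softmax(b, axis=1)                  # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum per upper capsule
        v = squash(s)                           # (n_upper, dim) output capsules
        b += (u_hat * v[None]).sum(axis=-1)     # agreement: dot(prediction, output)
    return v

# Toy example: 6 lower-level capsules routing into 2 upper-level capsules.
u_hat = np.random.default_rng(0).normal(size=(6, 2, 8))
print(route(u_hat).shape)  # (2, 8)
```

The key design choice is that the coupling coefficients are recomputed from the agreement logits on every pass, so each lower-level capsule gradually commits its output to whichever higher-level capsule it agrees with.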
Primarily, this is similar to what happens in a typical deep network, where the initial layers capture lower-level features, just like contours or edges, and you then merge all of those lower-level constructs to create higher-level constructs. It is exactly the same here: these are my lower-level constructs and these are my higher-level constructs. The only point I want to make is that, unlike the traditional training process, the training here is slightly different and uses the routing-by-agreement algorithm: if more of the lower-level capsules agree, that is, their predictions are similar in nature, we give them more weight; otherwise we give them less weight. That is it. Internally it also applies techniques like softmax routing; we will exclude those details, but the detailed algorithm is on the slide. On a high level, this is how it typically works: you still have your image, and you still run the convolution operations you normally do to extract features, but after these convolutions you apply your capsule layers. This is the middle, primary-capsule layer; the subsequent layer is your final digit-capsule layer; you can continue with further layers, and the training happens through routing by agreement. The benefit is that, unlike the traditional CNN, you get viewpoint invariance rather than only the typical translational invariance. What that means is that if you have images of the same object from multiple angles, the CNN will actually fail, but capsules try to solve the problem by preserving the entire hierarchy of how the object was composed, which is why they are able to recognize it from other angles. This was tested on the smallNORB dataset, and you can see that even across the different view angles the performance was really great.

The same philosophy works if you deal with text. The capsule network paper came from Geoffrey Hinton's group and was primarily about images, and then people naturally started looking at the value of capsules for other activities; text was obviously an important consideration. You can leverage the same philosophy of capsules and apply it to text, and lots of papers started popping up; we picked a few of them just to explain how it works. The core idea remains the same: you have your actual text in its embedding space, you take your n-grams, because now you are dealing with text, not images, then you create these convolutional capsule layers, multiple such layers that train over the different iterations, then your final flattening, and finally the fully connected capsule layer. There are a few other things that were done in this architecture; I have given links if you are more interested in exactly how it works, and since we have lots to cover we will jump ahead. On a very high level, a few things were added especially for dealing with text. Images are one thing, but in text you may also have a lot of noise, for example lots of stop words and words which are not actually required. If you have ever worked with text analytics, you know that part of the usual initial preprocessing is removing stop words and so on; this architecture handles much of that for you.
The idea was that, in the final layer of that architecture, they created an additional category they call the orphan category. The assumption is that all the background noise gets routed to the orphan category rather than to a real class, and that actually resulted in very high performance; we will see some of the numbers in a bit. The other thing they did was tweak the final softmax. A typical softmax just normalizes scores to between 0 and 1; here they used a special variation called a leaky softmax, primarily so that whatever drains into the orphan category does not attract a lot of computation, so this is again a computational enhancement. This is what the final architecture looked like. It was tested on multiple standard datasets, reviews datasets and sentiment datasets, and you can use it for different classification tasks. There were a couple of variations. The first, called Capsule-A, takes the initial text, applies the embedding, and runs it through the primary convolutional capsule layers and onward through the network; the other variation, Capsule-B, uses 3-grams, 4-grams, and 5-grams in parallel, combines them, and finally takes an average. If you look at how it performed against the traditional techniques, the LSTMs and the CNNs, you can see the capsules outperformed all of them. But the biggest benefit is not those numbers; the gains may not look that great because you had already reached around 88 percent accuracy even with a CNN. The biggest benefit showed up when we took the same algorithm and applied it to multi-label classification, where instead of a single label you predict multiple labels. Say you are doing tagging: you have a certain text and you need to attach all the tags that apply, so you have to associate multiple labels with one text. That is a challenging task, because your label space explodes from N to 2^N: any combination of labels can occur. If you use the traditional CNN or LSTM approach, you need a lot of data to build that kind of model with reasonably high accuracy. With capsules, however, using the same data, trained only on single labels, the model could still predict multiple labels, and you can see the performance was quite high. That was another benefit you can leverage when you are working with capsules.
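Both tweaks are easy to sketch, assuming NumPy; this is my reading of the idea, not the papers' exact code: a softmax with an extra "leak" dimension standing in for the orphan category, and multi-label prediction read directly off the class-capsule lengths.

```python
import numpy as np

def leaky_softmax(logits):
    """Softmax with an extra always-zero logit, so that background noise
    can drain into it (standing in for the orphan category) instead of
    being forced onto a real class."""
    leak = np.zeros(logits.shape[:-1] + (1,))
    scores = np.concatenate([logits, leak], axis=-1)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = scores / scores.sum(axis=-1, keepdims=True)
    return probs[..., :-1]  # drop the leak dimension

def predict_tags(class_capsules, threshold=0.5):
    """Multi-label prediction: emit every class whose capsule length
    clears the threshold, so one text can carry several tags."""
    lengths = np.linalg.norm(class_capsules, axis=-1)
    return np.where(lengths > threshold)[0]

# Toy usage: 5 classes, 8-dimensional class capsules.
caps = np.random.default_rng(1).normal(scale=0.3, size=(5, 8))
print(predict_tags(caps, threshold=0.7))
```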
Multiple variations came out after that, with people trying different techniques and different kinds of layers; one variation used ELUs, the exponential linear units, and again the performance was good. Another thing that was tested was multi-task learning: normally you have one particular task to handle, say a classification, so the question was whether you can take one model built for one task and apply it across different tasks. Here the idea was to leverage these same networks and see whether a model trained for sentiment classification can be applied to, say, categorization, and what benefit you get from that. As I mentioned, we are still living in a world of narrow AI: if you build a system for one particular purpose, it will not perform well on another. Multi-task learning is not exactly strong AI, but even within text you have multiple kinds of tasks, sentiment classification, categorization, and so on, and the key question was whether you can build one model and apply it across several of them.

Everything so far was more on the academia side: a lot of papers, and we tried a lot of the code. But when you apply this in industrial settings, what challenges do you get? If you are working on a product in industry, you probably have a much bigger dataset, and while we tend to feel the algorithm is paramount, in the bigger scheme of things it is not: the model is only one portion of the entire machine learning pipeline you may have to build before you realize value from your investment. You have your data processing logic; then you want to build your model and train it at scale, because you are dealing with a large volume of data and you do not always want to train on your laptop, so you need distributed training; you need to keep track of the multiple experiments you are running; and at the end of the day you have to serve the model, exposing an endpoint that can be consumed by other services. That is why we started exploring Kubeflow. I am not sure whether you have worked with Kubeflow, but it is an awesome tool for establishing this entire workflow: a machine learning toolkit that runs on Kubernetes. Kubernetes is already used for production workloads, so the question here is: can we run machine learning workloads on Kubernetes? It gives you a lot of tooling for these scenarios. You can start in a typical Jupyter notebook, which is familiar ground; then, say your scripts and notebooks are working fine and you want to do something bigger, such as distributed training, where your data is sharded across multiple partitions and you want a parameter server and multiple workers to train. For that you can leverage TFJobs, and we will see a demo. Then you can serve it, which means you extract your final model object, expose it as an API, and consume it from other services.
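As a concrete sketch of that "train at scale" step, assuming TensorFlow 2.x: tf.distribute.MirroredStrategy covers the multi-GPU case on a single node, while under a Kubeflow TFJob the same script would instead pick up its cluster layout (parameter server, workers) from the TF_CONFIG environment variable. The tiny model below is a stand-in, not the capsule network itself.

```python
import tensorflow as tf

# Replicates the model across all visible GPUs on this node; for the
# multi-node TFJob case you would swap in a multi-worker strategy that
# reads the cluster spec from TF_CONFIG.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Stand-in model; in the talk's setting this would be the capsule network.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# model.fit(...) now splits every batch across the available replicas.
```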
Beyond that, you can build the entire pipeline: raw data in, some data processing logic, then deployment, and perhaps some metrics you want to explore at the end. You can also track your pipelines. If you are dealing with different hyperparameters, different datasets, different features, you want to keep track of which feature set with which hyperparameters gave you the best result. One of the key challenges in industrial settings is that you usually work in a team and do lots of iterations, which is very common when you are dealing with the latest techniques, because you want to play with them and see which variations actually result in better performance. So you can keep track of all of it and compare across experiments. It also has lots of other integrations, which I will skip, but if you are building a CI/CD pipeline you can leverage Argo, which is used here for machine learning pipelines and CI/CD integration.

To quickly summarize what we did: we took the capsule networks and modified them so that they can run at scale, and we deployed that on Kubeflow. Sometimes you also want to leverage multiple GPUs, which is actually quite difficult out of the box, so we leveraged the mirrored-strategy approach, and I will show how. We also looked at whether we can train on multiple nodes, full horizontal scaling, which is where the distributed TFJob comes in, and then at hyperparameter tuning. Hyperparameter tuning sounds like a very common task, but look at the challenge: if you are working on a large volume of data, even one particular set of hyperparameters can take a few hours, and a typical grid search will take ages, because every combination takes its own time. What Katib does is run everything in parallel, each trial in its own Kubernetes pod, and let you compare across the trials.

Let me quickly jump in and show the working pieces; all of the code is shared on GitHub, and I will share the link towards the end. Once you set up Kubeflow you get several things. I will start with the notebooks environment, where you can create notebook servers and specify which image to use, because everything here is containers: you take your code and all your dependencies and put them into one container, point the server at the right Docker image, and specify the amount of CPUs and memory you need. You can also have volume snapshots. That means if you are working in a team and one person, who is good at data exploration, runs through the first few notebook cells and saves a snapshot, another person, who is good at modeling, can start from that snapshot and get all the data as well as the whole environment. You can attach different persistent volumes, and if you have GPUs you can configure them too; all you have to say is "I need two GPUs", provided, obviously, that GPUs have been added to your cluster.

So that is the notebook environment, and you can run through how the entire capsule network works in it. If you are running on GPUs, you can leverage the mirrored strategy: you configure, say, two GPUs for your TensorFlow job to run on, and that takes care of the multi-GPU strategy. Let me show a couple more things. This is your TFJob, and I have shared multiple variations here. For example, to run in a distributed fashion with one parameter server and three nodes, you simply specify one parameter server and three workers, it will automatically scale and run across them, and you can dig deeper into what is happening on each individual worker. Then there is Katib, which is primarily for hyperparameters: every line here represents one hyperparameter combination, and you can compare which combination actually gave the best accuracy, run multiple variations, and so on. Out of the box it also supports different optimization algorithms: grid search is one, and you can have Bayesian search too; simply replace "grid" with "bayesian" and it will apply Bayesian search.

Coming back for a quick wrap-up of what we have just done: there are some more basic implementations you can check out, the distribution strategies we already talked about, and the hyperparameter tuning. I think this is probably the last slide. Going further, we are also looking at other, more efficient algorithms. We saw how we represent things: we replaced the scalar with a vector, assuming the vector will encode all of the parameters. In fact, matrix capsules have now come out, so instead of going from scalar to vector you can now use an entire matrix, where the matrix encodes all the properties, and those are really helpful for solving some very complicated text analytics challenges. So, thanks everyone. All of the slides are already up; if you are more interested in Kubeflow itself, we presented at the Strata Data Conference as well, the links are there, and feel free to reach out on LinkedIn. Any questions?

Hi, interesting talk. Two questions. One: you mentioned a bit about BERT and the transformer networks, but I did not see a comparison on your tables, so I am curious whether you tried anything and how they compare. Two: I also did not see much about real-world applications, inside Walmart Labs for instance, of capsule networks. I work in a very close space and I work with transformers daily,
so I am just wondering, if I have to make a switch, I am looking for evidence here in an industrialized setting.

I will take the first one. If you look at the architecture, the very first layer of any of these capsule networks is itself the encoding, which means you can plug in any of your encoding techniques; you are not getting away from that. As we presented, an encoding is just a representation of one word; what we are trying to do here is model, on top of the word encodings, the spatial relationships between those embeddings. We have not done that particular experiment as of now, and it would be an apples-to-oranges comparison anyway: here we are trying out techniques which model the relationships, and we are seeing whether capsules beat them or not. The point is that the embeddings are orthogonal to the technique; you can use that embedding in all of these architectures as well, so you cannot compare a specific embedding technique against an entire architecture. You can replace the embedding with whatever you like; what we are asking here is whether changing the architecture itself helps.

With respect to the applications, in Walmart Labs or anywhere else, there are plenty of applications of text analytics; text analytics is pretty common and people use it everywhere, and we use a lot of it at Walmart Labs too. With capsules, though, we are experimenting to see whether they are useful or not, and we are still at that stage, so I am not going to tell you we have used capsule networks extensively; no, we are experimenting. We have identified a couple of pockets, especially in text, around extracting attributes from text, the kinds of problems we treat as classification, where it is likely that capsules will outperform. We have just started experimenting to see whether they actually beat the traditional approaches: we have built transformer- and encoder-based solutions, and now we are building a capsule-based solution and comparing it, so probably next time we talk I may have more updates on that. The broader question was whether this is applicable in a Walmart kind of context and how much of it has been productionized. There it is mostly experimental, because capsules are still not widely used in industry, and we are trying to get there; obviously an experimental stage. And yes, the entire code is already up: click through to the first link and we have shared the entire steps for replicating this on Kubeflow, so you can go look at it and experiment with it. The multi-label one, yes, I think we shared the details for that as well.

On the data question: currently, just for the demo, you can have your data in, say, an HDFS cluster and point to it, or have the data in cloud storage, which is how it is typically stored, for example as files for different dates. You point the parameter server at it, it distributes the work among the different workers, which operate on the smaller files, and it aggregates all of the parameter updates back into a single set of parameters. That is what the TFJob takes care of; it does depend on how the data was created and whether it is already partitioned like that.

Can you briefly explain how the vector information is encoded in the capsule?

We talked about that in the first slide. One capsule is nothing but a vector, which means one particular value might represent, say, length, another the stroke thickness, or in the case of text it could capture different aspects of morphology or semantics, and you can assume that during the training process all of it gets updated. You do have to specify how many instantiation parameters you need; that is a hyperparameter for you. In the papers we used, 16 was a good enough number, and in our tests it worked really well. Yes, I think the TFJob provides that. Maybe we can connect offline; I think there are still questions, so people, feel free to