When we train a neural network, we adjust its weight parameters so it can do just one task, and do it very well. That could be classifying images, performing object detection, or determining the type of audio in a sample. It could be any of these tasks, but only one of them. Wouldn't it be cool to have one network that does everything? The same network that takes speech input and converts it into text. The same network that takes an image input and recognizes the objects in it. Or the same network that takes English text and translates it to French or German, and so much more. In this video, we're going to take a look at the crux of such a multitask network and see what results we can get out of it. We'll follow the mechanism used by Google in their 2017 paper, One Model to Learn Them All.

From the highest level, we're going to train the network on eight separate tasks. The first is speech recognition, converting speech to text, which can be done with the WSJ speech corpus. Second, we'll perform image classification, that is, determining the object in an image, using the ImageNet dataset. Third, we'll construct captions for images by training on the COCO dataset. Fourth is parsing text input so as to get its part-of-speech constituents over the word tokens, which can be done with the Wall Street Journal parsing dataset. Next, we can use WMT's English-German translation corpus for, you guessed it, English-to-German translation. Then we can perform the reverse, German-to-English translation, which uses the same corpus. And we do the same for French: English-to-French translation using the WMT English-French translation corpus. The final task is French-to-English translation using that same corpus.

Now let's talk about the MultiModel architecture. The multitasking model consists of four main components. The first is the modality nets, which convert the inputs to a universal input representation. Then we have an encoder, which processes the inputs. We have an I/O mixer, which mixes the encoded inputs with the previous outputs. And then we have an autoregressive decoder, which processes the encoded inputs and the mixture to generate the output. I'll explain these components in detail, starting with modality nets.

Different tasks require inputs of different types and different sizes. To accommodate this difference, the MultiModel incorporates modality nets that convert the original features into a universal feature representation. This is done differently for different modalities, like text, speech, and image inputs. The network also requires modality nets to transform its outputs from the universal representation back to the original form. Modality nets are designed to be computationally minimal: they perform a transformation and leave the rest of the processing to the crux of the multitasking network. Importantly, modality nets are designed per domain, not per task. So instead of dealing with different modality nets for, say, an image captioning problem and an image classification problem, which share the same input domain (both deal with input images), we can pass both through the same modality net. This improves generalization and allows new tasks to be added on the fly without much change to the overall network. Modality nets are also allowed to produce different-size universal representations for different domains, since forcing a fixed-size representation would only hinder performance.
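To make the per-domain idea concrete, here's a minimal sketch of how modality nets might route two domains into one universal representation. This is my own illustration in TensorFlow, with made-up layer sizes and names, not the paper's actual code:

```python
import tensorflow as tf

# Hypothetical sketch: one modality net per input *domain*, not per task.
# Every net emits features of the same depth d_model, so the shared
# encoder never sees raw text, images, or audio directly.
d_model = 512  # assumed size of the universal representation

language_net = tf.keras.layers.Embedding(input_dim=8000, output_dim=d_model)

image_net = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(d_model, 3, padding="same"),
])

def to_universal(x, domain):
    """Route an input through the modality net for its domain."""
    # Image captioning and image classification both take the "image"
    # branch, because modality nets are shared per domain.
    nets = {"text": language_net, "image": image_net}
    return nets[domain](x)

tokens = tf.constant([[17, 42, 7, 99]])    # a tokenized subword sequence
print(to_universal(tokens, "text").shape)  # (1, 4, 512)
```

The point of the sketch is the routing: adding a new image task wouldn't need a new modality net, just another consumer of the existing image branch.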
I'll talk about the types of modality nets in a bit, but first let me explain the other parts of this network architecture. The encoder, the mixer, and the decoder are made up of three major fundamental components: depth-wise separable convolution units, attention mechanisms, and sparsely gated mixtures of experts. Each of these topics warrants its own video, but I'll keep it brief for this simple explanation.

Let's start with depth-wise separable convolutions. They were introduced in a neural network architecture called Xception. Basically, it's a faster method of performing convolution with significantly fewer multiplication operations, which becomes more important when dealing with large networks with billions of parameters. Essentially, depth-wise separable convolution is performed in two phases: a depth-wise convolution phase followed by a point-wise convolution phase. The depth-wise convolution applies a different filter to each individual input channel, very different from standard convolution, which applies each filter across all channels at once. Then we have the point-wise convolution, a 1x1 convolution that creates a linear combination of the outputs of the depth-wise convolution. The ratio of the number of multiplication operations between standard convolution and this two-step depth-wise separable convolution becomes significant for larger networks. For more details on the little derivation on screen, and on useful places where this is used, check out my video on depth-wise separable convolution.

Now let's move on to attention mechanisms. Attention in neural networks is somewhat similar to attention in humans: the network focuses in high resolution on certain parts of the input while the rest of the input is in low resolution, or blurred. Attention finds applications in neural machine translation, translating one language to another. In generative adversarial networks, Microsoft's AttnGAN generates images from sample text, for example. Attention can also be used with recurrent neural networks to generate answers to questions about a book. If you want to know more about attention in the context of neural networks, check out my video in the info card at the top.

Next we have the sparsely gated mixture of experts. A big problem with neural networks in general is the amount of training data required to get decent results, and companies like Google and Facebook have that kind of data. However, there is another problem: while training neural networks, after every sample or every batch of samples, the entire network is updated, despite the fact that many neurons may not have been involved. Furthermore, the larger the network, the more processing power training requires. That was the case until Google's paper on the mixture-of-experts layer. An MoE layer consists of a number of experts, each of which is a neural network specialized in handling a specific aspect of a given task. The gating network in an MoE layer takes the input x and determines which set of experts to consult, weighing the importance of the i-th expert through a gating value G(x)_i. In the figure, the gating network takes the input x and consults primarily expert 2 and expert n-1; the output is the weighted mean of the outputs of these experts. In the MoE blocks of our current network, there is a pool of 240 experts when training on the eight tasks jointly, and 60 experts when training on each problem separately.
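To make the multiplication-count claim concrete, here's a quick back-of-the-envelope comparison with made-up layer sizes (my own illustration, not a figure from the paper):

```python
# Multiplication counts for one conv layer (illustrative, assumed sizes).
H = W = 64                   # output feature-map height and width
c_in, c_out, k = 256, 256, 3 # channels in/out, kernel size

standard = H * W * c_in * c_out * k * k   # fused k x k convolution
depthwise = H * W * c_in * k * k          # phase 1: one k x k filter per channel
pointwise = H * W * c_in * c_out          # phase 2: 1 x 1 channel mixing

ratio = (depthwise + pointwise) / standard
print(f"separable / standard = {ratio:.3f}")  # ~0.115
```

The ratio works out to 1/c_out + 1/k^2, so with a 3x3 kernel the separable version needs roughly a ninth of the multiplications once c_out is large.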
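And here's a toy sketch of the sparse gating step just described, in plain NumPy with hypothetical shapes and linear experts; it shows the mechanism, not the production layer:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, d_out, k = 8, 16, 16, 2  # consult only the top-k experts

W_gate = rng.normal(size=(d_in, n_experts))  # gating network: one linear map
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]

def moe(x):
    scores = x @ W_gate               # gate scores for every expert
    top = np.argsort(scores)[-k:]     # sparsity: keep only the top-k experts
    g = np.exp(scores[top])
    g /= g.sum()                      # softmax -> weights G(x)_i
    # Output is the weighted mean of the selected experts' outputs.
    return sum(g_i * (x @ experts[i]) for g_i, i in zip(g, top))

y = moe(rng.normal(size=d_in))
print(y.shape)                        # (16,)
```

The key detail is the argsort line: only the top-k experts run at all, which is what keeps a huge pool of experts computationally affordable.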
Like I mentioned before, the encoder and decoder are constructed from convolution units, attention mechanisms, and sparsely gated MoEs. The encoder has six convolution blocks and a mixture-of-experts layer. The I/O mixer contains two attention and convolution blocks, and the decoder contains four attention and convolution blocks.

Now let's jump back to modality nets and discuss their architecture. To solve these eight tasks, we require four modality nets, one for each type of input: language (that is, text), images, audio, and categorical data.

With the language modality net, we convert text to the universal internal representation. Input text is tokenized into constituent subwords from an 8,000-subword dictionary, and the subword vectors are then encoded. On the output end, the modality net takes the decoded output from the MultiModel network and performs a softmax operation to produce a probability distribution over the most likely subwords. By the way, subwords can be things like monophones, triphones, or syllables.

Now, with the image modality net, we convert images to the universal internal representation. Specifically, we increase the input depth using residual convolution blocks. More specifically, we apply 3x3 filters to the input image x and get the convolved output c1. We repeat the same convolution step and then apply max pooling with a 3x3 window and a stride of 2. This pooled output is added to a 1x1 convolution of the input. The two 3x3 convolutions, the max pooling, and the 1x1 convolution together constitute the residual convolution block, ConvRes, as we see here. You can see it in action as we convert the image x into the internal universal representation. The difference is that, well, the representation is very deep: d here is taken to be 1024. First we have a convolution with 32 3x3 filters, followed by a convolution with 64 3x3 filters. Then the set of two 3x3 convolutions, max pooling, and 1x1 convolution is performed three times, as I just explained for the residual convolution block. This gives rise to a feature column of depth d. In the case of an image, we only need to convert the input into the universal internal representation; the output is usually not an image but a category, and hence we don't require an image-modality-out unit.

This brings us to the next type of modality net, the categorical modality net. This is used when the input is an image or audio and the output is a category, much like the case of determining whether an input image is of a dog or a cat, or, in the context of audio input, determining which type of urban noise an audio clip corresponds to: children playing, the sound of an engine, the sound of a jackhammer, and so on. This involves performing two steps of convolution, h1 and h2, then pooling with a 3x3 window and a stride of 2. We then add the result to a 3x3 convolution of the input x with a stride of 2. To this new result, h3, we apply 1536 3x3 kernels to get the output volume h4, and similarly apply 2048 3x3 filters on h4 to get h5. Then we apply a ReLU activation, which doesn't change the shape of h5, and down-sample the features with global average pooling to get h6. Finally, we apply a point-wise convolution with kernels corresponding to the class weights to get the output.

Now let's talk about the audio modality net. The one-dimensional raw audio can be converted into a two-dimensional spectrogram, which is basically a 2D plot with time on the x-axis and frequency on the y-axis. The spectrogram can then be treated as an image input and passed through the image modality net, like I talked about before.
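Here's a sketch of that ConvRes block and the image modality net described above, in TensorFlow. The strides, padding, and the intermediate filter schedule are my assumptions for illustration, not the paper's published hyperparameters:

```python
import tensorflow as tf

def conv_res(x, filters):
    # Two 3x3 convolutions, then 3x3 max pooling with stride 2,
    # added to a 1x1 convolution skip path (strided so shapes match).
    c1 = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    c2 = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(c1)
    pooled = tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same")(c2)
    skip = tf.keras.layers.Conv2D(filters, 1, strides=2, padding="same")(x)
    return tf.keras.layers.Add()([pooled, skip])

inputs = tf.keras.Input(shape=(224, 224, 3))
h = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
h = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(h)
for f in (256, 512, 1024):  # assumed schedule ending at depth d = 1024
    h = conv_res(h, f)      # the ConvRes block, applied three times
```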
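And since the audio pathway reuses the image one, the raw-audio-to-spectrogram step is straightforward. Here's one way it might look with SciPy, using a made-up sample rate and a stand-in waveform:

```python
import numpy as np
from scipy import signal

fs = 16000                           # assumed sample rate (Hz)
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 440 * t)  # stand-in 1-D waveform: a 440 Hz tone

# 2-D spectrogram: rows are frequency bins, columns are time frames.
freqs, times, Sxx = signal.spectrogram(audio, fs)
spec_image = np.log(Sxx + 1e-10)[..., np.newaxis]  # add a channel axis, image-style
print(spec_image.shape)              # (freq_bins, time_frames, 1)
```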
So we've got the entire network architecture out of the way. That's good. Now let's try to answer a few questions about performance.

Question one: how far is the MultiModel, trained on eight tasks simultaneously, from state-of-the-art results? Here is a table comparing the MultiModel implementation in TensorFlow with the state of the art on three tasks. Sure, the numbers are slightly lower, but for a multitasking model, the results are not too shabby.

Question two: how does training on eight tasks simultaneously compare to training on each task separately? The MultiModel performs on par for certain tasks, but performs even better on tasks where less data is available, as in the case of parsing.

Question three: how do the different computational blocks discussed before influence the different tasks? To answer this question, we compare the performance of the full MultiModel network, the network without the mixture of experts, and the network without the attention mechanism. Since MoE and attention help in neural machine translation, let's check out the performance of English-to-French translation and see how it's affected. We'll also include the comparison on ImageNet. Interestingly, the removal of these blocks didn't really affect performance on ImageNet, a task that doesn't depend on them. So why is that result interesting? It shows that including components that are useless for a specific problem does not decrease the performance on that problem. So we can add components to this MultiModel to improve the performance on some task A without hurting the performance on another task B.

So here are a few points to remember from this video. It is possible to design a single multitask neural network capable of performing different tasks across domains. The MultiModel architecture has four basic components: modality nets to convert different input types to a universal internal representation, an encoder to process inputs, a mixer to combine encoded inputs with previous outputs, and a decoder to generate outputs. The encoder, mixer, and decoder are constructed based on the problems to be solved; for this specific eight-task problem, they are made up of three basic components, namely depth-wise separable convolution blocks, attention mechanisms, and sparsely gated mixtures of experts. And finally, the inclusion of components in this MultiModel architecture does not adversely affect the performance of tasks that don't use those components.

And that's all I have for you now. The paper One Model to Learn Them All shows it is certainly possible to construct a single neural network to solve problems in different domains. It'll be interesting to see how this pans out in the future; we should be seeing even better results pretty soon. I've linked the main paper along with useful resources in the description down below. Thanks for stopping by today, and if you liked this video, hit that like button. If you really liked this video, hit that subscribe button. And if you just love me, hit that bell right next to the subscribe button. And I'll see you in the next one.