So I'm going to share my topic, which is auditory intelligence. Hi, my name is Naoya Takahashi, from Sony Corporation, Japan, and I'm very excited to be here and share my work. Sony actually has a software center here in Bangalore, and I have been working with them since 2011. Every two years I come to Bangalore, and every time new buildings have been constructed; I feel like I'm coming to a different city each time, which suggests that this city attracts industry from all over the world. So I think it's really great to have ODSC here, and I'm very happy to be invited. After this conference I will also attend another conference, held in Hyderabad, on speech processing, which is one of the biggest in that field. So not only industry but also academia is paying more and more attention to India, and I'm very happy to have a good connection with people here.

Okay, so in the next 40 or 45 minutes I'm going to share my research work, which is auditory intelligence, or AI. I know AI usually stands for artificial intelligence, and that term mostly focuses on the brain. Today I'd like to think about intelligence together with sensors, especially auditory sensors. We have five senses — somebody may have a sixth sense, but usually it's five — and almost every signal is processed in the brain to extract higher-level information: what somebody said, what a material is made of, whether a restaurant is good or not. Since our products and research mainly focus on the visual and audio domains, let's focus on those. Nowadays deep learning has made huge progress in these domains. For instance, face recognition has already surpassed human ability, so a machine can recognize faces more accurately than a human. A smartphone app can create a smiling face from a neutral face, so everybody in the world can be happy, at least on a smartphone. Machines have also started to draw, even art: in this example, only a photo and a style image like this are provided, and the machine automatically generates an artistic, painting-like version of the photo. In the audio domain, automatic speech recognition has also reached human parity, and machines have started to play music like this. To generate this sound, no music score is given; the machine simply listens to many, many piano recordings and learns to generate piano music it has never heard.

Although these results look amazing, there are still a lot of things machines cannot do. For instance, what can a machine understand about this image? It can understand there is a woman, there is a puppy, there is maybe text in the poster, but it does not understand the story behind it or how funny it is. Or speech recognition works sufficiently well for major languages but not for minor languages or dialects. I heard that in India more than 100 languages are spoken — that's amazing. 2,000? Wow. And how many of them have automatic speech recognition? So it would be great to have speech recognition for any language. It would also be great if machines could even surpass human hearing, for example extracting only one instrument's sound from music in a noisy environment. Before addressing these problems, let's think more about sensors by reviewing the history of the evolution of life forms.
Before about 540 million years ago, life forms are thought to have been very simple, like this — just drifting in the water and eating whatever crossed by, a very easy life compared to nowadays. But around 540 million years ago the Cambrian explosion happened, and the number of species increased exponentially. The reason this happened is still not clear, but one theory, proposed by Andrew Parker, says that the appearance of eyesight triggered the explosion. With eyesight, a species can catch prey or escape from its enemies, and also adapt its habitat or active time. Some species further developed the auditory sense, and together with the brain it enabled them to communicate with each other and exchange information and knowledge. So it seems that extending the sensing range is one key strategy to evolve and prosper.

Now we are developing AI, and I think of AI as a kind of evolution: evolving ourselves by sensing the ocean of multimedia. It makes it possible to sense beyond physical limitations like time or location. Every day many videos are uploaded to the internet, and many live cameras are connected to it as well, yet we still have very limited capability to sense them and make use of them: the machine only understands the transcription or metadata that was attached manually; it does not understand the content itself. Auditory intelligence addresses this problem. Its goal is to enable machines to understand audio content itself, making it possible to create new content, extract information for you, or search videos for you. With auditory intelligence, communication with machines can be more natural, and machines can understand the world as we do.

Okay, so auditory intelligence looks helpful for our lives, but why sound? Computer vision is also good, so is it not enough? Well, sound has complementary advantages. First, sound covers a wider area: even if an object is occluded or out of the camera's view, its sound still reaches the microphone. Second, sound is sometimes easier or faster — many people find that speech commands are a faster and easier way to control devices. And it is sometimes more informative: it carries information about emotion or material. Look at this droid — it doesn't have any facial expression, but it can express its emotion very fluently, like this. Or we can get some idea just by hearing a hitting sound: do you tap a watermelon before you buy it, to check whether it's sweet? The sound contains information, right?

So there are still a lot of things to be investigated to achieve auditory intelligence. Today I'm going to introduce four of my works towards it. First I address speech recognition for low-resource languages, where current ASR systems still struggle. Then I move on to audio event recognition, to understand non-speech signals better. After that I address video highlight detection, which needs high-level scene understanding. And finally I address something that goes beyond human auditory capability, namely source separation. So first, automatic speech recognition for low-resource languages. This work was done when I was studying at ETH Zurich in Switzerland in 2015.
Okay, so I guess most of you are not so familiar with speech recognition. Today's first speaker actually gave a wonderful introduction, so you may already have some idea, but let me review a little bit. Suppose we have this utterance: "pen, pineapple, apple, pen." The goal of speech recognition is to recognize the sequence of words: pen, pineapple, apple, pen. To do that, we need models such as a "pen" recognizer, a "pineapple" recognizer, an "apple" recognizer. But the number of target words is huge, so it's not a good idea to model each word independently. Instead, we usually split words into subwords and prepare a dictionary that maps each sequence of subword units to a word. The commonly used subword unit is the phoneme; for example, "apple" consists of three phonemes, roughly /æ/, /p/, /l/. But the phoneme is motivated by linguistic knowledge, and it might not be the best unit for ASR. It is also very time-consuming, and requires a lot of expertise, to create such a dictionary from data: to prepare the labelled data, you have to listen to all of it first and then assign labels. Therefore, automatic creation of such subword units and the corresponding dictionary is clearly desirable.

So here the goal is to jointly learn the subword units and the corresponding dictionary from data, without any labels. We can actually find several works that try to achieve this goal, but none of them achieves better quality than a phoneme-based model. We propose two ideas: the first is semi-supervised deep neural network training, and the second is to estimate a reliable dictionary from multiple utterances using a K-Viterbi algorithm.

Let me briefly explain how the proposed method works. First we extract features from the waveform by applying a small window — any feature is possible, such as MFCCs or filter banks. After we extract a feature from every frame, we get an acoustic space, something like this. The initial subword units are obtained by clustering these frames, and then we model each cluster with a GMM — actually a single-component Gaussian. That gives the initial acoustic model and the initial subword units. From now on we call these subword units abstract acoustic elements, or AAEs.

Now we have initial subword units and an acoustic model, so the next step is to estimate the dictionary that maps a sequence of subword units to a word. Suppose we have a pronunciation of "apple" and the corresponding sequence of features is something like this; then the dictionary entry is something like A1, A3, A5. But if another person pronounces "apple", we might get a different sequence of subword units. In that case, what should the pronunciation be? We estimate it by jointly maximizing the likelihood over multiple utterances using the K-Viterbi algorithm — it's too technical, so I will skip the details, but the important thing is that we estimate the dictionary from multiple utterances in order to make it reliable. Now we have an acoustic model and a dictionary, so we can actually build an automatic speech recognition system. But the initial model is quite noisy and not reliable, since we do not use any labels, so we refine the acoustic model and the corresponding dictionary iteratively.
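Before the iterative refinement, here is a minimal sketch of the initial AAE step just described — frame-level features, clustering, and one Gaussian per cluster. It assumes librosa and scikit-learn; the feature choice, clustering method (k-means here), and number of units are illustrative, not the exact configuration used in the work.

```python
# Minimal sketch of the initial AAE step: frame-level features are clustered,
# and each cluster is modelled by a single Gaussian. Settings are illustrative.
import numpy as np
import librosa
from sklearn.cluster import KMeans

def initial_aae_model(wav_paths, n_units=64, sr=16000):
    # 1) Extract frame-level features (MFCCs here; filter banks also work).
    frames = []
    for path in wav_paths:
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
        frames.append(mfcc.T)                                # (n_frames, 13)
    X = np.concatenate(frames, axis=0)

    # 2) Cluster the acoustic space; each cluster is one abstract acoustic element.
    km = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(X)

    # 3) Model each AAE with a single Gaussian (mean and diagonal variance).
    model = []
    for k in range(n_units):
        Xk = X[km.labels_ == k]
        model.append((Xk.mean(axis=0), Xk.var(axis=0) + 1e-6))
    return km, model
```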
The refinement works like this: first we estimate the initial model as above, then we fix the acoustic model and learn the dictionary, then we fix the dictionary and re-estimate the acoustic model, again and again. By doing so, the model becomes more and more reliable. The important thing is that at each iteration we increase the number of GMM components to get more expressive modelling capability, and finally we replace the GMM with a DNN to improve it further. This enables us to model a very complex acoustic space.

We evaluated the proposed method on isolated word recognition. This table shows the word error rate — how often the system makes an error, so lower is better. The upper two systems are the conventional, phoneme-based methods, which require a lot of training data and expertise. The bottom two are the proposed, data-driven method, which uses no labelled data. As you can see, the proposed AAE approach actually outperforms the phoneme-based methods. We also investigated the number of AAEs — how many abstract elements we should have — and interestingly, we got the best result with 384 elements, which is far more than the number of phonemes. This indicates that the data-driven method can model the acoustic space more precisely than manually designed units.

So the conclusion for the first topic is that we proposed a method for acoustic modelling and word recognition without labels, which is well suited to low-resource languages where large datasets are not available. The more important point is that even without a large amount of data, we can still use deep neural networks by combining them with conventional methods.

Okay, so let's move on to the next topic, audio event recognition, which is about understanding the scene better. This topic has been getting more and more attention recently; for example, Facebook released a large dataset for this task this year. But when we did this work in 2015, there was a relatively limited amount of data, so we began by collecting data from the internet and built an audio event dataset of about 5,000 clips, 768 minutes in total. At that time it was one of the biggest datasets, and we made it publicly available for the community.

Using this dataset, we built an audio event recognition system. To get good accuracy, we propose three ideas: a large input field, a deep CNN, and data augmentation. Let me briefly show you these ideas. The first is the large input field. Conventional methods for audio event recognition typically apply a very small window, extract features from it, and then aggregate these features using bag-of-words, HMMs, or simple averaging to obtain a long-duration representation. But during this aggregation the temporal order of the frame-based features is lost, causing considerable information loss. So we propose to model the entire acoustic event, which lasts a few seconds, and feed it as a single input to the neural network. It sounds simple, but it is actually a bit difficult, since the dimensionality of the input becomes quite large, and it is not a good idea to model it with a standard fully connected network. Instead we use a deep CNN.
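Here is a minimal PyTorch sketch of the large-input-field idea: the whole event, a few seconds of log-mel spectrogram, is fed to a VGG-style stack of small convolutions as a single input. The layer counts, channel sizes, and number of classes are illustrative, not the actual architectures from the work.

```python
# Minimal PyTorch sketch of the "large input field" idea: the whole audio event
# (a few seconds of log-mel spectrogram) is fed to a VGG-style CNN as one input.
# Channel sizes, layer counts, and the number of classes are illustrative.
import torch
import torch.nn as nn

class EventCNN(nn.Module):
    def __init__(self, n_classes=28):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2))
        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, n_classes))

    def forward(self, x):          # x: (batch, 1, mel_bins, time_frames)
        return self.classifier(self.features(x))

# e.g. a 4-second event at ~10 ms hop with 64 mel bins -> input shape (1, 1, 64, 400)
logits = EventCNN()(torch.randn(1, 1, 64, 400))
```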
I'm not going to go into detail about convolutional networks, but I'll just mention that a CNN has a translation-invariance property, which suits our case: an audio event can happen at any time in the recording, so the model should be invariant to time shifts. In 2015, a specific type of convolutional network called VGG was proposed, by Simonyan I think, and it works very well for speech recognition and image recognition. We adapted that architecture to our problem and proposed two architectures, A and B, with seven and nine layers — significantly deeper than previous work.

Okay, so that is the architecture; the last idea is data augmentation. As you know, the success of deep learning is supported by large amounts of data, so it's better to have more data even if we already have some. We created new data from what we had by mixing two samples: if we mix two birds-singing sounds, the mixture is still a birds-singing sound, right? So we randomly mix two samples, varying the timing and amplitude, and create new training data. This helps prevent overfitting.

Okay, so this is the evaluation of the proposed method. We compare against six baselines: the first four use conventional audio features, and the other two use the proposed large input field but with a shallow network. The last two systems are the proposed method with all three ideas. The results validate all three ideas: data augmentation improves the performance of every method, the large input field is better than the conventional aggregation methods, and the deep network is better than the shallow one.

So now we have an audio event recognition system — what can we do with it? Of course we can search for videos that contain, say, birds singing or a dog barking, but we can actually do more, and I'm going to show you an example: video summarization. As video cameras become omnipresent, a vast number of videos are recorded very casually, without much thought. Such recordings are usually long and shaky, so it would be good to have an automatic highlight detection system, right? I believe such a system is an important step towards high-level scene understanding. Conventional methods for video summarization usually use only visual features, but audio is also very informative. Imagine, for example, detecting the highlight of a skateboarding video: the jumping scene is most probably the highlight, and you can expect a particular sound when the skater jumps. To capture such a sound, it is important to design an audio feature that is informative enough to characterize such a moment. So we built a new audio feature using the aforementioned convolutional neural network. We first train the audio event recognition system, as discussed. Then we freeze the network, extract the audio from the new video, feed it to the trained network, and use the activations of a hidden layer as the feature. A feature obtained this way is often called a deep feature, and it works very well for image tasks and audio tasks alike.
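As a rough sketch of the deep-feature idea, reusing the illustrative EventCNN from the earlier sketch: the trained network is frozen, and the activation of a hidden layer, pooled to a fixed length, is used as the audio feature. The pooling and layer choice here are assumptions for illustration.

```python
# Minimal sketch of deep-feature extraction: the event-recognition CNN is frozen
# and the activation of an intermediate layer is used as an audio feature.
# EventCNN is the illustrative network from the previous sketch.
import torch

def extract_deep_feature(model, log_mel):
    """log_mel: (1, 1, mel_bins, time_frames) tensor for one audio segment."""
    model.eval()
    with torch.no_grad():                       # frozen network, no training
        h = model.features(log_mel)             # hidden-layer activations
        # pool over frequency and time to get a fixed-length segment descriptor
        return h.mean(dim=(2, 3)).squeeze(0)    # shape: (channels,)
```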
So we extract these deep features from the audio and from the visual information, build a small network on top of them, and train it to predict how likely each moment is to be a highlight. This was actually the first fully neural-network-based highlight detection system, and it achieved state-of-the-art performance. But I think it's more interesting for you to see the demo rather than the figures, so I'm going to show a demo with this video. First I play back the original video, then the summarized video using the visual feature only, and finally the audio-visual one. [The demo video plays: the original recording, the visual-only summary, and the audio-visual summary.]

Let's analyze these scenes a bit. This is the highlight score for the first video: the green line is the visual-only version and the blue line is the audio-visual version. As you can see here, you can observe the footstep sound, and the proposed method successfully captures this sound and assigns a high score to that moment. With the visual feature only, it seems that when the man jumps, the colour of the background and of the man are quite similar and he is very far from the camera, so the visual-only model tends to pick the moments where the man is clearly visible instead. This is the highlight score for the second video: it successfully assigns a high score around the jumping moment, and you can observe the jump sound around here. It is also nice that the model does not simply rely on volume: for example, here the cameraman speaks, but the highlight score does not go up. So it really captures the interesting moments.

Okay, so the conclusion for the second part is that we proposed a convolutional neural network that can recognize audio events, and that it is useful for video analysis. I believe this is an important step towards high-level scene understanding.
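Here is a minimal sketch of the kind of small scoring network described above, sitting on top of frozen audio and visual deep features. The fusion scheme, feature dimensions, and layer sizes are my assumptions for illustration, not the actual system from the talk.

```python
# Minimal sketch of a highlight-scoring head on top of frozen deep features.
# Feature dimensions and layer sizes are illustrative placeholders.
import torch
import torch.nn as nn

class HighlightScorer(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))                           # one score per segment

    def forward(self, audio_feat, visual_feat):
        x = torch.cat([audio_feat, visual_feat], dim=-1)  # simple late fusion
        return torch.sigmoid(self.mlp(x)).squeeze(-1)     # highlight likelihood in [0, 1]

# e.g. score 10 one-second segments of a video
scores = HighlightScorer()(torch.randn(10, 128), torch.randn(10, 512))
```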
Okay, so finally let's discuss something that goes beyond human hearing: audio source separation. What is audio source separation? The sound that surrounds us is actually a mixture of many different sounds; for instance, music is a mixture of many different instrument sounds. Audio source separation aims to recover each music track, or instrument track, from the mixture. It's very useful for many applications — for instance, video editing. In an image we can extract, say, this dog by using Photoshop, like this; if you want to do the same thing for video, you might also want to extract the dog's barking sound from the soundtrack, and for that we need source separation. Or if you want to walk around inside live content — say, stand in the middle of the band — then it's more natural for each sound to come from its visual position, and if the multitrack sources are not available, we need to separate them from the mixture. More familiarly, it can be used for karaoke or music education.

So how can we build a source separation system? Source separation is actually a very difficult problem: mixing coffee and milk is very easy, but separating them again is very difficult. We use a deep neural network to do the separation. First we prepare mixtures and the corresponding isolated instrument tracks, feed the mixture to the network, and train it to minimize the squared error between the target source and the network's estimate. It sounds simple, but it is actually very difficult, and our initial trials did not work very well. So we developed a special type of deep neural network for audio, which combines four ideas; we call it MMDenseLSTM.

The first idea is to use DenseNet, a convolutional network with skip connections to all succeeding layers. The next is multi-band modelling: it's important to model the local fine structure and the global structure at the same time, so we place multi-scale modules — one module at high resolution to model the fine structure and another to model the global structure — and combine their outputs to model the entire spectrogram efficiently. The next idea is band splitting. Unlike an image, a spectrogram — this is a spectrogram — shows different patterns in the low and high frequencies: in the low frequencies you can observe high energy and long tones, while the high frequencies are noisier and carry less energy. So it's not a good idea to use the same model for both; we split the bands and place band-dedicated modules to model the different frequency patterns. The final idea is to combine the convolutional network with a recurrent network to get more expressive modelling capability.

We evaluated the proposed method by participating in a source separation contest, SiSEC, which is one of the biggest evaluation campaigns for music source separation. The task is to separate music into four components: vocals, drums, bass, and others. This is the result of the last competition. The vertical axis is the signal-to-distortion ratio (SDR), a metric that measures how good the separation quality is, so higher is better, and the horizontal axis lists the participants, such as New York University, FHG, and Spotify. As you can see, Sony got the best result. The yellow bar on the right is a kind of theoretical upper bound, which uses oracle data that is not available in a real use case. We have actually entered this competition three times and got the best result three times in a row.

Okay, so now I'm going to show a demo of our source separation system. First I play back the mixture, and then the separated sounds are played back one after another. [The separation demo plays: the mixture, followed by the separated vocal and instrument tracks.]
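To make the band-splitting idea from the architecture discussion a bit more concrete, here is a minimal PyTorch sketch: the spectrogram is split into a low band and a high band, each handled by its own module, alongside a full-band path, and the outputs are combined. The plain convolutions, the split point, and the sizes are placeholders for the actual dense CNN and LSTM blocks of MMDenseLSTM.

```python
# Minimal sketch of band splitting: dedicated modules for the low and high bands
# plus a full-band path, combined at the end. Real MMDenseLSTM blocks are dense
# CNN + LSTM modules; plain convolutions are used here only for brevity.
import torch
import torch.nn as nn

class BandSplitNet(nn.Module):
    def __init__(self, split_bin=256, ch=32):
        super().__init__()
        self.split_bin = split_bin
        def conv():
            return nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 1, 3, padding=1))
        self.low, self.high, self.full = conv(), conv(), conv()

    def forward(self, spec):                              # spec: (batch, 1, freq, time)
        low = self.low(spec[:, :, :self.split_bin])       # low-frequency band
        high = self.high(spec[:, :, self.split_bin:])     # high-frequency band
        banded = torch.cat([low, high], dim=2)            # re-assemble the bands
        return torch.relu(banded + self.full(spec))       # combine with full-band path

estimated_source = BandSplitNet()(torch.randn(1, 1, 1024, 128))
```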
Let me also mention how Sony can contribute to the world using auditory intelligence. If you look at our products, most of them have audio input and output. With auditory intelligence we can provide easier ways to control our products, create new content, and engage customers with new experiences. I believe most auditory tasks will eventually be done by machines if we have good datasets — data is one of the key elements. Someone said that the company that has the data will win. Of course there is a lot of data out there — personal data, YouTube, and so on — but we also have professional data like music, pictures, movies, and TV programs. By collaborating with those partners, I believe we can contribute to the world in ways that differ from other companies.

And this is my final slide: I'd like to think about the future of auditory intelligence, or AI. Many species went extinct when they could not adapt to changes in the environment, like the appearance of a superior enemy. So we need to evolve ourselves to adapt to the changes that are happening, or may happen, in the future, and we do so by developing technology — currently, we are developing AI. But technology can also destroy the environment, and it sometimes actually makes our lives worse. The same thing could happen with AI: at some point AI might reach the technological singularity, meaning it surpasses human intelligence, and I don't know what decisions AI would make to keep the world sustainable — maybe it decides that reducing the population is the best way, just like that. I don't worry too much about this scenario, but I'd like to mention that all of us working on machine learning and AI are responsible for the future of AI. I'm currently working on research into auditory intelligence, but I'm always thinking about how I can contribute to making the world better, and to filling it with fun and joy. So I'm very happy to share my works and to have a chance to discuss the future of AI at this wonderful conference. Thank you very much.