So this is my turn to present the project we're doing within the María de Maeztu program, with a few of the people of my lab. The title, like most of the titles of the projects, is a very preliminary one: Machine Learning Approaches for Structuring Large Sound and Music Collections. I will first motivate the project a little bit. As the title says, it relates to sound and music collections and software, so I will introduce the collections and datasets that we are developing and using, and I will also briefly present some of the software that we are developing and using. Then I will talk about the specific topics that we are working on, which are genre classification and auto-tagging, and, more recently, using deep learning for some aspects of this automatic labeling. And I will end with some conclusions.

Okay, so about motivation. I think the title makes it clear: there is a need to facilitate the annotation of large audio and music collections so that we can facilitate access to audio content. There is a lot of audio and music out there, but the biggest issue is the lack of proper structuring of it. With the concept of structuring we can, of course, go into quite complex aspects and get to an ontological type of organization of these collections. We have some other projects that work in that direction. Here the aim is not to go that far. We just want to have good labels on the audio content so that we can build systems to find it. The labels that we want to add automatically, if they are not added manually, are things like this: what genre of music it is, what instrument is being played, what the musical key is, what the mood is, et cetera. There are a number of known semantic categories that are useful for facilitating the discovery of the content.
The current known problems that everyone in the community agrees on are these. The first one is the lack of large audio collections for training and testing. We could say that even the big companies in music distribution, like Spotify or Pandora, still have the same problem. They have large audio collections, but the metadata they have is not sufficient for developing a number of the services they want. They may have the editorial metadata, like who the artist is or who the musician playing is, but they will not have some of this other metadata that is also very relevant for facilitating discovery, recommendation, or some of the other services they offer. So this is clearly a problem for research, but also in the commercial world.

The other issue is the features that we start from. We are talking about audio and music content, and that is a research task that has been going on for a long time. There has been quite a lot of progress in getting features automatically from the audio content, or from the audio content plus other types of information we might have, in order to have good features on which we can then do some machine learning or some processing to get these semantic concepts. But really we are still far from having adequate and robust audio features for many of the use cases, or for much of the content that we want to label.

And finally there is the problem of scalability. We really want to work with large data collections, and that also means a large number of classes. If you look at the state of the art in MIR, when they do classification tasks the number of classes they use is really small compared with the real world. Typically, if you want to label an instrument or a genre, they work with, let's say, 10 to 20 classes, with datasets in the hundreds or at most a few thousand instances, and for each class they have very few.
And that's not the reality. If you go to Spotify, or to any type of large collection, we are talking about thousands of genres that are relevant, or hundreds of different types of musical instruments, et cetera, and current techniques are definitely far from being scalable to this situation. So the project is basically to advance on these three types of problems. There may be some other problems, but we are really focusing on these three.

About the team that is working on this: within the Music Technology Group, my lab is the Audio Signal Processing Lab, and we work on these and related problems. The people most directly related to this particular project are, of course, myself and Jordi Pons, who is a first-year PhD student and is in fact the only one funded by María de Maeztu to work on this. But clearly the project is bigger than that. There is Frederic Font, who did his PhD related to Freesound and is working in that area; Dmitry, who is working on the feature extraction and on the Essentia library; and Alastair, who is working on the AcousticBrainz framework that I will also mention. This project is very much interlinked with a number of the initiatives, projects, and other types of funding that we have, because it requires quite a lot of complementary efforts.

So let me give you some indication of these collections that we have been working on and that we use as our research platform to do a lot of work. Freesound is the first one, and maybe our oldest initiative in this area. We started Freesound close to ten years ago; no, more than ten years ago, we celebrated ten years in April. We started it to address the problem of having access to audio content that we could use for research and that could also be used for other applications, mainly artistic ones. Many musicians want to access audio for their creations, and all the copyright issues that are around the
music make it really, really hard for anyone to reuse any content. That was the beginning of the Creative Commons type of movement, so it was the right time to start an initiative to gather user-generated content at the audio level. Not at the music level, because that is still a complete nightmare, but at the level of sound snippets, which is material very relevant for a lot of tasks and creative applications. There was no big initiative, so we started that, and it has been really successful. In these ten years Freesound has been continually growing. There are millions of users, maybe more than four million registered users right now, and there are more than 300,000 audio samples. But above all there is a good community; I would say quite a few of them are experts who record sounds, put the sounds there, and label them correctly. A big part of the effort has been to get as good information as possible to start from, because otherwise all the machine learning and all the analysis will not do miracles. So let's try to get as much quality as possible from the community and from a crowdsourced type of approach, and then we can try to build on top of that to make it better. We force users to put adequate tags, and there is some moderation that controls that. It is moderated by people from the community, and they are really good at making sure that the content is as good as possible.

So by now I think it's an excellent resource, both for research and for practical applications, artistic applications, musical applications. In fact we have Freesound Labs, which is like a forum that we developed, in which we keep track of what we know about what Freesound has been used for. For example, there are all the articles that have used this corpus for some research, there are all the different educational initiatives that have been
done using it, and some apps or even commercial applications that have been built on top of it. So that's a good resource that we maintain from here, and I think that with not that many resources we have been able to become a very relevant infrastructure that serves a lot of people. But there is much more to do, at all levels, and especially, as in this project, to improve the structuring. If you search for sounds, the truth is that you will get access to a tiny percentage of the sounds that could be relevant, because the tags may not be right or people have not added the right labels. There are a lot of issues, and we have had a number of PhD theses addressing different aspects of these limitations, developing automatic tools to facilitate the accessibility of the sounds: doing content analysis and feature analysis, automatic search based on audio content, and tag analysis and tag recommendation to facilitate tagging. There have been quite a number of research projects based on that.

So this is a database of audio snippets, and to be very clear, there may be some music fragments, but there are no compositions. Compositions have a completely different set of rules, a different set of concepts that are required to organize them. And to have a repository or a corpus of music accessible to the research community, that's impossible. There is no way anyone would ever be able to compile an open repository of music for doing research, simply because it's all copyrighted material, and the music labels have not moved an inch since the times of the standard copyright legislation. So it's really, really difficult even to share it on a one-to-one basis. I think it's one of the fields in which research is the hardest because of copyright issues. Traditionally there has been a lot of money made by labels, by record labels, so
it's understandable that they don't want to give up that privilege and the business models that they have. So what has been our alternative? Our alternative has been AcousticBrainz, which I think is a kind of smart way to get around this issue. We don't have the audio, because it's private: it's owned by the labels, and people have private collections of music. So we don't have the users' audio, but we have the analysis of that audio. If you talk with the labels they would say that is illegal, but there has not been a single legal case preventing us from doing it. No one has won a case saying that having audio features of an audio signal is illegal. In fact, if you ask the labels, they would say that even having the metadata of a CD is illegal, because they own the names on the CD cover, but again they haven't been able to enforce that in any case that I know of.

So the current solution is that we have the analysis of audio recordings in a way that is adequate for carrying out a lot of research. Most of the current research in information retrieval starts from audio features extracted from the audio plus information about the recordings, the editorial metadata, and does something with that. There is a big initiative that we have been collaborating with for some years, called MusicBrainz, which is basically an encyclopedia of music metadata. It has millions and millions of tracks, with the information about the artists, all the albums, et cetera; it has been crowdsourced, and there is a very big and active community of people maintaining it. We have complemented that with the audio analysis material, and we did it in the same way. We just give out the source code. It's open source: it's the library that we have been developing, Essentia. It's available on the AcousticBrainz website, and people can download it and compile it themselves or use existing binaries, and people
analyze their personal music collections and submit the analysis files, JSON files with all the analysis data that we have decided is relevant for research. Of course, many researchers are working precisely on what types of features to use, so they may not be able to use this, but a big percentage of research would be fine just starting from these audio features, especially because it's all open source, so they can see exactly how we compute them. It's completely traceable. Amazingly enough, a lot of people have been doing that for us, analyzing their personal music collections and uploading the results, so there are close to 4 million music tracks on our server. And it's not just the audio analysis but all the metadata, because the important thing is that the analysis file carries all the metadata that comes from MusicBrainz, with its ID system, the MusicBrainz IDs. For example, here on top is some of the data associated with a given track: it has the MusicBrainz IDs, well, you don't see that, and then we have all the analysis data. We can even link it to a YouTube video. The matching is not at the ID level, it's based on the names, but most of the time you can actually listen to the song from YouTube. So people upload all this information, it's analyzed in the platform, and then people can use all these analysis data.

The main reason for creating AcousticBrainz, though, was not this one; it was to be able to create adequate datasets. As I said, one of the biggest bottlenecks in our field is the lack of well-labeled datasets on which to do learning. That means datasets that have labels for the semantic concepts that we want to train on, be it genre, instruments, et cetera. For now, it's working, but we haven't yet started creating any challenges and
involving the community, though there are a number of people working on that. So we can create datasets with user-generated labels that can be used for training our classifiers, and I think this is going to be a very good way to improve the current state of the art in a number of these things. As I said, AcousticBrainz is huge, but it doesn't have some, or most, of the labels that would be needed for a number of the training tasks we want to do. That's why, on top of it, we are starting this idea of labeling.

Okay, let me go a little bit quicker. Essentia is a library that we have developed over a number of years, with the idea of collecting algorithms that have resulted from a lot of the research we do and making them open source so that people can use them. By now I think it has become a standard software library for people working in this field. It's quite robust, and it has been used both commercially and for research for many things; on the website you can see the applications and uses it has had. So it's an excellent tool on which to build and test many things, and we are definitely using it. It's very active on GitHub. Dmitry is the main software developer, maintaining it and getting a lot of feedback and comments from the community on how to improve things. So this is the kind of tool that we are building, and within María de Maeztu we will be emphasizing this even more, making sure that it is used not just by us but by other people.

Then, specifically about what you can do with all this, I mentioned two tasks. One is genre classification. Genre classification by itself I don't think is that relevant, because if you talk with different people, the concept of genre is ill-defined, and people would not agree on whether some specific genre of techno music is one or the other. There is a very cultural and social aspect to it. But anyway, it's a fundamental concept in which
even though there may not be an ontology or a clear taxonomy of genres, we have a level of classification that allows us to do a number of things below it. Once you know the genre, you can develop tools to analyze aspects that are specific to that genre. That has been one of our major areas of research: to target specific music cultures and work on specific analysis for them. Within AcousticBrainz the idea is that you could do this automatically, and once you develop datasets you can do genre classification. The question is how to improve it. You can't see it here, but if you look at the success of these classification tasks, the accuracies are quite low. With any current state-of-the-art genre classifier, especially if it scales to this number of genres, we get very low precision and recall, so we need to get better at that.

Another task, which is much freer and quite useful for applying a number of techniques, is tagging. This is an example from Freesound. People put tags on sounds, as with images or anything like that, and there are millions of tags. The goal, of course, is to tag things automatically, given that users normally put a few tags but miss many, many others. I'm sure that for any single concept, of all the sounds that could be retrieved with that tag you only get a few, because they have not been labeled that well. So the idea is how to automatically tag sounds with the existing tags or with some others. You can do some learning on the sounds that have a tag and then propagate it, tag propagation, to other sounds. We have been doing this with a number of techniques, typically with a signal processing approach for the feature analysis and a number of machine learning approaches on top. And now, with Jordi Pons, who is the new PhD student working on María de Maeztu, the idea is to start exploring deep learning architectures.
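A minimal sketch of the tag propagation idea mentioned above: learn from the sounds that already carry tags and suggest tags for a sound that lacks them. The talk does not say which technique is used, so this is only an assumption for illustration: a nearest-neighbour vote over feature vectors. The function names, parameters, and the toy three-dimensional "features" are all hypothetical, and plain Python is used to keep the example dependency-free.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def propagate_tags(query, labeled, k=3, min_votes=2):
    """Suggest tags for `query` from its k most similar labeled sounds.

    `labeled` is a list of (feature_vector, set_of_tags) pairs.
    A tag is suggested when at least `min_votes` neighbours carry it.
    """
    neighbours = sorted(
        labeled, key=lambda item: cosine(query, item[0]), reverse=True
    )[:k]
    votes = Counter(tag for _, tags in neighbours for tag in tags)
    return sorted(tag for tag, count in votes.items() if count >= min_votes)


# Toy, made-up "audio features" and tags (hypothetical values).
labeled_sounds = [
    ([0.90, 0.10, 0.00], {"dog", "bark"}),
    ([0.80, 0.20, 0.10], {"dog", "animal"}),
    ([0.85, 0.15, 0.05], {"dog", "bark", "field-recording"}),
    ([0.00, 0.10, 0.90], {"rain", "field-recording"}),
    ([0.10, 0.00, 0.80], {"rain", "water"}),
]

suggested = propagate_tags([0.88, 0.12, 0.02], labeled_sounds)
print(suggested)  # ['bark', 'dog']
```

The `min_votes` threshold trades recall for precision: lowering it to 1 would also suggest "animal" and "field-recording" in this toy case.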
This is the main work that will happen for us. Of course, the biggest problem is that in deep learning the success stories come from image, speech, and some other fields. In music there have been quite a number of attempts to apply deep learning, but until now I think there are no really successful examples of classifying large data collections. So the idea is to work on that. We have been trying different approaches, especially convolutional neural networks, but the main thing is to try to understand what's going on. The first approach with deep neural networks is to use the idea of an image: you consider that a sound is an image, a spectrogram, and basically you do image analysis on it, because that's where the state of the art comes from. But clearly that's not adequate. In music you have to be able to select the features that you start from, taking advantage of the knowledge that we have about music. So our initial work has been about this: in music a square is definitely not adequate. Time-frequency resolution is something that you have to play around with; it's not a single compromise that you can just fix. There are aspects of music that are more time dependent and aspects that are more frequency dependent, so you have to at least have some feature selection, some initial preprocessing, that tries to emphasize some of these specific perceptual or musical concepts. Especially given that we do not have huge datasets, there is no way you can let a deep neural network learn everything, because we don't have enough data for that. So we need to add a little bit more knowledge at the beginning. Anyway, this got the best paper award at the workshop that happened last week, CBMI. The idea was not to just blindly do deep learning, but to try to understand the architectures and see if we can tune and develop architectures that are musically motivated, and that was good.
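One way to picture the point about time-dependent and frequency-dependent aspects is to compare convolution filters of different shapes on a spectrogram. The sketch below assumes that "a square" refers to square image-style filters and that the alternative is filters elongated along time or along frequency; the transcript does not spell this out, so treat it as an interpretation. The data, filter values, and function name are made up for illustration, and plain Python is used instead of a deep learning framework.

```python
def correlate2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation, the operation a CNN layer applies."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(iw - kw + 1)
        ]
        for r in range(ih - kh + 1)
    ]


# Toy "spectrogram": rows are frequency bins, columns are time frames.
# A sound starts at frame 2 and has energy in bins 1 and 3 (made-up data).
spec = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]

# Temporal filter (1 bin x 2 frames): responds to changes over time (onsets).
temporal = [[-1, 1]]
# Frequency filter (3 bins x 1 frame): responds to a "bin, gap, bin" shape.
spectral = [[1], [-1], [1]]

onsets = correlate2d_valid(spec, temporal)  # 4 rows x 5 columns
shapes = correlate2d_valid(spec, spectral)  # 2 rows x 6 columns

print(onsets[1])  # [0, 1, 0, 0, 0]: fires only where the sound starts
print(shapes[1])  # [0, 0, 2, 2, 2, 2]: fires wherever the spectral shape is present
```

The temporal filter localizes the onset in time regardless of which bin it is in, while the frequency filter detects the spectral pattern regardless of when it happens; a square filter would mix both behaviours in a single fixed compromise.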
Just to conclude: in terms of goals, the project is about creating and exploiting large sound and music collections, as you have seen. The idea is to develop task-specific audio features for the particular problems that we are focusing on and, in the case of deep learning, to explore the specific architectures that are best suited for music. And for semantic tagging in particular, this idea of a generic tagging system using deep learning networks can be a quite interesting area to explore. And that's all, so thank you very much. I guess I'm just right on time, but if there is a quick question... otherwise we can just move on. Any question, comment? Did I miss anything? Here, go ahead.

You'd like to use the deep neural network to extract features, to do something like feature extraction, in order to make the network learn?

Yeah, that's the idea. Maybe Jordi can answer that.

Yes. So part of the deep learning game is actually to have some input data and to let the network learn from there, so it would actually learn features from data. Then it's interesting to try these kinds of architectures where you can steer the network towards learning something that has meaning in a musical sense. We try these things in order to improve our results, but also to try to understand what's going on there, and this is a bit the story of it. I don't know if I answered.

So that's an interesting question. People want to use end-to-end learning, meaning that music is essentially audio, and therefore you have the waveform there, and it would be great to use the waveform directly. But so far this doesn't work, so people use spectrograms and try to learn things from there.

Okay, thank you. So I guess we should stop here and ask Leo to continue. I'll be here.
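As an illustration of the waveform versus spectrogram distinction raised in the last answer, here is a minimal sketch of turning a waveform into a magnitude spectrogram with a short-time Fourier transform. It is not taken from the talk: the frame size, hop size, Hann window, and synthetic test tone are assumptions chosen for the example, and the transform is written as a naive DFT in plain Python for clarity (a real system would use an FFT from a numerical library or a toolkit such as Essentia).

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_size=64, hop=32):
    """Return one list of frame_size // 2 + 1 magnitudes per analysis frame."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
        for n in range(frame_size)
    ]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        bins = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT of the windowed frame at frequency bin k.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


sample_rate = 8000  # Hz, assumed for the synthetic example
# Synthetic waveform: a 1000 Hz sine tone, 256 samples long.
wave = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]

spec = magnitude_spectrogram(wave)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(spec), len(spec[0]), peak_hz)
```

The waveform is a flat list of 256 samples, while the spectrogram is a small time-frequency grid in which the tone shows up as a single strong bin, which is the kind of representation the networks discussed above take as input.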