I'm very happy to give this talk. I know you must be tired because it's the last one, but I hope you will enjoy it. While preparing the slides with my colleagues I realized that our title was not very good and we had to change it, so we have a new title, which is much longer, and I will try to explain what we mean by it and what we would like to do in this project. The project officially started in January, so we are still at the beginning; I will just introduce the idea we have, the things we want to do, and some first progress.

Our motivation is music. There is a lot of information about music around the web. On Spotify alone, or on similar platforms, you find millions of songs, and these are usually commercial recordings: recorded in a studio, perhaps mastered afterwards, so they are good-quality, nice recordings. In addition to that, you find a lot of user-generated videos or recordings that someone has taken with a mobile phone or a camera, and these have very different quality and characteristics. Moreover, music is not only recordings, audio or video: we also find textual descriptions of music; images, for instance album covers, which represent the music and its style to some extent; and musical scores, the symbolic representation of music that musicians and composers use to create and interpret it.

So our motivation is to deal with these large collections of music material. As a particular example, in a European project that finished this year we dealt with symphonic music. As you may know, symphonic pieces are very long, around one hour, and they are usually recorded from different perspectives. You can see in this figure that, for this recording of Beethoven's Symphony No. 3, we had something like 17 audio tracks, six to eight video cameras from different perspectives, Kinect recordings, scores, and some descriptors we compute automatically from the audio or the video. So we have a huge amount of information: in this case it was 15 gigabytes, but depending on the quality it can be up to one terabyte for one piece performed at a particular concert. We therefore created a multimodal database to store all this information, to access and visualize it, and to obtain meaningful descriptions of it.

In our department we have been working on methods that describe separately the audio part, the auditory information we can get from music, and the visual part. I myself have been working on analyzing music material, mostly scores and audio signals, and my colleague Gloria Haro, who is sitting there, has been working on extracting information from images and from video recordings. In this project we are joining forces, because as humans we don't use only our ears or only our eyes: especially when watching videos, we perceive music in a complementary, combined auditory and visual domain. This is what we want to explore in the project.

Our team is now composed of four people. The newest one is Olga Slizovskaia, who is sitting there and is doing her PhD on this project, and Gloria and I are trying to contribute our algorithms and expertise.
For instance, Gloria has co-developed, in the image processing group, algorithms for optical flow estimation, and I am trying to contribute the descriptors I have developed or co-developed for extracting melody, spectral content and harmony from music signals. We also have a postdoc, Julián Urbano, who specializes in the evaluation of information retrieval methods. We found this expertise is needed in order to decide which datasets we should use and how to measure their reliability, that is, whether they represent real-world problems or not. And we are also receiving advice from other people here in the department.

Now I will explain the title of the project, which is very long. First, we want to work with music, but in particular we decided to focus on a very challenging kind of material: user-generated music performance videos. I will illustrate this with an example. Whoever here has a child, a friend or a partner who plays music knows that at this time of the year there are all these final concerts, you know, the choir and so on, and people go to those concerts and record videos, similar to this one. You see this is your child, and you don't want to focus on the other ones, you want to focus on your child. So people try to record, but they are nervous because it's their child, they are so excited; they move the camera, and they talk, "wow, it's so wonderful". And there are also the grandmother and the aunts filming at the same time, and if there is an orchestra, all the parents are filming, and then they have a WhatsApp group, they share all the videos, and you end up with ten or eleven videos of the same performance. So this is the material we are dealing with.

Second, we want to develop methods that are content-based. Of course you can tag a video: you can say it's a cello, the piece is from a Suzuki book, number I don't know which. But you never tag in a very exact way; you always use global tags. If the video is very long, the tags don't tell you when your child is playing or where in the scene your child is. So we want to extract information from the content, the video itself (this is an illustration of different aspects of music perception), and of course we also use the contextual information we can obtain, for instance, from YouTube tags and so on.

What do we mean by semantic descriptors? We don't want to provide very low-level descriptors, such as "the spectral centroid is in this frequency range" or "the optical flow goes from here to here"; you can see below a sketch of what such a low-level descriptor looks like. We want textual descriptors that are richer than tags, so we want to work on semantically meaningful descriptors and address the so-called semantic gap.

And finally, we want to rely on methods that allow trustworthy content-based semantic description, because we realized we cannot trust our performance estimates, for instance precision, recall or F-measure, if we don't know whether our dataset is reliable, good enough and representative of the problem. So we also want to investigate how to build better, more representative datasets; maybe not a lot of data is needed, but rather data that is well sampled and well distributed over our problem. So this is our goal, and we want to go beyond the state of the art.
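To make concrete what we mean by a low-level descriptor, here is a minimal sketch (my illustration, not a component of the project) that computes the spectral centroid of a recording with librosa; the file name is a placeholder.

```python
# Minimal sketch of a low-level audio descriptor: the spectral centroid,
# i.e. the "centre of mass" of the spectrum, computed frame by frame.
# "performance.wav" is a hypothetical file, not from the project's data.
import librosa

# sr=None keeps the file's native sample rate.
y, sr = librosa.load("performance.wav", sr=None)

# Shape (1, n_frames): one centroid value in Hz per analysis frame.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

print("mean spectral centroid: %.1f Hz" % centroid.mean())
```

A number like this is a useful building block, but it says nothing a listener would recognize as a description of the music, which is exactly the semantic gap mentioned above.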
For instance, the state of the art in music information retrieval mostly focuses on either the audio file or the album image; we want to work on combining them. The state of the art works on studio recordings of professional performances, very good quality material; we want to work with these user-generated videos. We would also like to address the limitations of current systems, which combine low-level features with classifiers, and to understand the limitations and challenges of these classification methods. We want to build datasets that are multimodal, not single-modality. And finally, we want to go from validity, reliability and efficiency to understanding dataset reliability both while you are building a dataset and after you have built it. So these are the things we want to do in this project.

Let me illustrate with the first tasks we addressed, because there are of course many problems related to this scenario. For instance, quality assessment: how can we measure quality in the visual domain and the audio domain, but also the musical quality of the performance? How can we synchronize different videos of the same performance? How can we segment long videos and provide meaningful descriptions of one- or two-hour videos? And finally, how can we build mashups by combining, from the different takes, the segments with the best quality?

For the moment we have been working on musical instrument recognition. This is a very well-known task in the literature; there has been a MIREX competition running for four years, audio-only. We wanted to see how this task performs on this particular material, because people may say that you can recognize an instrument in an audio signal with very good accuracy if the instrument is playing alone, what we call monophonic; so people think it's a well-known and more or less solved problem.

Here is an example. Can someone identify the instrument here? It's a bass, but some people say maybe it's a cello; I don't know, it depends on the size, and sometimes on the bow, but here you don't see a bow. Also, when the instrument is in this position, people may think it's a guitar, because that is how guitars are played. And in this video there is no tag for this instrument, only for this other one. So this is what we want to identify: that a violin is being played, that there is a bass that is not being played, when the instrument starts playing, and where it is in the scene.

Of course there are many datasets for this, but all of them are in the audio or the image domain. For instance, we selected from the ImageNet dataset a subset covering different instruments, with the numbers of images shown here, and here you have an illustration of the kind of images we have. And we collected audio datasets that are well known in the literature, with labelled instrument samples, where you can see the heterogeneity in the number of classes and the number of instances per class. It would be nice if publications provided this, because you always have to rebuild this table of how many classes, how many instruments, which features, and then it is very difficult to compare results because they all run on different material. And the first thing we found missing was a video dataset.
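As the next part explains, our first step was to crawl YouTube ourselves. Here is a rough sketch of what such crawling can look like (my illustration, assuming the public YouTube Data API v3; the API key and the query string are placeholders, and this is not the project's actual crawler):

```python
# Minimal sketch of querying the YouTube Data API v3 for candidate
# user-generated performance videos. YOUR_API_KEY is a placeholder.
import requests

API_KEY = "YOUR_API_KEY"  # hypothetical credential

resp = requests.get(
    "https://www.googleapis.com/youtube/v3/search",
    params={
        "part": "snippet",
        "q": "violin recital",  # instrument keyword used as a weak, global tag
        "type": "video",
        "maxResults": 25,
        "key": API_KEY,
    },
)
resp.raise_for_status()

# Each hit carries only video-level metadata: nothing tells us *when* the
# instrument is playing or *where* it appears in the frame.
for item in resp.json()["items"]:
    print(item["id"]["videoId"], "-", item["snippet"]["title"])
```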
So we used the YouTube API to crawl user-generated videos, and this was also our first contribution in this project. Of course the videos are tagged, but only with global tags, such as "there is an instrument"; you have no information about where the instrument is or when it is playing. So we developed an online annotation tool, building on the department's experience with collaborative annotation tools. It connects with YouTube, or you can upload your own video, and then you can annotate, for instance, the region where the instrument is. We integrated optical flow algorithms to track the object through the scene, and also audio processing algorithms to detect when the instrument is playing or not. You may see two instruments there, and we can focus on one of them or on both. This is the kind of annotation we are now collecting.

In addition to that, we trained our models on images only and on audio only, and then applied them to video recordings, and for the moment we were very unsuccessful. If you train with images and test with images, you get very good accuracy; we could even improve on state-of-the-art methods. If you train with audio material and test with audio material, you also get very good accuracy, even across different datasets, with cross-validation and everything. But if you apply these models to this material, as in this illustration of four different videos, with the estimate of the image recognition engine at the top left and the audio estimate at the bottom right, you see a lot of errors and confusions between instruments, because of course this material was not used for training. One option is to say "let's collect more and more training data". Another is to understand how much data we actually need, because it is not necessarily true that the biggest dataset gives the highest accuracy.

So we are also researching the reliability of datasets. There are statistical methods that model and simulate datasets, which allow you to check, for instance, how representative your performance measures remain if you reduce the dataset, or how you should sample it. Besides these statistical measures there are also measures related to the content: if you have only piano music, or only one particular player, you will of course overfit your model. So we are developing a package to estimate the reliability of datasets (a simplified illustration follows below); this is mainly the work of Julián Urbano, and we are already applying it to existing datasets in the music information retrieval area and in our own research.
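The actual reliability package is Julián Urbano's work and goes well beyond this, but a minimal bootstrap sketch with toy placeholder data shows the basic idea of checking how stable a performance estimate is:

```python
# Minimal sketch: bootstrap the test set to see how stable an accuracy
# estimate is. The per-item scores below are toy placeholders, not real data.
import numpy as np

rng = np.random.default_rng(0)

# 1.0 = item classified correctly, 0.0 = error (simulated here).
correct = (rng.random(200) < 0.8).astype(float)

# Resample the test set with replacement many times, recomputing accuracy.
boot = [rng.choice(correct, size=correct.size, replace=True).mean()
        for _ in range(1000)]

lo, hi = np.percentile(boot, [2.5, 97.5])
print("accuracy %.3f, 95%% bootstrap interval [%.3f, %.3f]"
      % (correct.mean(), lo, hi))
```

Rerunning this with smaller resample sizes shows how quickly the interval widens as the dataset shrinks, which is one way to ask whether a smaller but well-sampled dataset would suffice.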
Of course there are many people doing work similar to ours, and within the María de Maeztu program there is the opportunity to collaborate with them. For instance, we have been collaborating with the Delft University of Technology in the PHENICX project. They have a system that tries to detect faces and then identify whether someone is playing or not playing, because it turns out to be easier to look at the player's face than at the instrument: there are occlusions and many other challenges. So we are collaborating with them to combine complementary approaches, and also on tools for video annotation, integrating some of the things they have. We also collaborate with researchers who do specific work on one specific instrument; this is an example of candombe, a percussion tradition, where they try to detect when each instrument is playing, and they have a specific database we are also using to train our models. And we are collaborating with people from NYU and Spotify to build more meaningful music information retrieval evaluation methodologies.

This is more or less the ongoing work. We started with instrument identification because we also want to contribute to the MIREX evaluation initiative in the ISMIR community. We would like to address other tasks that exploit the temporal aspects of videos more, for instance synchronization, since our current approach is not very complex in that respect, and we would also like to provide textual descriptions, maybe using some natural language generation, to present our descriptors in a user-friendly way. And this is mostly everything I wanted to mention; I think it took 20 minutes. So, if you have any questions or comments...

Q: Hello, thank you for the talk. I have two questions. The first is related to the cocktail party problem, which I think is similar to identifying different instruments in a scene: can you accurately identify the different instruments?

A: Yes. As I said, there are state-of-the-art models, some of which we have developed, where you train on isolated instruments and can then recognize, in mixtures, at least the most predominant instrument. In image processing, I know there is also object segmentation, it's not my area, but there is no single approach that combines both things in the same algorithm. Of course it depends on the data you have: if you have a very clean recording and you know the style more or less, you can at least predict which instruments you will find, and it's much easier if you use contextual information. So it depends, but these approaches are always based on training, so they are also biased towards the data you have for training.

Q: Is it supervised learning?

A: Yes.

Q: And is the ground truth provided?

A: There are several datasets, the ones I mentioned. For instance, at the Music Technology Group we published one on polyphonic material where the soloist instruments are labelled, so you can have ground truth for the instruments. There is another one called RWC, which even includes the transcription of the score.

Q: My second question is regarding the semantics: can you identify different types of players? Can you say that players of this instrument have a similar style of playing?
A: Yes, that is the ultimate goal: to compare quality, and to compare similarity in the way people perform, not only in the audio but also in the way they behave in the visual scene. We do have some methods for comparing performances: you can compare the tempo, you can compare the dynamics of the piece, you can compare articulations; there are many methods in the literature and different approaches for comparing performances.

Q: Thank you, Emilia, for the talk. I wonder about the extra information that comes with YouTube videos, like the comments or the title of the piece that has been played: are you going to exploit all this?

A: Yes, we wish to, but for the moment we only take the instrument tag from the title to build our first dataset. For training we would like to have more control over the dataset, because the information you find on YouTube is sometimes very noisy, and these can be very long videos, and you want to train with the parts that really contain the instrument, because sometimes different instruments appear. But yes, we used it for a first round of dataset generation.

Any other question? Well, I guess, thank you again.