Welcome everyone. We'll have a talk by Claudia about Python and rock'n'roll. Please welcome Claudia.

Thank you everybody for coming. My name is Claudia. I am working as a data scientist at Kerminal Analytics. You can find me on Twitter as Claudia Girau. By the way, we are hiring, so in case you are interested in joining us, please contact me afterwards.

Well, today I am talking about a personal project. It is a place where data science, rock music and Python join. It is mainly about satisfying my curiosity: I am a rocker, I love rock music, and I was asking myself if I could do something to learn more about my passion. I know some of you were at the party yesterday, laughing and rocking. I will reference some good talks by excellent speakers from this week; they explained NLP, databases and some music topics much better than I can. And in case you missed today's keynote, you should take a look afterwards, because Gail explained my topic so much better than I could. So please review it; it is quite interesting.

The main development stages were these four: data gathering and storage, data processing and term frequency, clustering and topic modelling, and the future steps.

The first one is data gathering. As I said, it is about rock, about lyrics and NLP, and there are two ways to get the data. One is live: go to concerts and have fun. The other way, faster but less fun, is web scraping. The whole dataset was scraped from MLDB using requests and Beautiful Soup, but in case you are going to do this massively, you should take a look at Scrapy. For my convenience, I stored all the data in MongoDB.

This is me working hard for this talk, in the middle of a concert at the party. And I have to make two disclaimers. The first one is that all the groups were selected according to my personal preferences; in case you want to add another one, contact me and I will do it. The second one is a legal disclaimer: lyrics have copyright, so handle them with care.
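A minimal sketch of the scraping step described above. The talk uses requests and Beautiful Soup against MLDB; to keep this example self-contained it parses an embedded sample page with only the standard library, and the page layout, CSS class names and document fields are assumptions, not MLDB's real markup.

```python
from html.parser import HTMLParser

# Hypothetical page layout standing in for a real lyrics page.
SAMPLE_PAGE = """
<html><body>
<h1 class="song">Song Title</h1>
<div class="lyrics">Hey hey, goodbye baby, tonight tonight</div>
</body></html>
"""

class LyricsExtractor(HTMLParser):
    """Collects the text inside <div class="lyrics">...</div>."""
    def __init__(self):
        super().__init__()
        self._in_lyrics = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "lyrics") in attrs:
            self._in_lyrics = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_lyrics = False

    def handle_data(self, data):
        if self._in_lyrics:
            self.chunks.append(data.strip())

def scrape_lyrics(html):
    parser = LyricsExtractor()
    parser.feed(html)
    return " ".join(t for t in parser.chunks if t)

# The raw schema later shown in the talk: group, album, song, raw lyrics.
document = {
    "group": "Some Band",
    "album": "Some Album",
    "song": "Song Title",
    "lyrics": scrape_lyrics(SAMPLE_PAGE),
}
# With pymongo, storing it would look roughly like:
#   MongoClient().rock.songs.insert_one(document)
print(document["lyrics"])
```

In the real pipeline the HTML would come from `requests.get(url).text` and the parsing would be done with Beautiful Soup (or Scrapy at scale), as the talk says.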
I cannot publicly share the dataset; maybe some n-grams, but the rest is legally problematic. Well, this is more or less how it looks in its raw version: I have the group, the album, the song, the website, and also the raw lyrics. I have scraped more than 7,000 songs from more than 50 groups. In case you are not used to working with MongoDB, this is only a group-by query, grouping by band. So we have the top ten, that is, the most productive, groups by number of songs, and I have a nice snake figure with the word frequencies.

Well, the next step is the data processing. We have to clean the lyrics: remove the meaningless words and the punctuation. And if you take a look, I had to add to the regular stopwords the stopwords of music; lyrics are full of that kind of stuff, and I had to clean it. Perhaps I forgot some. We also have the English contractions. What I have done is process the raw lyrics and store them in another collection for the processed groups.

Well, once I had the clean dataset, I started satisfying my curiosity. The first thing I did was compare my corpus, the lyrics, with the Brown corpus. In case you don't know what it is, it is a collection of documents compiled in the 60s from different fields. Unfortunately, it doesn't contain lyrics, so it is regular English. What I did is take the term frequency in my lyrics and the term frequency in the Brown corpus and, with the previous formula, get the rockness index. Not surprisingly, these are the ten most rocking words: hey, goodbye, baby, tonight. So everything is going okay. And these are the least rocking words: however, general, schools, military.

I still wanted to play with term frequency. If you consider the full corpus, the most frequent word is love.
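The rockness index itself is not written out in the transcript; it is only described as comparing term frequencies in the lyrics with term frequencies in the Brown corpus. A minimal sketch under that description, assuming a simple smoothed frequency ratio (the talk's exact formula may differ), with toy word lists standing in for the real corpora:

```python
from collections import Counter

def rel_freq(tokens):
    """Relative frequency of each token in a corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def rockness(word, lyrics_freq, brown_freq, eps=1e-9):
    """Ratio of the word's frequency in lyrics vs. reference English.
    eps keeps the ratio finite for words the reference never uses."""
    return lyrics_freq.get(word, 0.0) / (brown_freq.get(word, 0.0) + eps)

# Toy stand-ins for the 7,000-song corpus and the Brown corpus.
lyrics = "hey baby tonight hey baby love love goodbye".split()
brown = "the government school military report the love the".split()

lf, bf = rel_freq(lyrics), rel_freq(brown)
ranked = sorted(set(lyrics), key=lambda w: rockness(w, lf, bf), reverse=True)
print(ranked)  # "hey" and "baby" outrank "love", which the reference also uses
```

Words that the reference corpus barely uses ("hey", "baby") get a huge ratio, which matches the slide's top-ten result; words common in regular English score low.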
And once again, if you plot it for my favourite groups: love is one of the most frequent words in Queen, one of the most frequent in the Beatles, and also in David Bowie. But not in Nirvana.

Another question came to my mind after this analysis: do rock lyrics have an extensive vocabulary? What I did is take into account the number of unique words in each group. Grouping by band, the groups with the richest lyrics are Bob Dylan and Bruce Springsteen, and the last ones are Motörhead and AC/DC. Then I did the same grouping by song and taking the mean, and, surprisingly, Rage Against the Machine has an average of 93 words per song. It is rare. But considering that Bob Dylan and Patti Smith are almost poets, it is not strange that they have more than the average words per song. And on the flops we have Radiohead, because they don't have so many lyrics, and Nirvana and the Beatles: they sound fun, but they didn't actually use so many words.

The last question was: is rock about sex and drugs? Hmm. If we ask for songs with love, kiss or sex in the lyrics, we get more than 1,700 songs; almost 20% of the songs are talking about love. But if we ask the same about drugs, only 24 songs are talking about them or have the word drugs in them. So I prefer to change that sentence to this one: peace, love and rock.

By this point, we already know that grouping by band leaves us only a few documents; that rock has not an extensive vocabulary, which disappointed me a bit; and we also have the intuition that rock songs are always talking about the same things: the same words, not so much difference between groups and not so much difference between songs.

However, I didn't surrender, and I tried some more sophisticated techniques. I prepared my dataset; this is what Gail explained this morning. I tokenized and stemmed the words, got the data, prepared it.
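The vocabulary-richness comparison above (unique words per band, and the mean number of unique words per song) boils down to a couple of group-bys. A sketch with an invented mini-dataset; the band names and lyrics are placeholders, not the real corpus:

```python
from statistics import mean

# Invented mini-dataset: (band, song, processed lyrics).
songs = [
    ("Band A", "Song 1", "love love tonight baby"),
    ("Band A", "Song 2", "road river wind fire rain stone"),
    ("Band B", "Song 1", "hey hey hey yeah yeah"),
]

def unique_words(text):
    return set(text.split())

# Vocabulary per band: union of unique words over all its songs.
vocab = {}
for band, _, lyrics in songs:
    vocab.setdefault(band, set()).update(unique_words(lyrics))
richness_by_band = {band: len(words) for band, words in vocab.items()}

# Mean number of unique words per song, per band.
per_song = {}
for band, _, lyrics in songs:
    per_song.setdefault(band, []).append(len(unique_words(lyrics)))
mean_per_song = {band: mean(counts) for band, counts in per_song.items()}

print(richness_by_band)  # {'Band A': 9, 'Band B': 2}
print(mean_per_song)     # {'Band A': 4.5, 'Band B': 2}
```

In the talk this is done over the MongoDB collection of processed lyrics; the logic is the same.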
And as I said, we have only 100,000 or a bit more words in our vocabulary. That is a small vocabulary. What we do next is the term frequency-inverse document frequency (TF-IDF) algorithm. These are the classical definitions, with the formulas, but what we have to take into account is that this algorithm allows us to get the most relevant words in our songs, or in our groups in this case. We do the feature extraction with scikit-learn. I'm going so fast because you already have this presentation and I don't have so much time to explain the algorithm. So we have a matrix with the similarity between the groups, and we are ready to do the clustering.

What I'm going to use is k-means. I tried with three clusters, but unfortunately what we get from the algorithm is something like this: three groups and some words that are not so meaningful. I cannot label this cluster, and I cannot label this one either. And the groups are not so... if I did it by hand, I wouldn't put these groups together. But if we take a look at the plot next to it, I know it's small, but if you can see it, the closer groups are the more similar ones, and it makes a little bit more sense. Still, we cannot see any clear cluster. The main conclusion is that with this algorithm we cannot cluster our rock bands. It is because, as I said, they are so similar.

Once again I didn't surrender, and I tried another technique. But when you plot it, you also cannot see any clusters among the groups, and I was so sad about it.

So once again I tried another technique: LDA. For those who are not into NLP, I will give an easy explanation of it. If you take a look at these five sentences, it is obvious that the first and the second are talking about food, the third and the fourth are about pets or animals, and the last one is a mix of the previous topics. This is what LDA does. I did the same with my songs, without grouping by rock band.
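The talk does the TF-IDF feature extraction with scikit-learn. As a sketch of what that step computes, here is a bare-bones version in plain Python; scikit-learn's `TfidfVectorizer` uses a slightly different smoothed formula and normalization, so the numbers would differ, but the idea is the same:

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal TF-IDF: weight = (term freq in doc) * log(N / doc freq)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency of each word
    out = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        out.append({w: (c / total) * math.log(n / df[w]) for w, c in tf.items()})
    return out

# Three tiny "band" documents standing in for the real corpus.
docs = [
    "love baby love tonight".split(),
    "love rebel riot fire".split(),
    "love baby dance dance".split(),
]
weights = tfidf(docs)
# "love" appears in every document, so its idf is log(3/3) = 0:
print(weights[1]["riot"] > weights[1]["love"])  # True
```

A word like love that appears in every document gets weight zero, which illustrates why a shared rock vocabulary washes out in the feature matrix and the subsequent k-means clusters come back muddy.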
And what I got, with a ten-topic model, looks like this. Once again, there is not so much meaning; it is ambiguous, and I don't have an answer about the topics looking at this.

So I moved forward and tried another, more complicated technique: word2vec. I don't have so much time to explain it, but as an approximation, what I get after the algorithm are the words closest to a given one in the corpus. So for love, these words are close: making love, tender love, true love, avoiding love. And if we try the same with riot, we get guerrilla, psycho and hooligan. Well, this is going a little bit better than the previous one.

By the way, as I promised in my summary, we are going to talk about the Bowie evolution. I know everyone loves graphs and plots, and so do I. What I did is take the most frequent words in the discography of David Bowie, grouping by album and sorting by release date. If you take a look, the blue dots are the love words, and you can see that in these albums they are not so frequent. Maybe Bowie was sad, or not so in love. And in this album he doesn't speak so much about it, but in this one, Never Let Me Down, he repeated this word more than 25 times.

As a bonus track, I have the Queen evolution, a group I actually love. I did the same but with these four words, love, time, want and know, and I plotted it. It's always about love, except in the first two albums. I hope you can see it; if not, you already have it online. In Queen and Queen II it is not so much repeated, but in the middle of the discography they repeated it more than 80 times in one album. It's so much. Well, what's next?
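The per-album word-evolution plots (Bowie's love counts, Queen's four words) reduce to counting a word per album, with the albums sorted by release date. A sketch with an invented mini-discography; the titles, years and lyrics here are placeholders, not real data:

```python
from collections import Counter

# Invented mini-discography: (album, release_year, processed lyrics).
albums = [
    ("First Album", 1969, "space star space oddity"),
    ("Second Album", 1971, "love changes love love"),
    ("Third Album", 1987, "love love love never let me down love"),
]

def word_evolution(albums, word):
    """Occurrences of `word` per album, in release order."""
    ordered = sorted(albums, key=lambda a: a[1])
    return [(title, Counter(lyrics.split())[word])
            for title, _, lyrics in ordered]

print(word_evolution(albums, "love"))
# [('First Album', 0), ('Second Album', 3), ('Third Album', 4)]
```

Plotting those (album, count) pairs per tracked word gives exactly the kind of evolution chart shown for Bowie and Queen.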
I did this project in less than one month, for pleasure, and what I would like to do next is do it massively: scrape more groups and more styles, try to make some clusters that actually work, include some patterns, and my final objective would be to build a hybrid recommender system. So of course, to get this done, I need some help, so feel free to collaborate with me. The notebook and the PDF of this presentation are already online; you can check them. Thank you for your attention, keep rocking, and see you at PyConES this autumn.

Thank you, Claudia. So we have time for questions.

Hi Claudia, thank you for your talk, it was really amazing. I was wondering, it was Pulp and Blur, wasn't it, who were together? So do you think that maybe there's an American versus British English effect, like some words may be closer because of the dialect? Thank you.

Probably. Actually, there are more British groups than American ones in my corpus; maybe that is affecting my study, but probably they are using the same words. So thank you.

Thank you. Are there more questions?

Hi, thank you for your talk. You said that the groups are most of the time very similar. Isn't that because you only took groups you like?

Yeah, probably. Well, I think 50 groups is not a few, I think it's quite big, but probably they are so similar because I chose them: I like similar groups, and I found them through their history. But mainly they are from the 80s, and probably that is affecting it; if I took some more current groups, it would probably change. Thank you.

I don't know anything about natural language processing, so maybe this is a stupid question. But do you think aggregating a lot more lyrics, and manually tagging some of them to help the algorithm, would help to find more clusters or more relationships between words and things like that?

Well, if I add more lyrics...

Yeah, if you add more lyrics and also if you tag them manually.
Okay, so tagging each song as being about love and friendship and whatever. That would be really hard to do. It would probably help; I could use another, classifying algorithm that I'm not using now, because I don't actually have a target. If I had tags or something like that, I could do it.

Thank you for the talk. Drugs: have you thought about euphemisms? I'm thinking drugs and sex are things we don't normally talk about, and songs don't express them in those words, so I'm not surprised you haven't found a high count of the word drugs. I'm not a native speaker, but I'm sure there are suggestions in how we talk about drugs. Have you looked at them?

Actually, I didn't want to put too many explicit words in that drugs query. Probably they are not using explicitly the word drugs, nor the chemical substances. So maybe they are using synonyms, but I'm obviously not a native English speaker, so I don't know how to put that in the query.

Thank you for your attention, and thank you Claudia again.