Thank you very much for the invitation. First we would like to introduce ourselves, and I'll try to keep the talk short enough that you have time for questions afterwards. Serendipia is a very young startup: we have been around for less than two years, and we are very much focused on deep learning techniques. We come from the typical data science background, with PhDs in computer science and artificial intelligence. We got together after working in different areas (consultancy firms, other technology firms, and startups) and decided to start our own business. Basically, we do professional services oriented around deep learning, but we are not talking about any of that today; if you have questions about it, we can talk later. The other thing we like to do is research and innovation, in the sense that we don't actually see a business model behind it yet. If you see a business model for this, we are happy to hear about it, because we have trouble seeing one. This talk is very much focused on innovation in the area of preventing online harassment, and on how we can use AI for that. I'll be doing the talking now, and during the question round Jorge and I will be happy to take questions. One quick note about the way we work in this domain: we want to be very cross-disciplinary, and a main focus for us is cognitive neuroscience. You know about AI, because you are here, and about big data environments with a lot of unstructured data, but we combine that with our experience in cognitive psychology and neuroscience.
These are labels many people use, but from my point of view, and from my experience of coming first from computer science and then from psychology, there is a great breach between the different realms: we sit in different domains and we don't actually share much. There is some talk about being multidisciplinary, but I don't see it happening very often in the real world, and that is what we are trying to do. For that, the talk has two blocks. The first is about the problem domain, which is very much about the problem itself: the cognitive psychology of how people experience online harassment. The second is the more technical approach to that problem, but we really need to understand the problem first. I'll actually cover a bit more than those two points, because the important thing for a talk like this is that you get a feeling for how it works, so we will present how we have addressed the problem and how we are doing so far. We are happy to take any suggestions or critique you want to share afterwards. Online harassment: every one of you, I think, more or less knows what that means. We can talk about harassment or abuse in many forms, and this is the world we live in. Some of you are gamers.
Some of you spend three quarters of your life connected to social media; some of you use messaging apps, blogs, and so on. Things that used to happen only in the offline world, bad things and good things, are now happening here as well, and we need to identify what they are. For those of you who, like me, are very technical (again, I'm a computer scientist): we really need to do good requirements engineering. We need to understand the problem. If you start coding in Python to solve this problem without understanding it, then you have a problem. That's why I'm investing this time. Humiliation, threats, harassment, discrediting, intimidation, extortion, stalking, impersonation: you are familiar with these, I hope. Perhaps not very familiar, but you know what they mean, right? Maybe you know them because you have suffered some of them; that's for sure. Maybe you know them because you're a bully, I don't know, I can see your faces right now; we can settle that later. And the funny thing is that everyone is involved here, because we sometimes see these problems as someone else's, when actually they are part of how our society is evolving right now. Yes, we often talk about vulnerable groups, but this is not a talk about vulnerable groups, because everyone is a target here, and everyone is also a potential offender. It affects everyone, and we have some responsibility here.
At conferences like this we talk a lot about technology, and we are going to do that in a minute, but again, it is important to see what our responsibility is as technology experts or implementers. Maybe we are not policymakers, but we are the ones in charge of defining what technology is going to do, so we need to engage with this. It is something we cannot ignore; that's the point. I'm not going to define all of these things; this is just a word cloud. You are probably familiar with some of these terms, and you can look up the others on Wikipedia. The idea, and the focus of this talk, is how we can fight this phenomenon. One way to do that is by focusing on language. Some of these behaviours have other components, of course; it is not all about verbal communication. But language is essential here, because it conveys meaning, and these online harassment problems are very much related to meaning, so one has to understand what meaning is about. Let's see one example to clarify what we are talking about, and to see the complexity of it. Maybe you know this movie: [plays clip] "Ladies and gentlemen, electric cars... they're totally gay. It's true. I don't mean that they're homosexual gay, but I do mean your-parents-are-the-chaperones-at-the-dance gay. You tuck it in and wear it real high gay. I don't want to disrespect anybody, because I'm not about that, but I think we're understanding what we're trying to talk about here, right? I mean, honestly, the Nissan Spit? Really? It screams it. Oh, here we go, the Hyundai Pomegranate. The Chevy Fingerprint." Okay, I'm going to stop it at the cat, because it's a cute cat. So you see what I mean.
It's not easy. We are going to talk about supervised learning; many of you are familiar with machine learning and deep learning techniques, and you know we have unsupervised and supervised learning. This is a problem for supervised learning, and as those of you who have worked with machine learning algorithms know, the training data set is crucial: you need good labeling, a well-tagged data set. The problem here is: what is correct and what is not? That's why I bring up the topic of political correctness. Maybe this scene was okay a few decades ago and is not politically correct now. And that depends not only on time, but also on different cultural backgrounds, and so on. So even if you have the right algorithm, the real challenge is probably having an appropriate training data set; who is the supervisor in this supervised learning? That's why people, as in this Vanity Fair article, were wondering: are electric cars gay, or is Vince Vaughn an idiot, or both? It sounds silly, but it is a really serious question, because if you are going to train a machine learning algorithm, you need to answer it. Otherwise (are you familiar with GIGO, garbage in, garbage out?) you are building a predictive model that takes garbage in and gives you garbage out: trash, not valid, and you are not doing your job. So the difficulty here lies mostly in the data set. Of course, political correctness is something that is being "solved" by Trump, as you all know. I'm not playing this video, but Trump has been criticized many times because he uses terms like "anchor babies", and he says political correctness is too slow for him.
"We don't have time to be politically correct." So that's one way to solve the problem: just say, okay, I'm going to offend people and I don't care. I throw that out to you so you can think about it. And here we come to the root problem: language is extraordinarily complex, with an infinite number of meanings, and we have to build a model that can handle what we need to detect or prevent, in this case online harassment, which takes many, many forms. "Anchor babies" instead of babies of immigrant mothers who are supposedly trying to take advantage of birthright nationality: that's one example. This one is for those of you who speak Spanish, with the Google Translate version next to it. I'm not going to read it out loud, because it's not nice, but it's real; it was taken from the newspaper El Mundo yesterday. So problems with language appear all the time, and we want to offer an AI solution for that. Non-verbal communication is important as well, but here we are focusing on language; we know the non-verbal side is there too, and we will take care of it later, since we decided to focus first on verbal communication. Okay, let's go to the technical side of it. How do we solve this? Moving on from the problem domain, the analysis of what our problem is: it could be solved in many ways. These are the ways we came up with, but people working in this area probably know other solutions.
So one solution for me is forbidding online access like even for forbidding online access even for for children Totally so they are not exposed to potential threats or potential dangers Of course, as you know, as you may be thinking we have to do some trade-off And yeah, of course, we need to be in the online world. So it's nothing that's not problem probably the way So monitoring the communication so monitoring and I use the brackets for spying because I can I cannot really see the Difference between monitoring tracking and spying One different and one thing it's important for us. It's like maybe if there is a machine who is monitoring Your the communication and it's not another human. Maybe that is not that intruding This for me because I'm used To I'm used to use Gmail and I don't feel bad because some machines from Google and reading my emails But that's me So a more important is You know and having a moderator so the supervisor or in Decades of vulnerable vulnerable groups like children Of course, we can think about parents or educators or someone but again in the online world We don't have the capacity to have humans taking care of that. That's why I You know, I think we can think about AI solutions in terms of having some kind of bots or some automate way to do that But maybe in in collaboration with humans so think about here I would tend to think more Not automating everything, but I'm kind of thinking more in a hybrid solution like, you know collaboration between machines and Feeware humans. So, you know Improving productivity while doing this And we come back again to the problem of Data sets. I mean, what is the normative? What is the good training data set should we use and that's again the problem all the time So, of course, you know about a natural language processing here. 
Actually, we move to a more appropriate domain, natural language understanding, because we need to extract meaning. All the constructs I was referring to before, like threats, insults, being stalked, are conveyed in language in the form of meaning, so we need to extract that meaning in order to detect that these things are actually happening in a communication. From the point of view of pure analytics, we have a classification problem, maybe some regression; let's see. You are probably familiar with sentiment analysis solutions that classify text as positive, negative, or neutral. That is not enough for this; that much is clear. So we are modeling this as a supervised learning problem, as we said before. We need to categorize or classify the texts, the comments or documents, with categorical labels, so we have a multi-label scheme with a set of different constructs. I like to call them constructs; you could call them traits, or features, although they are not really features, they are labels. Think of them as the parameters we want to measure in online communication so that we can detect online harassment in general. And we have many dimensions.
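As a toy illustration of what a multi-label scheme means in practice (this is not our actual model, and the label names and scores are made up), assume a classifier that emits one independent score per construct; every label whose score passes a threshold is assigned to the comment:

```python
# Toy multi-label decision step: a classifier emits one independent
# score per construct (e.g. a per-label sigmoid output), and every
# label whose score passes the threshold is assigned to the comment.
LABELS = ["toxicity", "insult", "threat", "identity_hate", "obscenity"]

def assign_labels(scores, threshold=0.5):
    """Return the subset of labels whose score reaches the threshold."""
    return [label for label, s in zip(LABELS, scores) if s >= threshold]

scores = [0.92, 0.81, 0.05, 0.74, 0.10]  # hypothetical model output
print(assign_labels(scores))  # ['toxicity', 'insult', 'identity_hate']
```

The point of the multi-label setup is that, unlike plain sentiment analysis, one comment can carry several constructs at once, or none.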
So we have a high-dimensional input (language is a really high-dimensional input) and also a multi-dimensional output, and we have to deal with both. One thing I didn't mention is that we wanted to put the focus on the Spanish language, because we are Spanish, we work here, and we have the Spanish-speaking community. If you work in natural language processing or understanding, you know English is usually the first option, for people like Google or Facebook. We have this other goal of having a Spanish-first policy: like Trump with "America first", we are "Spanish first". So we built a comment scraper for the news aggregator Menéame, which is a good source of potentially toxic comments, all sorts of things. For instance, you have a news item (this one is also from yesterday) with around 40 comments on it: all sorts of comments, very complex, with a lot of variability in the data. And then, when I say "wow", it's because I'm thinking from the computer science point of view: that's a huge problem. High dimensionality again, and a lot of questions to be solved with language. One of them is the unit, what linguists call the unit of analysis. What is my unit, my exemplar? If I'm doing machine learning, I have pairs (X, Y): what is my X? That is already a difficult question, even before you start any model training. And if you look at actual real examples (and the idea here is to have an actual product working in a real environment), you find things like this on Menéame, and this could be the unit of analysis. [reads a Spanish comment from the slide] Very, very difficult.
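To make the unit-of-analysis question concrete, here is a minimal sketch (the comment text is invented): the same raw comment yields very different exemplars depending on whether the unit is the whole comment, the sentence, or the token.

```python
import re

# Invented Spanish comment; the question is what the exemplar X should be.
comment = "No me lo puedo creer. Este tipo es un sinvergüenza, y lo sabe."

whole = [comment]                                 # one exemplar per comment
sentences = re.split(r"(?<=[.!?])\s+", comment)   # one exemplar per sentence
tokens = re.findall(r"\w+", comment)              # one exemplar per token

print(len(whole), len(sentences), len(tokens))    # 1 2 13
```

Each choice trades context for granularity: a whole comment carries the most meaning but mixes several statements, while single tokens lose the context that, as we keep stressing, is essential for meaning.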
Sorry for the Spanish; okay, you didn't understand anything, but you can use Google Translate or Google Lens. Very difficult: it is quite hard to decide how to operationalize this, how to make it operational. I don't know how many of you know traditional NLP (natural language processing) techniques; by traditional I mean count-based vector space models. As you probably know, if you are familiar with machine learning, we always do the same thing: we take inputs x1, x2, x3, x4, and so on; we maybe do some pre-processing in order to get feature vectors, describing the original raw input as features; and then we feed the predictive model with these features in order to make the prediction. That's the original machine learning way. Of course, if you are very lazy and you like deep learning, you have this dream, which is not realistic, of forgetting about feature extraction and feeding the model directly with the raw input, and you most likely end up with things not working. It can be done, but it is much more difficult, especially with language being such a complex domain. Usually we end up doing feature extraction: we build the feature vectors. The traditional way to build feature vectors in NLP is the bag-of-words scheme, TF-IDF, as has been done for many, many years, and one-hot representations. If you are familiar with data science, you know these terms; if you are not, basically look at the picture: ones and zeros, counting words. In principle very simple, but it can be very powerful: you can do very good document classification or clustering with this, and it works. The problem is that it is not versatile enough for what we want, because we really need to grasp the meaning here. You saw the stress I put on this at the beginning: meaning is key here.
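A minimal sketch of those count-based features, using only the standard library (the three "documents" are invented): each document becomes a vector of word counts over a shared vocabulary, optionally reweighted by TF-IDF.

```python
import math
from collections import Counter

# Three invented toy "documents" for the count-based pipeline.
docs = [
    "electric cars are gay",
    "electric cars are quiet",
    "quiet cars are nice",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bow(doc):
    """Bag-of-words: raw count of each vocabulary word in the document."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tfidf(doc):
    """TF-IDF: term frequency weighted by inverse document frequency,
    so words appearing in every document (like 'are') get weight 0."""
    counts = Counter(doc)
    vec = []
    for w in vocab:
        df = sum(1 for d in tokenized if w in d)
        idf = math.log(len(tokenized) / df)
        vec.append((counts[w] / len(doc)) * idf)
    return vec

print(vocab)              # ['are', 'cars', 'electric', 'gay', 'nice', 'quiet']
print(bow(tokenized[0]))  # [1, 1, 1, 1, 0, 0]
```

This also illustrates the limitation the talk points out: "gay" gets the same count whether it is used descriptively or as a slur, because counts carry no meaning.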
It is not valid to do something like counting or detecting keywords; keyword-based AI is not enough for this. That's why we go for deep learning now. Of course it is fashionable these days, it is the trend, we have to talk about deep learning; okay, now we are doing it. But instead of just paying lip service to deep learning, let's just say we change the model: from a count-based vector space model, in which we derive features from the text by counting words, we go to neural representation learning. Deep learning is mostly about representation learning. What does representation learning mean? That is the important key here. With a classical machine learning algorithm, you focus on the X's, the inputs, and the Y's, the outputs; you learn some rules, and you get your predictive model working, and that's nice. But here we put the focus on what is happening in the hidden layers, inside the model: there is some learning happening there. That's why I used this illustration, as a reminder that we humans understand; we have meanings for words, and they are stored in our brains. Our brains have solved the problem of representing the meaning of words. I don't know if you have given it much thought: how does your brain store words and meanings? How is the word "chair", or "screen", or "house" stored in your brain? Is it one neuron? Is there a house neuron, a cat neuron, a dog neuron? And then if you drink too much one night (you know that when you drink alcohol you are killing neurons, right?), maybe your dog neuron is killed that night and you are no longer able to say "dog". I don't know if that has happened to you. Okay, I can see just one member of the audience; the others are not alcoholics yet, that's good. Okay, so it's not that way.
It's not working that way of course So how it works So that that's a complete different talk, but I just wanted to show you some fancy pictures of fMRI because are nice Really nice to see those brains with colors, right? It's cool. It's kind of cool. Okay But what's the the real point here because we are not going to explain all the broadcast area blah, blah, blah Basically what we intuitively know is that the brain has a way a more distributed way to a store meaning in Not in one neuron like this simple example I was giving earlier but using a distributed representation of words in Neural tissue in the nervous system So the nervous system is very effectively using a set of neurons or other cells In order to store a myriad of different meanings of words as a network and that Network you could think about okay if I as an engineer now You know, I forget for a moment that I'm a psychologist and I don't care about the pain of others. So as as As an engineer the my way to understand how the brain a store words is to open your skull Use a lot of microelectrode Of course, you are killing the process and it's messy But but but at least you get you know a real reading of what's happening in your in your brain Probably you will be much distressed and then the only word represented in your neurons. Oh, I'm gonna die. I'm gonna die but yeah So very difficult to understand how brain does that we don't know but we have this intuition that the representation of words is done through a distributed very robust representation which could be translated into numbers imagine that you know the brain is a electronic system and it's Working with impulses. 
If we could measure with these microelectrodes all the numbers, how many millivolts are running through every synapse in the language areas of your brain when you are talking or understanding a message, then that would be what we are looking for: a really effective representation of meaning. And that is what we do in deep learning, following the famous paper from 2013 on the continuous bag-of-words and skip-gram methods for learning this representation. So you see where I am going: instead of count-based feature vectors, we are going to use different representations, which are much more effective and much more biologically inspired. How do we do that with artificial neural networks? By doing this trick, which is basically masking: we take a lot of input data with correct, or more or less correct, language; we mask one of the words; and we build a model able to predict which word should go there. You train a big enough neural network that can predict words in their context, and here is the trick: words are not isolated things. They are not entities with meaning by themselves; they have to be in the context of other words. Of course, they really have to be in the context of the real world, but since we are focusing only on language, let's talk about language context. That is what we are doing with these approaches, which you may know as word embeddings, word2vec, doc2vec; many names for the same thing, neural representation learning for words. We do this by training an artificial neural network model, and as input we present our one-hot-encoded words; the output has the same shape. But we are not interested in the output.
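The "words in context" idea can be sketched in a few lines (a simplified skip-gram-style setup; the sentence and window size are made up): for every position, each neighbouring word within the window becomes a (context, target) training pair.

```python
# Simplified skip-gram-style pair generation: every word is a prediction
# target, and its neighbours within the window are its context.
def training_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((tokens[j], target))  # (context, target)
    return pairs

sentence = "words have meaning in context".split()
pairs = training_pairs(sentence)
print(len(pairs))   # 14
print(pairs[0])     # ('have', 'words')
```

In the real thing these pairs, generated from billions of sentences, drive the weight updates of the network; the learned hidden layer, not the prediction itself, is what we keep.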
What we are learning is the hidden layer So we are learning a feature map or a feature vector See so instead of the traditional method in machine learning where we are humans the ones in charge of Building the feature vectors. We are relying on a neural network to that and for that we need literally billions of Words to learn that another way that's n-dimensional vector, which is also called the latent Currification it's latent. It's hidden. It's within your brain within the neural network now with this Came a lot of power. So Everyone if you are from this domain if you are familiar with this domain Everyone is excited about word-to-veg Google is using word-to-veg everyone use embeddings We want to use we had a chat before Entering here. We were talking about online travel agencies And I said, oh, you have to use a travel a traveler embedding because we are obsessed about using embeddings because they have a potentially a good Potentially they have a very powerful effect in terms of being very useful features for Predictive tasks like like the one we have in hand today. Okay, and of course when they project in These are principal component analysis with two dimensions. They project Words that are now described as vectors. So the key important point here is one word is not is not not not only a word now It's not a one-hot encoding One and zeros is a fully, you know, trained vector with a lot of numbers Maybe 300 400 dimensions and when we project these 200 or 300 dimensions on to two dimensions And we see how concepts relate relate to each other We can find things like these king and queen man and woman relationships that that this Meaning and dimension at a space Represents meaning somehow in a mathematical way. That's what we wanted because we are actually building algorithms that are based on mathematics so the only way to actually Implement this is using mathematics, right? 
Okay, just to show you an example of the things we do with this kind of representation: if I take a sample of four documents from a different experiment, I can translate each one into a 400-dimensional feature vector, and it looks like this. I show you this to illustrate that it is a black-box model: these numbers have been learned in a latent, hidden space by a neural network. They don't have any explicit meaning for a human looking at them, but that does not mean they are not useful. To see it more intuitively: we now have a space full of vectors, in which any word is represented by a vector. We can do operations with these vectors, like adding, subtracting, multiplying, and so on, and we can also calculate distances. What these researchers discovered (not us) is that we now have the possibility of measuring distance in meaning: likeness, or rather similarity. The most similar words to "niño" in the Spanish data sets we have been using are things like these. And the very typical example, now in Spanish: if I take the word "rey", subtract "hombre", and add "mujer", I get "reina", a queen. That kind of operation shows how we are now able to manipulate meaning from a mathematical point of view, and that is very powerful, at least potentially. Okay, so now we need one big set of all the possible words in Spanish that can be represented as word vectors. For English, everyone uses GloVe, the one from Stanford, or other typical pre-trained word2vec libraries and data sets, but they are in English; there is no Spanish version. So we have been working with the Spanish billion-words resources: basically, researchers used Wikipedia and web crawling to collect millions and millions of sentences and do this learning by masking a word and predicting it from its context. Once we have this, we have another problem: we need a sequence model, because now we have a text, maybe long, maybe short, as the unit of analysis, the one we want to label as abuse, or a threat, or a humiliation, and we need to translate that sequence of words into, again, a feature-vector representation. For that we can use recurrent networks, like long short-term memory networks; those of you working in deep learning are probably familiar with LSTMs. I'm not going to explain them, because we don't have time for that. Just so you know, we use LSTMs, we use GRUs, and we also use a BiLSTM, in which we combine a forward layer and a backward layer, so we can get context in the sentence from the beginning to the end and also backwards. That is more informative in the sense of modeling what is in the text, the meaning conveyed by the whole text. That is what we have now, and in the last four minutes that I have (or two), I am going to show you how it works. Of course there is room for improvement. One way to improve is to get a bigger data set, with a crowdsourcing approach in which we ask people like you to connect and rate, for example, how chauvinistic a message is. And yes, it is very difficult to get a lot of people doing that.
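The forward-plus-backward idea can be sketched with a minimal untrained numpy RNN (random weights and toy sizes; the real model uses trained LSTM/GRU layers over word embeddings): the sequence is read in both directions and the two final states are concatenated into the feature vector that feeds the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid = 8, 4                       # embedding size, hidden size
Wx = rng.normal(size=(d_hid, d_emb)) * 0.1
Wh = rng.normal(size=(d_hid, d_hid)) * 0.1

def rnn_read(seq):
    """Run a simple tanh RNN over a sequence of word vectors and
    return the final hidden state as a summary of the sequence."""
    h = np.zeros(d_hid)
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

sentence = [rng.normal(size=d_emb) for _ in range(5)]  # 5 word vectors

# The bidirectional trick: one pass forward, one pass over the reversed
# sequence, then concatenate both summaries into a single feature vector.
features = np.concatenate([rnn_read(sentence), rnn_read(sentence[::-1])])
print(features.shape)  # (8,)
```

Whatever the length of the comment, the output is a fixed-size vector, which is exactly what the multi-label classifier on top needs.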
Okay, so let's go back and see how ours works. Our initial product is called Maritas. We have been testing it for some time now, with things like the "electric cars are totally gay" clip, and you can see how the model is able to predict, or classify, the toxicity. If you look at the last label, identity hate, you see how the gays, as he calls them, are the group being hated in this sentence. This is another example, for Spanish comments: "La mujer es granujilla y se aprovecha mucho del hombre blandengue", a sentence that is almost impossible to translate into English; I tried, I swear. But we wanted to test Maritas against a real test set. You know the train/validation/test split; well, this is the real test: El Fary. El Fary is the ultimate test. I already said that translating "el hombre blandengue" (the soft man) is almost impossible. Changing languages is actually changing culture, and that is why meaning, the word2vec-style codification of words in our brain, depends on the culture we grew up in; that is also true here. In this case, we are able to detect insult and toxicity, but it is difficult, because the term "blandengue" is probably not included in our word2vec vectors, and the model cannot get what is not there. With "la mujer es pícara" (women as, maybe, sly) you can see how we detect the hate towards an identity, in this case women, plus insult and toxicity. And the same with the full sentence, "La mujer es granujilla y se aprovecha mucho del hombre blandengue".
We can also see that the longer the unit, the easier it is for the model to detect how toxic the text is. This one I am not going to play again, but you can see it scores almost at the top on every label except threat. It is very clear, right? Of course there are things to improve, but we are doing fine at distinguishing the nuances. Again, this is an example where you can see clearly that it is not about keyword detection; the problem is actually finding these nuances in the text. We also want to apply these very same methods to other problems, like alexithymia, which is the difficulty in identifying emotions: kind of the same problem, but a different one, just so you have a feeling that we can work across domains. And of course we would also like to include the non-verbal domain; I have fun doing this, actually. So we are happy to take questions now; I guess we will have 15 minutes. And of course I take every opportunity to test our model, so we are happy to take questions in a very kind manner, without any abuse. Okay, thank you very much. I don't know, do we have time for questions? Yeah, go ahead, if someone has a question, please go ahead.

[Audience] About the business case: have you thought about using it in schools to prevent harassment? Or in social networks, Facebook, Twitter?

[Speaker] Yes, that was actually... One thing about social media is that Facebook, Twitter and so on are supposed to have their own models; they are in a privileged position in terms of having the training data sets, and they use their own models. In that case I find it difficult to see the business case, because it is their business case, not mine.
That's one way but the other The other domain schools or communities or you know social I mean smaller social networks That could be one and for that we need an alliance with the platform platform providers We actually didn't explode that yet in we are in that process We wanted to have the model working fest and then going to the platform providers and say hey You could use this as an API. So actually what we are working in that is that it's on having a Ready to use API that any platform could integrate something like that Thank you Okay, I don't know. Do you have any more questions? Okay, thank you very much