Good morning, my name is Mano Revelle. I'm a PhD student at MIT, and today I'm going to talk about what we can learn from social media data about tinnitus. This is joint work with professors Menchaya, Londero, Teshpen Ratino, and Dr. Spalacius and Bol. As you know, online forums concentrate hundreds of thousands of spontaneous patient reports that are not biased by the medical interaction. Machine learning proves to be a handy tool to extract information from these reports, and we leverage it to create comprehensive summaries of the heterogeneity of the perceptions of tinnitus. We hope this can provide insights and new ways to think about how to handle tinnitus.

You might be wondering: why should we use social media data to conduct medical research? Before I answer this question, let me say that this is not a novel idea. Social media data have been used in other fields, and in audiology in particular, Twitter, Facebook, Reddit, and YouTube data have been used either to look at what people talked about or to understand the difference in quality between social media data and clinical data.

We look at Reddit data posted over a span of eight years, between 2011 and 2019. We have 100,000 posts created by 12,000 users, so this is a potential pool of 12,000 tinnitus patients who have been describing their conditions over a pretty long period of time. What we think is special about these data is that they were not collected through the controlled and filtered medical environment, through closed questionnaires or focus groups. Instead, they come from an online environment where you have anonymity and where you can spontaneously go and write about your perception. We also think it's particularly interesting in the case of tinnitus, because tinnitus is not well understood and because there exists no cure for it to date.
So it's often the case that people seek online support or try to find people whose conditions resonate with their own, hoping to find someone who can help them cope with tinnitus.

We extracted from the raw data a set of comprehensive summaries of different topics. The idea was to understand what people talked about and how they talked about it. After processing the data in a very standard way, we ran a latent Dirichlet allocation (LDA) that allowed us to stratify the corpus into 16 different topics. We can talk in the Q&A about how we got this number 16; this was done through a coherence-score heuristic. Each of the topics was represented by a vector of words. But vectors of words are not really useful for understanding the subtleties, the nuances of what people care about. So we grouped all of the posts associated with the same topic together, and we identified the most representative sentences within each of these groups, so that we could create comprehensive summaries that were really generated from what people talked about, but that doctors could interpret down the road.

I want to tell you a little bit about this latent Dirichlet allocation; I don't want you to think of it as just a black-box algorithm that you can't fully trust. The idea of LDA is that the corpus has been created by a probabilistic law of nature, and this law of nature has some features. The first one is that the corpus was created out of a mixture of topics. So most of the words would be coming from, in the sake of this example, two topics, and topic one here is said to be more likely than topic two. You also have a set of words that constitute the vocabulary, and each topic can be represented by a vector of probabilities over this set of words. Given that you're in topic one, for instance, one of the features would be that cat is more likely to occur than dog, then hello, and then orange.
A similar probability distribution would exist for topic two. Given these features, which are in some sense fixed by the law of nature, you can generate a post and then generate the entire corpus. The way you do this is that you first choose a topic for the first word, by sampling a topic from the green distribution. Then, given that you know the topic, you choose a word, by sampling from, let's say, the red distribution here, assuming the topic you selected before was topic one. You just repeat this for all of the words and generate a post.

The idea of LDA is to reverse-engineer these probabilities. Because if I tell you that the words cat and dog are the most likely to represent topic one, you can eyeball these words and understand that this topic is the animal topic, for instance. The way we reverse-engineer is by using a method called Gibbs sampling, where you start by randomly initializing the assignment of each word to a topic. Then, iteratively, you resample the assignment of each word, one by one, according to its conditional probabilities. There is a theoretical foundation guaranteeing that the probabilities you obtain converge over time to the true probabilities, if you do this enough times. And this is exactly what we do: we run this procedure for 10,000 iterations to identify these probabilities.

Now let's talk about the results. As promised, you have as an output a set of words that are supposed to represent each topic. But I don't want to spend too much time on this; I actually want to show you the digested version of this table. We could group the topics together. The TMJ topic, brain damage, hearing loss, and loud music all appear to be talked about as potential causes of tinnitus.
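To make the Gibbs sampling procedure just described concrete, here is a minimal collapsed Gibbs sampler for LDA on a toy corpus. The corpus, vocabulary size, number of topics, priors, and iteration count are all illustrative stand-ins, not the settings used in the actual study:

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 0, 1, 2], [0, 1, 1, 3], [2, 3, 3, 4], [2, 2, 4, 4]]  # word ids
V, K = 5, 2               # vocabulary size and number of topics (16 in the talk)
alpha, beta = 0.1, 0.01   # Dirichlet priors on doc-topic and topic-word dists

# Random initial topic assignment for every word token.
z = [[rng.integers(K) for _ in doc] for doc in docs]
n_dk = np.zeros((len(docs), K))   # topic counts per document
n_kw = np.zeros((K, V))           # word counts per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        n_dk[d, z[d][i]] += 1
        n_kw[z[d][i], w] += 1

for _ in range(200):  # the talk uses 10,000 iterations on the real corpus
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] -= 1
            n_kw[k, w] -= 1   # remove the current assignment from the counts
            # Conditional probability of each topic given all other assignments.
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_kw.sum(axis=1) + beta * V)
            k = rng.choice(K, p=p / p.sum())   # resample this word's topic
            z[d][i] = k
            n_dk[d, k] += 1
            n_kw[k, w] += 1

# Estimated topic-word probabilities: the "red distribution" of the example.
phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + beta * V)
print(np.round(phi, 2))
```

Reading the top entries of each row of `phi` is exactly the "eyeballing" step: the highest-probability words are the ones shown to represent each topic.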
We also find tinnitus life impacts, where people discuss sleep issues and depression, and cure and coping mechanisms, where people discuss the state of the art in research, how they manage stress, and how they interact with doctors. There is also a peer-support topic, an overarching topic that captures how people share experiences about tinnitus, the importance of the support system, and how they discuss their conditions with hope or hopelessness. And finally, there is this idea of time perception: how the tinnitus came, what the chronology of tinnitus is, and the issue of everlasting tinnitus.

Now, looking at the summaries, we can understand the subtleties and nuances within each of these topics. The TMJ topic, for instance, had three main avenues: one was teeth and dental surgeries, another neck and shoulder tension, and the last jaw clenching and misalignment. And this makes a lot of sense from a medical perspective. Looking at stress management, we identified two big avenues. The first one was people trying to identify the sources of stress that could cause tinnitus, and they found these in caffeine, sugar, drugs, alcohol, smoking, and time spent on a screen. The other one was about managing their stress with vitamins and natural products, and people got pretty detailed in telling how many milligrams of each product they took and how often each week. Finally, if we have a look at the support system, there is also a tension between two themes: the first being the benefit of having people online who understand you and with whom you can relate; the other being the disappointment at the lack of understanding from the medical community.

As I said in the beginning, the idea of this work is that this analysis can be given to doctors, with the hope that it can eventually trigger new insights or new ways to think about tinnitus conditions, to better support the patients.
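The summaries above come from picking the most representative sentences within each topic group. As a minimal sketch of one way such a selection could work, here is a bag-of-words centroid scoring; both the scoring function and the example sentences are illustrative assumptions of mine, not necessarily what the study used:

```python
from collections import Counter
import math

# Hypothetical sentences grouped under one topic by the LDA step; these are
# invented stand-ins, not real posts from the dataset.
sentences = [
    "my tinnitus got worse after a dental surgery last year",
    "neck and shoulder tension seems to make the ringing louder",
    "i clench my jaw at night and wake up with louder tinnitus",
    "unrelated chatter about the weather",
]

def bow(sentence):
    """Bag-of-words vector of a sentence."""
    return Counter(sentence.split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

# Centroid of the topic group: the summed word counts of all its sentences.
centroid = Counter()
for s in sentences:
    centroid.update(bow(s))

# The most "representative" sentences are the ones closest to the centroid.
scores = {s: cosine(bow(s), centroid) for s in sentences}
summary = sorted(sentences, key=scores.get, reverse=True)[:2]
print(summary)
```

Sentences that share vocabulary with the rest of their group score high, while off-topic chatter falls to the bottom, which is the behavior an extractive summary of a topic group needs.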
To do this, one thing we really want to understand is the topic dynamics, because the topics I showed you are not exclusive; actually, there is overlap between some of them. To understand this overlap, we want to look at the users' dynamics: one user might be talking over several years and creating different posts that are associated with different topics. And we want to understand how often a pair of topics is discussed by the same user, to quantify what we call the co-occurrence of the topics. This is a preliminary result, where a bright area in the 16-by-16 matrix says that the two topics at the intersection of that square have been co-occurring.

Of course, these data are not free of hurdles. In particular, one of the issues is that people self-select to go online and talk about their problems. People who are really happy with their medical treatment, for instance, might not need to go online and seek complementary support. So the way to think of these data is really as a complementary way to interact with the patients' symptoms, something that can bring new insights from outside the clinical environment. Of course, this is provided that we can process this large amount of data, and this is now the role of computer scientists, who can support the medical research. And there is a lot of room, as I discovered during this research, for further collaboration with therapists, practitioners, and ENTs, to make the most of this analysis and create a feedback loop where computer scientists can hand in analyses that are refined thanks to medical expertise. Thank you very much. I'm here to answer any questions.