Okay, the future and social science. You saw the blurb for my talk; as my friend John Bohannon said, "Oh, holy god, what is this talk about?" There's a lot going on in this talk. This is me. This is a very ambitious, audacious talk to give. My head is very large right now. It's full of ideas to the point that it's distorted. We're gonna try to pull that down by the end of the talk and get some of these ideas out to the rest of you. These are folks who have been supporting me and helping me along the way, so I wanna say thanks to them at the top. But this whole talk started, if we have to blame anyone for my audacity in trying to predict the future, with Ziad. Ziad and I met at the Social Science Research Council. We were on this committee on digital culture, and he kind of liked what I was doing with text. He asked me this question. He wanted me to talk about this at Sage Publishing for their annual retreat: how does social science research need to engage with new tech, big data, and interdisciplinarity to tackle big challenges and contemporary problems? Big question. The only reason this talk exists is because I answered his question, and then I figured, hey, I'll go ahead and tell people at Berkeley what I said. There are a lot of big problems in society right now. We have climate change, dysfunctional political systems, a dysfunctional economy, even down to the level of our everyday lives: workplace betrayals, mistrust, hypocrisies, all kinds of social dysfunction, all kinds of problems around race and gender and class and all these things. These are all human problems. They're all social problems. And the question is, why aren't the social sciences actually solving these problems? We can blame the political class. They don't fund us. We can blame the hard sciences. They're really mean. They look down their noses at us and pretend that we don't matter. It really hurts our feelings.
We can blame the public for being silly and dumb and not thinking thoughtfully, when actually all of us have these two kinds of thinking systems, as Daniel Kahneman would tell us. We have a fast, intuitive thinking system, and we have this slow, thoughtful system. People don't use their slow, thoughtful system as much as we social scientists would like. But really, the problem is actually in social science itself. We don't have really clear answers. We have some theories about what goes on in the world and what causes these problems. We don't have really good data to prove our theories, to refine our theories, and to actually solve those problems. Now, what I'm going to argue today is that the solution to all these social problems is going to require a grand alliance: an alliance between social scientists, software engineers and citizen scientists. People working from their computers, hundreds, thousands, hundreds of thousands of people out there in the public helping out with science projects. And I'll show you a little bit of what that looks like. To create solutions in the world, you have to identify a problem. You have to design a solution. But good design requires having really good theory. Good theory requires having good research. Good research requires good data. Now, if you look around this campus, we have biological engineering. We have physical engineering. We have mechanical engineering. We have chemical engineering. We don't have social engineering. When people say social engineering, we actually think of Nazi eugenics or something really dark and horrible. But by the end of this talk, I'm hoping that most of you will think that social engineering, or maybe something we can call pro-social design, is actually a good idea and something that we could get on board with. Okay, social science, first of all, is really hard. And I want us to be clear about that. This is Neil deGrasse Tyson.
He can say it better than I can because, well, he's Neil deGrasse Tyson. Okay, and the reason we don't design is no one really trusts us to design. We don't trust ourselves and we haven't had adequate data. So our theories are not yet good enough. Okay, let's do a quick theory review from sociology one. Because we have some good theories in sociology and in the social sciences in general, but they need some work. So we have conflict theory. All history is a history of class struggle. This kind of Marxist idea that everything is, you're either dominating or you're being dominated. We have structural functionalist theory that basically says, well, let's just look at society. We have this political sphere. We have an economic sphere. We have a religious sphere, a cultural sphere. And maybe you blend into the functionalism when you start saying like, and don't they look nice together? Symbolic interaction is a theory that looks at what's going on with us every day. How do we build society from the bottom up in our everyday interactions? So when you think about symbolic interaction, think about the fact that when you go to get a cup of coffee, you know how to get a cup of coffee and they know how to give you the cup of coffee. There's a little script for that. When we do a talk, I'm doing a talk right now, there's some expectations. Like I'm gonna stand up here and talk most of the time. You guys are probably not gonna talk too much, maybe a little bit, but that's how it's going to work out. When we're in a classroom, when you're on a bus, whatever you're doing in your life, there's a little script that we're enacting and we're very used to doing that. And that's kind of what the social world really is. It's a collection of all these different scripts that we all enact when we walk around from place to place. This is great theory. And most people, you read this in graduate school and you're like, yeah, that's what's really going on. 
The problem is, we can't really measure it and put a bunch of data behind it. I should give some shout outs here to George Herbert Mead and Erving Goffman. And also, when we start talking about social psychology, which is a little bit different than symbolic interactionism but blends well with it, we need to give a shout out to Claude Steele, who's actually here on campus and does really important work. Okay, what I want us to do for a second is just observe our current symbolic interaction with what we call a dramaturgical approach. It's common in symbolic interactionist theories to think about things in terms of an act and a scene and a setting, and the particular roles that people are playing. So if we're thinking about where we are right now, for this moment, I'm playing this important role of the speaker. Kevin was the introducer. We have a camera person, we have someone doing sound, and we have a lovely audience. We have a lot of expectations about how this works. Like, you know, it'd be really weird if I just walked over here and stopped talking to you. I'm not supposed to do that. We all know I'm not supposed to do that. That's why it's kind of funny; it's a little bit of a gimmick. But we have these expectations, and you have expectations about all the stuff that you're doing in your life. So the stuff of symbolic interaction is like the setting, the scene. People talk about a definition of the situation, or even a working consensus between two people. Once we start having a conversation, we have a working consensus of what's going on here. Like, we're having a beer. We're talking about work. Maybe we'll talk a little bit about our personal lives, but not too much, because we don't know each other. It's gonna be okay. And we enact different roles. Teacher, student, police officer, person who's uncomfortable talking to the police officer.
When we talk about, and I'm gonna talk a little bit about, social physics or social psychology, we're talking about the number of people who are interacting. So dyads, two people; triads; or larger groups. Things change when you have different numbers of people in the situation that you're talking about. And the things that are really important are our expectations, our norms, who influences whom, who has status. These change the way that ideas and all of our symbolic interactions occur. So let's think about this. Think about two, three, or 10 people having a conversation. Just imagine it in your mind. It's a different kind of thing. If it's two people, you might get something pretty intimate going on. If it's three people, you might be wondering, oh, well, can I say that in front of that person? If it's 10 people, it's probably more of a party atmosphere. Imagine 10 people cooking in a kitchen versus two. I'd rather be with two people cooking in a kitchen. The point is that you can start to divide up the world and see the world based on situations and the number and type of people who are in those situations. We can think about this, to use an analogy, as a sort of grammar of social interaction. So when you read a sentence, you're looking at the content of the words, but the fact that there's a subject, a verb, and an object helps guide you and helps you understand what's going on in the sentence. It's not just a random collection of words; they fit together in a particular way. So syntax is to semantics as the situation is to the activity occurring within that situation. Okay, yes, yes, this is nice, whatever. Very cool. But how can you measure all of that? How can we measure all of this stuff of our social world?
We can't put people in all of these experimental settings, and there are so many different variables, and this is what's really been keeping social science, I don't want to say keeping it down, it's been lagging, because it has lacked good data. But what I'm so excited about in my work, and just seeing other people's work, is that there are these really great new data sources. Data sources that are cheap, that are plentiful, and that are actually recording what happens in the real world. They're not experimental settings, and the data is just out there. So we can talk about data that we can get from sensors on mobile devices that are attached to people, like maybe your cell phone, or text data. There are just terabytes and terabytes and terabytes of text data out there in the world, and these are records of what humans are saying to each other, what they think about things, or how they think about things. And then there's so much video data, and then we also have this new kind of data source: these online communities. Twitter is an online community; so is Stack Overflow; so is Reddit, Facebook, whatever. These are humans interacting. They might be interacting in somewhat different ways than we interact here in physical space, but this is real human interaction; we can learn something from it. The other thing that's exciting is this whole data science thing that's going on. When I first got started in data science, I kind of wondered what it was. I heard someone say this the other day and it seemed pretty apt: what is data science? "I do statistics on a Mac." But there's actually something here, and I'll talk about that more over the course of the talk. What it's really about is getting the most out of your data, getting the most inference you can out of your data. Okay, so first, when we're talking about new forms of data, I wanna talk about sensors and what we call sociometrics.
So there's this guy Alex Pentland, goes by Sandy Pentland, at MIT, who has a really great lab. And what they're doing is modeling people using these sensors, these little badges that sit on the person and take very thin data on interactions. So they can know that Kevin and I are in the same room, and that Stefan's in the room here too; they can know who's in the room. It's not really recording what we're talking about or what we're saying, but you can actually get a lot of information from this. So think about what they're kind of doing here. If you know anything about ants and the social interaction of ants: individual ants are very, very dumb. You wouldn't expect them to be smart; they're not. But as a community, they do some amazing things. They have these huge land-building projects, and they take other species and impress them into work and force them to do their bidding, and they build these really complex societies, even though each individual ant is dumb. There's something about the interaction, and actually I think they're using some kind of chemical trails or pheromones to do it. But if you study humans this way, it turns out, even though we're not as dumb as ants, arguably, you can learn a lot about what humans are doing. I want to give a shout out here to Georg Simmel, who actually first came up with social physics. As great as Alex Pentland is, Georg Simmel started it. Okay, so this is what it looks like. This is just a schematic of a workplace. They had badges for each of these people, each person has a number, and they can kind of track what they're doing all day long, and you can learn a great deal. So they found that in negotiation and decision making, prosodic emphasis, the rate of your voice, conversational turn-taking and so forth predict about 30% of the variance in individuals' outcomes.
So you might get more or fewer raises depending on the style of your speech, as opposed to exactly what you're saying. By looking at how people talk to each other in the schematic, how they move around, they can derive the influence networks of that whole community. They can see, oh, this person is a central node and a lot of ideas come from them. They can see something like that, or they can see how exposure to peer behavior influences individual behavior. And then you can also take something like these sociometers and bring them into something like a board room meeting, and you can see that in board rooms where everyone participates equally, you get much better outcomes. There's kind of a higher group intelligence and creativity. And then moreover, if you actually tell people in those meetings, hey, you need to back off a little bit and let this person speak, it actually improves the outcomes for everybody there. So you can improve social intelligence by using stuff like this, even though it's gathering rather thin data. If you read Pentland's work, you'll see this phrase over and over and over and over: we're just scratching the surface. Well, the other phrase you'll see is: we formed a spin-off company. But this is the one that you see the most. We're just scratching the surface on this, and it's really true. The data collection sites that they've looked at so far were a dorm network, a grad school network, a workplace network. They did this pretty cool one where they had all the cell phones from the Ivory Coast. Now, they're not getting very thick data on that, but the point is there's a lot more to be done here. It's really a new field, but what it's doing is bringing field research to the kind of social psychological research that's existed for quite a while: research on contagion, diffusion, influence, all this stuff that we have a pretty good handle on but haven't been able to back up with data.
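The step from badge data to influence networks can be sketched in a few lines. This is a toy illustration, not Pentland's actual pipeline: the co-presence log format and the names are invented, and the centrality measure here is a deliberately crude weighted-degree score.

```python
from collections import Counter
from itertools import combinations

# Hypothetical badge log: each record lists whose badges were detected
# together in the same room during one time window.
copresence_logs = [
    {"alice", "bob", "carol"},
    {"alice", "bob"},
    {"bob", "carol"},
    {"alice", "bob", "dave"},
]

# Count how often each pair co-occurs; the counts become edge weights
# in an interaction network.
edge_weights = Counter()
for room in copresence_logs:
    for pair in combinations(sorted(room), 2):
        edge_weights[pair] += 1

# A crude centrality score: total weight of edges touching each person.
# Highly central people are candidate "influence hubs" in the network.
centrality = Counter()
for (a, b), weight in edge_weights.items():
    centrality[a] += weight
    centrality[b] += weight

print(centrality.most_common(1))  # [('bob', 6)]: bob co-occurs with everyone most
```

With richer timestamps you could do the same thing over sliding windows and watch the network change, which is roughly the kind of longitudinal picture the dorm and workplace studies rely on.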
Now, the thing about Pentland's work is we probably want to know about the content of what's happening too. So to understand that, and I'll come back to how you can marry these two things together, the place where we can really understand the content of what humans think and say and do is by looking at textual data, whether it's newspaper data, the congressional record, huge archives all around the world, going back into history. What's cool about this data is that it's about humans, and it's not some experiment; this is exactly what people said and thought. It's really cheap to acquire, and we just have to figure out how to parse all of it. How can we actually analyze it? There are some cool automated tools. I won't go through these, but there are some concerns about how useful they are, how valid they are. We have to remember that computers do not read natural language. Our language is full of all kinds of ambiguities; computers read numbers, and we have to teach them how to read natural language. So this is still a frontier, but it's being settled rather quickly. They can't do things like identify the boundaries, the temporal and social boundaries, of some event or situation. Like this talk: they couldn't find that. They can't sequence actions. We've trained them now so that they can look for a subject, verb and object, so they can say this person is doing this to this person, but they can't go through and say this happened before that, which happened before that, which happened before that. They don't recognize metaphor or sarcasm, like, whatever. And they don't do very well with gendered, racialized or otherwise culturally contextualized language. They're probably not gonna work for your research by themselves, but there are ways that we can use them. There are ways that we humans can bring our intelligence to the table and work with them together.
So what I wanna talk about now is some of my work. I call it the new event analysis. For quite a while researchers have been looking at events. I look at events of interaction between police and protesters, but you can look at any sort of event that happens over and over and over. It could be a wedding, it could be a dinner party, academic talks, whatever. Shout outs to the people who have done some of this stuff. So what's the kind of grammar, if we're using this conceit of grammar, what's the grammar of a protest movement? Well, you've got actions, maybe just a one-off action sometimes, or maybe two people interacting together, maybe a chain of actions that could all happen within an event, this yellow circle. And if all these events chain together, like it's an Occupy movement that's doing event after event after event and they're sensibly linked, you can talk about a campaign of events. And then all of this is happening in the larger context of a city. Getting data on all of this is something that, for the most part, social scientists have not even really attempted to do. I tried to do it because I was ignorant of how hard it would be. Fortunately, the technology helped out and a bunch of people helped out, and we managed to do it. But this is pretty new. What people have done thus far is they've looked at events, just the yellow circles, and they've looked at the yellow circles one at a time and analyzed them as if they were separate things. They were able to conceptualize them as campaigns, but they weren't able to measure them as campaigns, as linked events. Okay, so the data that I look at to get all of this information about these actions, interactions, events, campaigns and everything: I looked at over 8,000 news articles describing the Occupy movement of 2011. 184 different Occupy campaigns across the United States, thousands of separate events, thousands of separate actions.
And then I had data on the cities themselves. So like I said, these automated algorithms cannot find events. They can't find sequences; they can't find stuff like that. You have to have humans do that. So that's what we did. I had a team of, at the highest point, 14 people, but we had something like 20 different undergraduates in 2012 and 2013 going through these 8,000 news articles, and what they're doing is identifying events. They're identifying the boundaries of particular events. So without reading this text, I can tell you, because I know how it works, that the blue text is about a police-initiated event. It could have been the police raiding a camp or giving a warning or something like that. This orange text is about a protester-initiated event. We don't know what kind of event, we don't know what they were doing, but we know that they initiated it. And we also did some hand-coding, or labeling, for the stuff in the brick color, which is like goings-on at the encampment. This is some other kind of stuff that you can look for. Okay. Now what we do with this is something pretty cool called performance modeling. If we remember back a couple of slides ago, you might have noticed that some of these actions, these interactions, appear to be roughly the same thing, and that actually fits really well with our symbolic interactionist understanding of the world. We have these routines, we have these performances that we enact over and over and over. I'm giving you a performance called a talk right now. If someone else gave this performance, it would be a little different, but it would include some basic things, like a person standing up in front and an audience listening. So what we wanna do is understand what these performances are, so we can understand how the actions and activities of the Occupy movement change from event to event to event. You can look at police activity as well.
So we did this in a pretty fun way. Some of you might be familiar with topic modeling. Topic modeling is used by people who want to read 10,000 documents at a time with a computer. What they do is they feed all these different documents to the computer and say, find the latent topics of interest that are happening across all of this text. And the computer does that by establishing term counts. So each document is basically represented as a list of all the different unique words that occur in that document, with a count of how many times each occurs. And then it looks across all the documents and does sort of a clustering at the middle level, between the documents and the terms, to find out what the topics are that occur in a document. We did something a little bit different. It's called performance modeling. In our case, the document is not a full news article. It's just that text that's about the police-initiated event, or just that text that's about the protester-initiated event. So now we have a document that corresponds to an event. We also have all these words that are talking about the activities that occur within the event, the subject-verb-object triplets that are happening within the event. And when we do LDA topic modeling on that, what we get as output is an understanding of what these performances are. This was the pipeline. Sure, it was a lot of work, but don't worry, everything in green was automated. Everything in orange is just data; that's okay, it's not actually work. The work was everything in red, and it took some time, fine, but it went way faster, 10 times faster, than what people have done in the past. Okay, I'm gonna breeze through this for the people who are interested in how we did it. We extracted subject-verb-objects. We would take a sentence and we used ClausIE. I should give a shout out here to Aaron Culich.
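The core of the performance-modeling step just described, a document-term matrix of event-level text fed to LDA, can be sketched like this. This is a minimal illustration assuming scikit-learn; the "event texts" are invented stand-ins for real subject-verb-object output, not the talk's actual data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy "documents": each one is the text of a single event, reduced to
# subject-verb-object style phrases (invented here for illustration).
event_texts = [
    "police raid camp police arrest protesters police clear park",
    "police issue citations police enforce ordinance police ticket campers",
    "protesters march downtown protesters block street protesters chant",
    "police raid camp police arrest protesters police evict campers",
    "protesters block port protesters march protesters occupy plaza",
]

# Document-term matrix: one row per event, one column per unique word,
# cells holding term counts.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(event_texts)

# LDA clusters at the middle level between documents and terms, pulling
# out latent "performances": bundles of actions that recur together
# across events (e.g. raiding/arresting vs. marching/blocking).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)

# Each event comes out as a distribution over performances.
print(doc_topics.shape)  # (5 events, 2 performances)
```

In the real pipeline the documents would be the hand-bounded police-initiated or protester-initiated text units, and the vocabulary would come from the extracted triplets rather than raw words.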
I don't know if you wanna raise your hand, but he was instrumental in helping me do this. And we extracted these subject verb object triplets. We concatenated them. We appended them to the event text unit. And then we did our performance modeling and I'll show you some of the output of that in a minute, some of the findings. But the upshot in general for social science is that if you can identify situations or events in text, you can use this performance modeling approach and you can find out what are people doing across these different sorts of events or situations. And you can start to see regularity and normality and you can actually measure this stuff of social interaction. Okay. Once you know what those performances are, you can do something that's really fun. So I told you that I also had data on what was happening at the city level where all of those Occupy campaigns were happening. So I know what the city political situation is. I know when their elections are coming up. I know what their police department capacity is, how much money they have per citizen, how many officers they have per citizen, whether they have a community policing philosophy, all these different variables. And I can regress those variables as independent variables to find out what the likelihood of a particular performance is for a particular event of the movement. I can also use time in that analysis and I can do something really cool where I can show you over the course of the Occupy campaign from day zero to day 100. And most of them didn't last longer than 100 days. I can show you that cities where an election was coming up, these green lines, they were more likely to do a performance that we discovered, a performance called ordinance enforcing, giving out a bunch of tickets, a bunch of citations. There's an election coming up. We don't necessarily want to have a big group arrest where we come in with our officers and arrest hundreds of you at a time. 
That would end up in the papers. So we do less of that. The green line, this sort of ordinance enforcing, we do more of. And we do more of these individual arrests, where we're just gonna pick people out one at a time. Okay, maybe I should have told you what the literature has found so far. If we remember back to my slide where I had events as the yellow circles, that was all you could work with. All you could work with was the yellow circles. In political science and sociology right now, until this is published (it should come out soon), the going theory is that police do not act strategically. They do not act in a way that is colored by the political environment in any way. They only react to whatever protesters are doing. So when you only have data on events, and you see that the police were violent and you see that the protesters were violent, and that's the only data you have, you don't understand it as a campaign. You don't understand the interactions that are happening within the event. You end up producing, in the top journals of political science and sociology, findings that say police just react to protesters. Not true, not at all. We also found that departments that are kind of understaffed reacted really quickly to the Occupy movement, to try to shut it down really early. We found that departments where there was a high violent crime rate kind of looked at the movement and were like, nah, not that big a deal. We have people murdering each other all over; this is not a big deal. We found that departments that were really committed to community policing, that actually carries over into the way they interact with protest, and they are more accommodating and treat it better. So there are a lot of findings here that were not possible because we didn't have the data. Now we have the data. Okay, I love the data that we have, but I want even better data.
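The city-level analysis described above, regressing city covariates on the likelihood of a particular performance, can be sketched as a logistic regression. Everything here is hypothetical: the variable names, the handful of rows, and the outcome coding are invented for illustration, not the talk's actual dataset or model specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical city-by-event rows: [election_upcoming, officers_per_1000,
# community_policing, day_of_campaign]. Outcome: 1 if the event was an
# "ordinance enforcing" performance, 0 otherwise. All values invented.
X = np.array([
    [1, 2.1, 0, 10],
    [1, 1.8, 1, 40],
    [1, 2.5, 0, 75],
    [0, 2.0, 1, 12],
    [0, 1.6, 0, 50],
    [0, 2.2, 1, 90],
])
y = np.array([1, 1, 1, 0, 0, 1])

# Regress the city covariates on the likelihood of the performance.
model = LogisticRegression().fit(X, y)

# Predicted probability of ordinance enforcing for a city with an
# election coming up, on day 80 of its campaign.
p = model.predict_proba([[1, 2.0, 0, 80]])[0, 1]
print(round(p, 2))
```

Including the day-of-campaign term (or interactions with it) is what lets you trace curves like the green line over days 0 to 100 of a campaign.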
In an ideal world we would have humans ensure the accuracy of every last piece of data we have. Again, these automated algorithms aren't great for everything. So I want to incorporate humans into the process to make it even better. But if you've ever done one of these projects, they take 10 years. The Dynamics of Collective Action project, which was pretty comparable to mine, required six professors and over 100 undergraduate students, and it took them 10 years to complete. Not something I want to do. So the big problem, if you talk to people who have done this kind of work before, where they get everything done by hand, and by everything I mean as much as you can, because actually they didn't get everything, they just got a fair amount before they pulled their hair out and quit: the big problems are these workforce bottlenecks. You're working with undergraduates who, like me, really like to gain some sort of mastery over something and then go do something else. So as soon as you train them up, they leave. They go play Ultimate Frisbee like I did, or they graduate. Now there's another possible workforce, which is new. Citizen scientists, crowd workers: people who, on their computers, if you can detail the work for them in the right way, can actually do the work for you. So this is what I wanted to look into. Can they actually do this kind of hand coding of text? And in so doing I kind of developed this new crowd content analysis assembly line. Let me define content analysis. Content analysis is, well, you can read it. Can everyone read it in the back? Okay, it's just systematically, replicably pulling information out of text. So what you end up doing is you create this big scheme by which people go through with highlighters, and people have been doing some sort of content analysis for a long time. It got started with marginalia notes in old Bibles. This person's really getting after it. They've got all kinds of notes.
They've got different color highlighters, because different colors mean different things. Here's someone whose approach is probably not so systematic. And lately, in the last decade or so, maybe a little more than a decade, this has moved onto computers, and that makes it kind of better. But I'll get back to that. We trust humans to do this because, well, when we use language there's a pretty good correspondence in our understanding. If I write everything out very, very clearly and I train you and I stand over your shoulder, I can believe that you're gonna go through and label those documents the way that I would do it as a researcher. So that's why we kind of like that process and we trust it more than these automated approaches. But the question is, can it scale? So let's think about this. I had over 8,000 news articles about the Occupy movement, and I didn't want to get just 26 variables out of there like the Dynamics of Collective Action project. I wanted to get 137 variables. I was really greedy. I want to know every last thing that's going on. And this is my coding scheme. This is what they call labeling, or hand coding. This is my coding scheme. This is everything that can happen in a movement. I tried to think about it with my team, and we got it all down on paper. These are all the things that are important. So we want to figure out a way to get somebody to read through articles and know that they need to look for all that information. Seems wild, crazy, probably not going to happen. The existing software is not really set up for that. You have to train people into using it. It's a relatively foreign process, learning this coding scheme and then applying it to all these documents. It just didn't work.
And so this was really concerning, because that other project only had 26 variables and took them 10 years, and I want to know everything. This is not a good idea. So I was trying to figure out some way to work with these crowds, and here are all the reasons I was told by many people that it wouldn't work. These people have to be trained. Training a thousand people is not feasible. You can't get all these people to do this. They don't understand our scientific concepts. You can't expect them to read a whole news article, come on. You can't expect them to reliably apply that coding scheme; the work is just too cognitively complex. Okay, I did not give up. I did not give up. That's why I'm still standing here. That's why I'm at BIDS: because I did not give up. Okay, here's the problem. Very hard. How are we gonna do this? Well, if you go to the computer science side of campus and you talk to anyone about crowdsourcing, they're like, you need to do task decomposition. You need to decompose that task into some kind of assembly line, break it out, and then maybe it can happen. I said, okay, that's a cool phrase. Task decomposition, I can do that. So it seems like there are two ways to do this. You can reduce the amount of text people have to read, or you can reduce the number of variables they have to think about while they're reading through that text. Those are pretty much the only ways to do this. So we figured out how to do both. Okay, remember, this is what we're looking at. Now there's something really interesting going on here on the right in my coding scheme, and I'm gonna talk about this as a two-stage approach to content analysis. What's really going on there is that one of those pages is about everything that could happen at a police-initiated event. The next one is about everything that could happen just at the camp. Another one was about everything the city does.
We had another one that was about what could happen at protester-initiated events. Each of these is like a different branch of this long coding scheme. And when you start to think about it that way, and think about what we've done over here, we can basically say all of that blue text has to do with this stuff, all of that brick-colored text has to do with this stuff here, and none of the rest is gonna pertain to it at all. So actually what we've done now is we've gotten to a point where, because we found those events by hand, we have shorter text units, and instead of 156 variables for that text unit, we have like 12 or 14 variables that can be pulled out of it. We made that branch of the coding scheme smaller. We made the reading task smaller. So we did it. We decomposed the task. Great. Now how are you actually gonna get crowd workers to do that? Turns out you have to build software, because the software doesn't exist. So that's what we did. We started building software, and it's called TextThresher. It's apparently not a good name, because I have to explain it. A thresher, if you ever grew up on a farm, is a piece of agricultural machinery that separates the seeds from the plant material you don't want to eat. So we're separating the seeds of important text from the rest. Yeah, it's not a good name. Okay, but it got funded anyway. I worked with a software developer. I don't write software, but I was able to work with someone to create a prototype, and we're now in partnership with Hypothesis and the Alfred P. Sloan Foundation. I wanna put up a logo for the Berkeley Machine Shop, because the Berkeley Machine Shop has been incredibly helpful. Shout out to my friend Steph there for helping us move through this. What's so great about TextThresher is it allows you to get people to do this content analysis while avoiding training them face to face.
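To make the decomposition concrete, here is a minimal sketch of the two-stage idea in Python. The event types and variable names are hypothetical stand-ins, not the actual Occupy coding scheme:

```python
# A minimal sketch of the two-stage task decomposition described above.
# Branch names and variables are hypothetical stand-ins for the real scheme.

# Stage 1: the long coding scheme is split into branches, one per event type.
CODING_SCHEME = {
    "police_initiated": ["arrests_made", "force_used", "dispersal_order"],
    "camp": ["tents_present", "general_assembly_held", "services_offered"],
    "city_action": ["permit_revoked", "eviction_notice", "negotiation"],
}

def tasks_for_unit(event_type, text_unit):
    """Stage 2: pair a short text unit with only its branch's variables,
    instead of the full 100+ variable scheme."""
    return [(text_unit, variable) for variable in CODING_SCHEME[event_type]]

# A worker coding a camp paragraph sees 3 variables, not the whole scheme.
tasks = tasks_for_unit("camp", "Protesters held a general assembly at dusk...")
print(len(tasks))  # 3
```

The point of the sketch is just that both levers get pulled at once: the text unit is shorter, and the variable list attached to it is a fraction of the full scheme.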
I love people, I would love to train them, but you can't train 1,000 different people to work on your project. So what we do, after we have those text units that are about just this event or just that event, is we set up a piece of work for them that looks like a reading comprehension task. The same kind of thing we've been doing since middle school. Anyone can do it. You read through and you answer the question, and then, what's different about TextThresher, is you highlight the text that you used to justify your answer. So it's reading comprehension plus highlight-the-text-to-justify-your-answer, which is probably what schools should be doing anyway, just as a side note. Okay, so yeah, it's just like in junior high school. And what TextThresher does is guide people through the shortest possible list of questions. A lot of those variables, you see, form a kind of hierarchy. Only if they answer yes to this question, like if there was a blocking action, do we ask the next question: well, what did they block? Was it the sidewalk? Was it the Port of Oakland? Was it the street? So we can ask the smallest number of questions. And here's a wireframe of what it looks like. Maybe in this black text is something that's already been highlighted by our crowd. And someone goes through and they answer a question, and they highlight in blue the text they used to justify it. They go to the next question, they highlight in green the text that justifies their answer to that question, and so forth. Bam, bam, we did it. Okay, now TextThresher will be out soon. It'll be out soon. I've been saying that for a while. But this is my first time leading a software team, so give me a break, just a little bit. Okay, now we don't trust every last crowd worker to give us the perfect answer every time. So for every question, you get like eight of them to do the task.
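The "shortest possible list of questions" hierarchy just described can be sketched as a small conditional tree. The blocking example is from the talk; the data structure and function names are illustrative, not TextThresher's actual implementation:

```python
# Sketch of a conditional question hierarchy: follow-up questions appear
# only when the parent answer triggers them. Question IDs are hypothetical.
QUESTIONS = {
    "blocking_occurred": {
        "prompt": "Did protesters block anything?",
        "followups_if_yes": ["what_was_blocked"],
    },
    "what_was_blocked": {
        "prompt": "What did they block? (sidewalk / street / port)",
        "followups_if_yes": [],
    },
}

def ask(question_id, answers):
    """Walk the tree, collecting only the questions a worker actually
    needs given their earlier responses."""
    asked = [question_id]
    if answers.get(question_id):  # only descend on a "yes"
        for child in QUESTIONS[question_id]["followups_if_yes"]:
            asked.extend(ask(child, answers))
    return asked

# A "no" ends the branch after one question; a "yes" pulls in the follow-up.
print(ask("blocking_occurred", {"blocking_occurred": False}))
print(ask("blocking_occurred", {"blocking_occurred": True}))
```

With a deep coding scheme, this kind of pruning is what keeps each worker's task down to a handful of questions.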
If six of them agree, you're like, yeah, that's pretty good. That's good enough for scientific purposes. So you get them to kind of vote on what the right answer is. The other thing that's really cool about this is you have labeled text that corresponds with a variable or an attribute. Then you can go back and train a computer to try to label that text the way the humans have. So the long-term implication of this is that you can train a computer to read the news and find out what's going on with something like the Occupy movement or the Black Lives Matter movement or whatever the next movement is. Okay, oh, we also have this really cool thing, and I gotta give a shout out to Manisha, one of my undergrads, where we basically use those automated algorithms to give hints to the crowd workers. If they're having trouble with something, we can say, according to our algorithm, we think this might be the answer. If you're looking for the number of people or the dates or something like that, the stuff that the automated algorithms can do, we let them do, and we put them into a mini work environment with the humans. Here's the long game of this. The long game of TextThresher is that whatever kind of text data you have, and whatever kind of research or coding scheme or theory about what's going on with that data, you can feed it into TextThresher. You have the text, you have the scheme for what you want to get out of that text, and in the end you get this rich, contextualized, transparent, qualified, beautiful, just the best data. At the same time, you're building these algorithms that can do that without the humans. They're not gonna be great at first, but they'll get pretty good, and someday they'll be really good. And someday they'll be so good that whatever your text data is, there will already be an algorithm ready for it, and you get the great data in the end.
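The voting scheme described above, eight workers per question and an answer accepted when six agree, can be sketched like this. The quorum numbers are the talk's example; the code is illustrative, not TextThresher's actual aggregation logic:

```python
from collections import Counter

def resolve(worker_answers, quorum=6):
    """Majority-vote aggregation: each question goes to several workers
    (eight in the talk's example), and an answer is accepted only if
    enough of them agree; otherwise it goes back for review."""
    top_answer, votes = Counter(worker_answers).most_common(1)[0]
    return top_answer if votes >= quorum else None

# Six of eight workers agree, so the answer is accepted.
answers = ["street"] * 6 + ["port", "sidewalk"]
print(resolve(answers))  # street

# Only five agree: no quorum, no accepted answer.
print(resolve(["street"] * 5 + ["port"] * 3))  # None
```

In practice a system would route the `None` cases to more workers or to an expert, but the accept-on-quorum idea is the core of it.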
Other people talk about this and they're like, oh yeah, that's artificial intelligence. That's how you build artificial intelligence. And I'm like, I guess that's right. I guess that's how you do it. Okay, there are so many things you can study with this. You can classify the speech acts of every single member of Congress while you look through the Congressional Record. That's what we're doing at the Computational Text Analysis Working Group at the D-Lab. You can look at the agency of some particular subject, like how much women or African-Americans or veterans are treated as the object of a sentence versus the subject of a sentence. You can look at that over time and see changes in the evolution of these different identities in society. You can parse Craigslist ads and do a sociological analysis of how horrible it is to try to find a place to rent in the Bay Area. And some braver person than me could look into all that WikiLeaks data or the Panama Papers, and they could see the chains of interaction that create our dysfunctional world. Pretty much anything you wanna study in this dramaturgical way, where people are coming together in a particular situation: dating, campaigns. This is the stuff that we all talk about and we all struggle with. Classroom observational studies. Teachers don't wanna teach to tests. They wanna replace testing. They wanna say, look at this holistic approach to the classroom, and look how the students and the teachers and everyone are working well together. We can do that if we have observational data, if we have text data. Okay. And at this point, you're like, wait, wait, wait, we don't have that. Okay, well, this is kinda cool. If we have the sensors that can learn all of this stuff with very thin data, and we have this capacity to understand text, it'd be really nice if we had both of those together. That would be the coolest.
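Returning to the earlier point about training a computer to label text the way the humans have: as one illustration, here is a tiny Naive Bayes classifier over bag-of-words counts, built from scratch. The labeled spans are invented stand-ins for crowd-labeled text units; a real pipeline would use far more data and an established library:

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count words per label from labeled text spans."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def predict(counts, text):
    """Score each label with add-one-smoothed Naive Bayes log-likelihoods."""
    words = text.lower().split()
    scores = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values())
        vocab = len(word_counts)
        scores[label] = sum(
            math.log((word_counts[w] + 1) / (total + vocab)) for w in words
        )
    return max(scores, key=scores.get)

# Invented training spans, standing in for crowd-labeled text units.
EXAMPLES = [
    ("police used tear gas to clear the camp", "police_action"),
    ("police arrested protesters at the camp", "police_action"),
    ("marchers blocked the port", "blockade"),
    ("protesters blocked the street downtown", "blockade"),
]

model = train(EXAMPLES)
print(predict(model, "police used tear gas at the camp"))  # police_action
print(predict(model, "marchers blocked the street"))       # blockade
```

The crowd's highlighted justifications are exactly the kind of span-level supervision that makes this sort of model trainable at scale.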
Well, it just so happens, and this was really funny: I was thinking about this, and then on Tuesday there was somebody here giving a talk about how they're creating this app that works on your cell phone that will record whatever you're saying, and if you have a group of several people, it'll take what you're saying and turn it into transcripts. They're originally doing this for people who are deaf or hard of hearing. I changed the title of the talk because they tried to use the buzzword machine learning, which is correct, but really what's most interesting is that they have these sensors that convert speech to text while tracking the speaker. So you could literally have that kind of sociometric sensor data, and you could also have text data, and you could have people parsing the text data to really get an understanding of what's going on in our social reality. There's related work happening here on campus. Nikki Jones has a ton of video data of police encounters with citizens. Wouldn't it be great if we could take a TextThresher approach to video, where you have people saying, okay, here's a particular scene, here are the interactants, we're gonna code up all that information, and then we can compare across hundreds or thousands of police encounters? We could start to see the sequences of interaction that lead to things going poorly or going well. And here I should give a shout out to Brent and Sykes, who did some work at a very small scale on this a couple of decades ago. There's also Collin Baker here on campus, who's leading the FrameNet project, which, as he describes it, is kind of an artisanal approach to theorizing these different frames. He calls them frames: situations that people find themselves in, and the sorts of words and things that go with them. So he and I are talking. We just met at a party randomly. We should have met years ago, but we met at a party randomly.
Yeah, Collin and I go to the same parties. It was a baby shower. Okay, key to all of this is teamwork. I had to get dozens of undergraduates to help do this hand coding to start out with, but we also need all these crowd workers out there, and we also need this alliance between social scientists, software engineers, and people who know how to do crowd work. BIDS is a place where that can happen. I am still not a software developer, but I can, without lying, claim to be something of a software architect now, and that's only because BIDS happened. I had all of these ideas to make this stuff, and none of it was gonna work if it weren't for a place like BIDS. Okay, now I wanna add to my resume. In case things don't work out, I tried to put together an animation. If things don't work out, I'll go to Pixar. Okay, this is kind of my model of what data science with crowds looks like. It starts with a social scientist, kind of genderqueer, glasses, thinking things, and they have this theory of the world. Right now this theory is represented as a decision tree, but they have this theory of how the world works. If we're in a protest event, then these things are possible. If we're in a police-initiated event, these things are possible. You could think about it that way. So what they need to do is take that thought, bring it out into the world, specify it, get it into a computer, and then they can start to do some really cool things. So there's all this data out in the world, and then there are these crowds of people who can help us process the data. This is what happens. I'm so embarrassed of this, but I kinda love it, I kinda love it.
So the crowds help us process the data, and what happens is the data kind of flows through, and then you start to see, like, okay, I had this model of everything that was possible, and this is kind of where we are in social science with our symbolic interactionism. We know what's possible. If you read through all that stuff and then you look out in the world, you're like, oh yeah, Goffman got it right. Oh yeah, Garfinkel, he really figured that one out. But what we don't have is this data. So what we do is we get the help of the crowds, and we have the data flowing through the system, and we start to see, actually, this thing doesn't happen much at all, and neither does that. It's in our model, it's something that's possible, but it doesn't happen very much. With data we start to see, okay, what's really going on in the world looks more like this. And then we take that information and incorporate it back into our model. So now we have a model that's not just what's possible but what's likely, all right? And when you have a model that says what's likely, you put that into your machine, you get more data, and you just keep the process going, and you get a better and better understanding of the world. To the point where you have really great theory. You know what happens when you have really great theory? Talk to the physicists, they'll tell you what happens. The biologists will tell you. The chemists will tell you what happens when you have really great theory: you can actually do something with it. You can actually start to design, how are we doing on time? You can start to design solutions to problems. What time is it? It's about two, okay. Because this is a point in the talk where I can skip this part, or I can go, I think I can do it, I'll go fast. But we get to a point where we can actually design things. Now, I talked about sensors, I talked about text.
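The loop just described, going from a model of what's possible to a model of what's likely, can be sketched as a simple frequency update. The event names here are hypothetical:

```python
from collections import Counter

# Theory says what's possible; crowd-coded data says how often it happens.
# Event names are hypothetical stand-ins.
POSSIBLE_EVENTS = ["march", "blockade", "police_dispersal", "camp_eviction"]

observed = ["march", "march", "police_dispersal", "march", "blockade", "march"]

def update_model(possible, observations):
    """Turn a list of possible outcomes into estimated likelihoods, keeping
    zero-count outcomes in the model (still possible, just unobserved)."""
    counts = Counter(observations)
    total = len(observations)
    return {event: counts[event] / total for event in possible}

model = update_model(POSSIBLE_EVENTS, observed)
print(max(model, key=model.get))  # march
print(model["camp_eviction"])     # 0.0
```

Each pass through the loop feeds the updated likelihoods back into the next round of data collection, which is the "keep the process going" part of the talk.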
There are some other data sources out there that are really interesting, that can help us solve problems and make the world better if we can understand them and build models based upon them. These online communities: Stack Overflow, social media, Wikipedia, Second Life. These are actual human beings actually interacting, even if they're doing it in some particular way, in some way that's constrained by the interface. It's interesting in its own right. There are people who are looking into the US Congress on Twitter. This is the Social Media and Political Participation lab at NYU. I was there last March, hanging out with them. They're doing some really neat stuff. They ask questions like: does slacktivism help or hurt movements? Slacktivism is like, I'm not gonna go out in the street, but I'll like something on Facebook. And people are like, that's not real activism, it doesn't matter. Well, if you look in the data, it actually helps. The people who are gonna be in the streets are gonna be in the streets anyway. But if you have a bunch of people doing this stuff on Facebook, it raises people's consciousness about whatever you're doing, and it seems to get more people on the streets. Twitter and Facebook, are they more than just echo chambers? Everyone says they're just echo chambers. Well, there's some echo chamber stuff going on, but there's also more going on than that. And if you look at how they work, you can see that there are ways we can move people away from just doing echo-chamber-like behavior. At that lab they're also detecting bots, Twitter bots that are doing propaganda for the regime, and they're identifying terrorist networks. And at some point, we wanna link their social media data to my data on events to see how these things interact. There are other people looking deeper into the content of what's happening on Facebook and Twitter, and they're finding really interesting things.
So it's really easy to derogate the millennials and Generation Z for doing these weird, fluffy things on Facebook that don't matter. But what's actually happening when kids go out on Facebook and have these weird interactions is they're kind of trying things on. They're doing a lot of code switching, where they're speaking this way to one audience and that way to another audience. Stuff that politicians do all the time, but we can look at it in real time, and we can learn something about how a different messenger speaks to a different audience. And then there are also really interesting things about new online communities like Stack Overflow. We see there that there's a different kind of pro-social behavior than we might get in our everyday lives. The pro-social behavior is more contagious, because when I help someone on Stack Overflow, let me be honest, when someone helps me on Stack Overflow, they're also helping about a thousand other people who might have the same question I have. Quasi-anonymous discussions are more egalitarian, because they're not speaking at you with a particular identity that must mean something according to my stereotype, and I don't have to enact the stereotypes about me, because we're just focusing on the ideas. There are more unique ideas and more open exchange, more questions, more self-disclosure, less fear, more social trust. These are things that can occur in an anonymous or quasi-anonymous discussion. It's different than our normal world, but it's still very interesting, and it could have implications for design if we start thinking that way. So we're at a place right now, or we will be in the next few years, where we can have this really rich, great data to understand the social world, and we can build really great research projects off it, and we can have amazing theory. Can we do design with it? That's something we're almost taught not to do here in the academy.
Okay, so the first thing we have to do is make sure we never call it social engineering. I recommend pro-social design. Maybe social design is fine, but just to be clear, we're trying to do good things here. And there are a few different classes of things I'd like to see out of a pro-social design agenda. I'd like to see us optimize what exists here in the world. What interventions can we figure out to make our lives here and now better? What kinds of things that exist in the offline world could be moved online, where maybe we sometimes behave better, or behave in ways that are more charitable to one another? And then, pretty interesting I hope, is this idea of layering. So I said up front that part of the reason people don't use social science is because when you're living your everyday life, you don't have time to go back into your card catalog and consult with Goffman and Schutz and Weber and everyone to come out with the right answer. You're acting quickly, and what would be really nice is if we could bring our knowledge of social science directly into the world in a way that we can use it in real time. The idea here is to get better information flows. Pentland's work showed that they could reduce echo chambers. We could have better roads, we could have better classrooms, safer, less violent police protests. That's a big thing for me. And a better, more responsible media. I'm working on a project right now with Saul Perlmutter, of all people, to use TextThresher to have people reading through the news. And instead of us just having to read the news and agree or disagree with someone's argument, if we have people using TextThresher, and they go through and identify very common argumentative fallacies, formal logical fallacies, and even fallacies of scientific inference, we can actually get to a place where people can judge the news not just on whether they agree with it, but: is it rational? Does it make sense?
Do we have some scientific measure of the quality of that information? Eventually we can have a better, more responsible news media with something like that. When I talk about moving from offline to online: there's some really great research showing that you can get people to deliberate and have conversations where they actually produce really fruitful, constructive outcomes, as opposed to the very adversarial kind of debate which we have every four years, where we kind of forget what they said, and then they say it again, and we do it another four years. If you have everything on the same page, if you have everything recorded, you can deliberate out to the nth degree until we get to a point where we're like, okay, we've talked all the way through this, and now we can go through and read things and say, this makes sense, this doesn't make sense. We can use a kind of crowd wisdom to build up deliberations about what we should do next, however we wanna define that. When I talk about layering, this is what really excites me. What about, instead of a wearable technology that tells you, like, hey, summer's coming up, a wearable technology that can intervene in our moments of most difficulty? You're in that boardroom meeting, something weird happens, there's some microaggression. It'd be nice if someone could say, hey, I saw that, you're cool, that's gonna be all right. It'd be nice to have a little ally right here in this moment, and I could just, very discreetly: I saw that, you're good. Or maybe the phone itself, or the wearable technology itself, can give you hints in a social situation. Maybe you're a somewhat socially awkward person, or at least everyone has awkward moments. I love awkward moments. As a social scientist, I get to really sit in them and stew in them and enjoy them.
And if something bad happens, I can be like, that was just a social experiment. But most of us don't enjoy that. We want some way to get clued in that, oh yeah, there's actually 10 years of research that explains what just happened, and this is why it's awkward, and this is how you fix it. We can do things like that. We can design things like that if our theory is good enough. If we have good enough data to know that, oh, you're with three people at a dinner and this was said, then 90% of the time, here's the problem, and here's how you correct it. That's what I want to see. Okay. This was supposed to signal something to me. I don't remember. Okay. Ah, it was to signal that we're getting to the point in the talk where I want to pull you all into the conversation. So I want to think about: how do we encourage this sort of innovation? How do we encourage more people to do this sort of social science? How do we encourage more people, software engineers and people from the north side of campus, to get together with social scientists? And how do we discourage some of our institutional conservatism and incrementalism? The way I see it, we don't have that many years to figure out what we're doing here on the planet before we destroy it and ourselves. So it's time to act. There are some good innovation models out there. Well, I guess I had a word here on academic conservatism first. So there's conservatism from two standpoints here in the academy. One is the tenured faculty who, if they want to innovate, if they want to do something new, it requires a lot of courage to put themselves back into beginner status and say, I'm going to start learning how to work with sensors, or I'm going to start working with text, or I'm going to dust off my old Erving Goffman and figure out how I can take all that theory and put it into a formal model. That's not easy.
I don't see many professors who want to show up at a D-Lab class and learn something side by side with graduate students. That's a problem we need to figure out on this campus. If you're early career, this is really hard. I'm in debt. It was not fun to get to this point. I'm really glad I'm here, and there's a lot of exciting stuff going on, but you take on a lot of risk, and people look at you like you're unproven and unworthy of funding, and based on the way we do credentials here, I was. I hope I'm not anymore. And the innovation desert is wider and drier since the economy did not do so well, we'll say. The other problem is, if you're really good at this stuff, you can easily cash out and just go across the bay, and then you can actually have a house maybe someday, and maybe a family with children. Wouldn't that be nice? So there's a lot of conservatism at the university that needs to be overcome, but there are these models of innovation. I think these slides were mixed up. There are these models of innovation to consider. At Harvard, they have the Institute for Quantitative Social Science. It does really cool work, but it's mostly with faculty and getting funding. Harvard's really good at getting faculty together with funding. That's a good idea. We're not so bad at it here. We've got a new place, the Matrix, that might help out with that. It's at the top of Barrows Hall. But we are doing something right, and I should talk about what we're doing right. The D-Lab is fantastic. In many ways, it's even better than IQSS, because it creates a space where graduate students, and also faculty, they should feel very comfortable there, although they don't show up very often, can learn something quickly. You can take a short workshop to learn R, or Python, or text analysis, or visualization, or geospatial data, whatever it is. And it's all free.
It's taught mostly by advanced graduate students, so they're also professionalizing and learning. And we have these really great working groups, like the Computational Text Analysis Working Group, where people can come together, and, this is a model that I really hope catches on, they have a common data set. They have a common set of code they're writing together. They develop that code into tutorials that other people can use, so others can learn and follow in their footsteps. We have our own library, our own teaching curriculum, and we're trying to develop ways to spread credit around the group, citations and so forth. This is something that is happening for text. I know BIDS is trying to make it happen for video, or visual data, or am I saying it wrong? Close enough. And maybe we need to be doing it for machine learning or artificial intelligence on this campus. But we need to find ways, from the graduate students all the way up through the faculty and funding, to have these big projects housed in one group. That's what I would do, if anyone's asking. The other thing everyone should be noting: once you figure out how to do something, we should have a lot more incentives to share it with the public. Use something like nbviewer to put up your Jupyter Notebook. A Jupyter Notebook is amazing, for people who don't know. It's a way that I can write in my nice English prose what I'm doing, and then put in snippets of the code that do what I say I'm doing. And you can run it straight from that notebook. You can play with it, you can change the code, you can see if things work out differently. This is a really powerful tool, developed right here at Berkeley by someone who, in a similar situation, came through this university and tried to make something of nothing, Fernando Perez, and he's made a whole world now. All right, I'm running out of steam.
And this is the last slide. The good news about all of this is that we have everything we need here. We have everything and everyone we need right here at Berkeley to make this alliance. So I hope people will come away from this thinking: how can I get into this alliance of social scientists, software engineers, and crowd workers? How can we pull all of this together to get that big, rich, thick data we need to understand something as complex as our social world? And when we understand that complexity, and not just understand it, because we already have a pretty good understanding, we have pretty good theory, but when we can quantify it, when we can build models on it, when we can say, based on this sequence of events, there's a 90% chance that this is going to happen next, and this is how you can prepare for it, or avoid it, that's how we can design to solve our problems, design the future that we want to live in. That's what I hope people come away with. That's it. Thank you. The ideas are out of my head. Let's come back down to size. Thank you all. We can do questions, right? I'm just gonna repeat the question for the audio. The question is about the bias in these news articles, and bias in general. When you're looking at texts, you're looking at discourse, and there are all kinds of biases in it. I'll speak first to the news articles. There have been quite a few studies of news article bias, and there tends to be not so much descriptive bias, like making stuff up or calling things what they're not. There tend to be biases of omission, where they just won't talk about something. And also, when you look through a news article, one of the things we did: we highlighted the text about the protest event. We didn't also highlight the text about, like, here's someone's opinion about what's going on.
So we just focused on the actions, the setting, the scene, the dramaturgical meat of it. But your point is absolutely right. So we used news articles at first: 8,000 news articles, local, excuse me, local, regional, national; radio, television, and newspaper. The idea is to eventually also bring in independent news and police reports. We wanted to build the machinery first around what people are going to perceive as the neutral parties. So that's where we started. But you're right that we need to bring in more perspectives. We're not going to find things like agents provocateurs in a news article. But we might find them through Twitter feeds and independent media, and maybe even some FOIA-requested police report or something like that. When it comes to discourse in general, there are so many researchers who do conversation analysis and discourse analysis. I trust that they're going to know how to handle that. I don't want to do it all. I want people to discover that, yeah. Other questions? Yeah. Yeah, just like ecology. Okay, well, I'm a sociologist, and so my bias is that sociology is fabulous. We're the queen of the social sciences. What I like about psychology, and I like sociological social psychology, it's actually a thing, sociological social psychology. What I like about psychology is that it's done a pretty good job of, kind of, what is the word I'm looking for? It can take everyone and put them in categories, put them in buckets, based on their personality or something like that. The way that I would do psychology using this stuff is something more like a Bourdieusian kind of habitus: looking at how social structure and social situation affect psychology.
So if you look at people's statuses in groups, a lot of times those things are modeled as, well, you have status in this particular small group because you're the boss and someone else is your employee, or something like that. Or if it's a group of supposed equals, we look at the status, but we don't wonder why the status occurs. What I would suggest, and this is just me spitballing a hypothesis based on my old readings of social psychology six years ago: I think that personalities develop, you know, there's definitely something there when you're a baby, but this develops over and across situations, and people's expectations of how they're gonna be treated, and of how they should treat others, are built and learned over time. In fact, if you look at Cecilia Ridgeway's work, she actually shows a mechanism by which we get to a situation where white men are accorded higher respect by people across the board, not because those people have some kind of implicit racial bias, but because they've learned over time that whenever they go into a group, there's some white male at the top, and over time they're like, oh yeah, well, just dumb animal brain, white males must be pretty good at stuff, because they end up in all these higher positions. So there are a lot of different ways that psychology can learn by embedding itself in the social world more. I think psychology's first project was to create types, and that was fine for what it was, but I would love to see psychology incorporated into something like this theory. Yeah. Okay, the question, for anyone who didn't hear, is: even if I wanna call social engineering pro-social design, there are still people who will use it for less than charitable purposes. Casinos will use it; other people will use it to concretize their power, perhaps.
This is a real concern, but the reason I'm calling for this alliance here is that we are the ones who need to have the control and the power over this. It's out of the bag at this point. I'm gonna publish the papers about what you can do with text. The guy who's doing the thing that converts speech to text, he's gonna do it. It's going to happen. So the question is, are we going to get together and build the sort of world that we want? And let's be very clear about something. Every bit of our lives has been constructed. It's been constructed over centuries, over millennia, and some of what we like, some of what we don't like. And the people who have power to construct the world right now are the people who are sitting atop particular systems, political systems, economic systems. What I'm arguing is that we geeks who know something about how the world works from all of our theory should team up and we should be the ones to, in a bottom-up way, create a better world. Can someone abuse it? Sure, but that's all the more reason for us to get together faster and make more good, people-powered, positive things happen faster before other people can get to it. Is that good? Okay.