Welcome to Dave Rand. It's exciting to have him join us. Dave is a fellow at the Berkman Center, as well as a fellow at Harvard's Program for Evolutionary Dynamics, and he is an FQEB Prize Fellow in the psychology department. He's going to talk to us about experimental social science on the internet and some of the experiments that he's run using Mechanical Turk online. Welcome, Dave.

All right, thanks a lot. So this is supposed to be a discussion, so I'm going to say some stuff, and everyone feel free to jump in as much as you are so inclined. A lot of what I'm going to talk about is in a paper called The Online Laboratory: Conducting Experiments in a Real Labor Market, that I did with John Horton and Richard Zeckhauser, who are both at the Kennedy School here. It's an NBER working paper, and this is the direct link if you want the PDF. These slides are posted; if you just go to my website, it's on there. So more or less, unless I cite it as being from somewhere else, that's where it's from. John is really the guru of Turk experiments around here. So he's awesome.

OK, so I'll just start out with why we want to do experiments in the first place. Here is a nice little cartoon from XKCD, which, if you guys don't know, is basically the best thing on the internet. The setup is: "I used to think correlation implied causation. Then I took a statistics class. Now I don't." "Sounds like the class helped." "Well, maybe." And that's the problem with so much of social science: you see correlations, but you can't really infer causation from them. You can learn things, and often you can do useful things without knowing for sure that there's causation going on. But it's nice to run experiments to be able to really say, look, I manipulated this thing and something changed, and the only thing that's different between these two groups is what I manipulated, so I can really make some causal inference here.

So that's good. But the problem is that you need some particular things in order to do experiments, in addition to good ideas. You need a way to recruit subjects and get them to participate in your experiments. You need money to pay them to do it. You need time to invest in recruiting them, giving them all the materials, and collecting the data. And you need some way of having confidence that what they're reporting is reliable. In social psychology, they use smart tricks to set the experiments up in some way that people can't fool you even if they want to. You can argue about how well that works or not, but that's that approach. In experimental economics, they try to do this with monetary incentives, where you make it so that the amount of money people get paid for participating in the experiment depends on the decisions that they make, so they're incentivized not to mess around. That's the camp that I am mostly from. And then field studies are studies where people don't realize they're in an experiment: you just take some natural situation and manipulate something. They don't know they're in an experiment, so of course, whatever they're doing is what they're actually doing. These are all practical problems posed by trying to run experiments. But the internet can really help with all of these different aspects of experiments.
It can make it really easy to recruit subjects — and really a lot of subjects — without ever getting up out of your chair or doing any work, really, other than posting something out there. And doing online studies is not anything new for psychologists. For years, there have been these websites where you can come and take surveys and say what you would do in hypothetical situations and all of that, although there's still the problem there of generating traffic. If you're Harvard psychology, people will come and check it out out of curiosity, but more generally, anytime you make a website, you need to get traffic somehow. But there are very few economists that have been doing online experiments, because economists really like incentives. You want designs where you can make the amount of money people get paid depend on what they do, and in general, that's hard to do on the internet.

That's where online labor markets come in. There are a bunch of them. They're places where you can recruit people online to do tasks and pay them. They have payment infrastructure set up where you can give payments, you can specify exactly how much, and you can make the payments performance-dependent. And they also give you a really easy way to recruit subjects: there's already a pool of people out there looking for work to be done. I'm particularly going to focus on Amazon Mechanical Turk as my labor market of choice, but there are other ones out there, and there's not any particular reason, I think, that Turk is the be-all and end-all of it. It's just where I've been working.

So, the idea of Mechanical Turk. The original Mechanical Turk was this thing from the 1700s: a chess-playing robot, a pretty good artificial-intelligence sort of thing that could beat good chess players, do the knight's tour, and all kinds of things like that. And it turned out that the way it actually worked is that there was a person hiding inside of the box operating it. That's a good way to get a robot to do things that are hard for robots to do but easy for people to do. And so the idea of Turk is that your experience as what they call a requester — an employer, or as an experimenter — feels like this Mechanical Turk, because it feels like it's just a computer. You upload the task, and then in a couple of days or whatever, you download the results, as if the computer magically did it. But really what's happening is that it's getting farmed out to lots and lots of real people out there who are performing the task. So that's the idea of the Turk metaphor.

Mechanical Turk looks like this. You have workers that come and log into Turk in the interest of completing jobs to make money, and you have requesters that set up accounts because they have work that they want to get done. The basic model is tasks that are easy for people but hard for computers: things like giving relevant labels to images, transcribing text or audio, going and collecting data off of websites, completing surveys, this sort of thing. They're generally short tasks that pay very small amounts of money. They're called human intelligence tasks, or HITs, because they're tasks that exploit human intelligence. You give people a base payment for completing the HIT, assuming that they do it well, and then you can also pay a bonus. And that's where you get into this performance-dependent thing.
I should say also that I don't have any affiliation with Amazon, because I sort of feel like this talk is like an ad for Mechanical Turk. So I don't. But if they would like to have one, feel free to drop me a line.

So the way it works is you open an account, you put money into the account, and you create a description of the task that you want done. In general, the way that we do it is with a redirect link. A worker opens the task and it says, OK, to do this task, click here, go to this other website, do something, and come back. So we redirect them to your favorite survey site, if you want to give a survey, or to whatever piece of software you want to use to collect data from people. Then you generate some kind of confirmation code and send them back to Mechanical Turk with that confirmation code, and you can use it to match up the data from your survey site with the payment information from Turk. You say how much each person is owed, you upload it back, and they get paid accordingly. John Horton and some other people have been developing a nicer integration so you don't have to deal with this confirmation-code thing, but that's still in the works. That's the basic workflow of running an experiment on Turk. It's just like running a pencil-and-paper experiment, except instead of going out on the street and recruiting people, you just send it up into the cloud. Does that make sense? I don't really know what level everyone here is at in terms of knowing about Turk, so maybe this was all repetition, but whatever.

So then you have the question of how much it costs. John and Lydia Chilton had this paper, The Labor Economics of Paid Crowdsourcing, where they ran some experiments, and they found that the median reservation wage for Turk workers — the minimum amount they're willing to work for — was $1.38 an hour, which is not much. And in our working paper, we found that in our task it looked like the reservation wage was more like $0.14 an hour. So there are a lot of people willing to work for not that much money, but the amount that you pay depends on how quickly you want workers. If you pay $0.05 for a task, you can get hundreds of workers, but you're going to have to wait a while — a few weeks or something like that — for them to accrue.

So you're waiting for the recruitment, not for the completion of the work?

Yeah, right. In general, you can specify the maximum amount of time that they're allowed to have from when they start to when they finish. The default is like an hour. And so we did an experiment where we were paying $0.40. I usually do between $0.20 and $0.40 flat rate and then a bonus of up to about $1, depending on what happens. The last time we did one of these — $0.40 flat rate, dollar bonus — we got 1,500 people in two and a half days or something like that, without getting up out of the chair, just sitting there. In fact, I'm collecting subjects as we speak on several studies.
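Going back to that confirmation-code workflow for a second, here is a minimal sketch of the matching step in Python. The file names and column names are hypothetical, not anything from the talk; the one real convention is that Turk's downloadable batch results include each worker's ID and their form answers.

    import csv

    # Survey-side export: one row per respondent, keyed by the confirmation
    # code the survey displayed at the end (names here are made up).
    with open("survey_results.csv") as f:
        survey = {row["confirmation_code"].strip(): row
                  for row in csv.DictReader(f)}

    # Turk-side batch results: one row per submitted assignment, with the
    # code the worker pasted back into the HIT.
    with open("turk_results.csv") as f:
        for row in csv.DictReader(f):
            code = row["Answer.confirmation_code"].strip()
            if code in survey:
                print(row["WorkerId"], "matched: approve and compute bonus")
            else:
                print(row["WorkerId"], "no matching survey record: review or reject")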
One classic question for this: clearly you've got some large selection bias in the set of people who choose to sign up for this — where they are located geographically, the type of people they are, their level of education, blah, blah, blah — all things that I would imagine would be important to some extent to control for in an experiment. And also, I'm sure you're going to come to it, but a whole bunch of psychology experiments involve physical stuff. I mean, this doesn't involve that — it's some survey to take or some questions to answer, is that right?

Yeah. In terms of all of the bias stuff, I'll come to it. Basically, the format of this talk is that it's really easy to explain why the internet is awesome for doing this stuff, and then most of the talk is trying to defend why it's reasonable, if you see what I mean. There are certain physical limitations: if you want to see whether holding a hot cup of coffee makes someone behave differently, that's not going to happen on Turk. So you're restricted to things that you can do on the internet. But there are a lot of interesting things in that category.

So, to get to this question of demographics, here's a little bit about who they are. All of the data that I'm going to show — with one exception — is not from the paper with John and Richard, but from a recent study that I ran with about 800 people. This is the frequency distribution of age. You can see the modal age is around 30, but you get a lot of people out into their 40s and 50s, and even some people over 60, although not many. For this particular sample, in terms of where they're from, we had about 35% from the US, around 45% from India, and the rest from everywhere else. And this is because it's all in English: the places where you have lots of reasonably good English speakers are the US and India, and $1.38 an hour is more meaningful in India than in the US.

This is self-reported?

Yeah — sorry, that's another point: all of the demographic stuff that I'm going to show is self-report. You can get geographic location from IP addresses, but yes, there's noise, and that's an issue with Turk in general. You get a lot of noise, and the idea is that you compensate for the higher noise with order-of-magnitude bigger sample sizes.

Yes, right. So there are a couple of different things — they introduced this relatively recently — where you can put in some qualifications at the beginning. You can put in criteria that workers have to be from certain countries, you can put in a criterion on age — well, on being over 18, specifically — and you can put in a requirement on what percentage of a worker's previous jobs have been accepted, because after they submit the job, you can either accept it or reject it. The default is that you only give work to people that have had 95% of their previous tasks accepted.

The way payment works — Amazon does all of that; that's the beautiful part of it. You put the money into your account, and workers sign up for an account where they give Amazon some kind of account information, like a bank account number, and then all you do is tell Amazon, transfer this amount of money from my account to this worker ID's account, and they do it. And it's totally anonymous. You don't know anything about their personal information except what they tell you, but Amazon does, so there's that wall of anonymity, which is good for various reasons.
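For flavor, posting a HIT with qualification criteria like the ones he just listed looks roughly like this using boto, the Python AWS library of that era. The URL, title, and numbers are placeholders, and the exact keyword arguments are worth checking against boto's Mechanical Turk documentation — this is a sketch, not his actual setup.

    from boto.mturk.connection import MTurkConnection
    from boto.mturk.price import Price
    from boto.mturk.question import ExternalQuestion
    from boto.mturk.qualification import (Qualifications,
                                          LocaleRequirement,
                                          PercentAssignmentsApprovedRequirement)

    conn = MTurkConnection(aws_access_key_id="...",
                           aws_secret_access_key="...",
                           host="mechanicalturk.sandbox.amazonaws.com")  # sandbox for testing

    # The qualification criteria discussed above: US-only workers with a
    # 95% prior-approval rate (the default threshold he mentions).
    quals = Qualifications()
    quals.add(LocaleRequirement("EqualTo", "US"))
    quals.add(PercentAssignmentsApprovedRequirement("GreaterThanOrEqualTo", 95))

    # An "external question" shows your own survey site in a frame,
    # which is one way to implement the redirect described earlier.
    question = ExternalQuestion(external_url="https://example.com/my-survey",
                                frame_height=600)

    conn.create_hit(question=question,
                    title="Short decision-making survey",       # placeholder
                    description="Answer a few questions; bonus depends on decisions.",
                    keywords=["survey", "experiment"],
                    max_assignments=500,
                    reward=Price(0.40),        # the flat base payment
                    duration=60 * 60,          # one hour to finish, the default he cites
                    lifetime=3 * 24 * 60 * 60) # keep the HIT up for a few days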
Why does Amazon do this? They take a 10% commission on everything that you pay, which still is not that much money. And I heard that the reason Amazon originally created Mechanical Turk is that they had a ton of internal tasks that were human intelligence tasks — like labeling CDs for Amazon.com, that kind of thing — so they originally set it up as a way to get work that they needed done, and then eventually they opened it up for other people to be able to post their own jobs.

[Inaudible question.]

That's a really good question. I don't know — it would be easy to find out, so yeah, that's a great question; we should do it.

You'll get to this, but when you're getting the results in, I know it gives you the option to accept or reject each submission. Do you go through each one and check, or do you just approve them all?

We do everything in an automated way. You get out from your survey site an Excel sheet, say, with all of the information that everyone entered and their confirmation codes, and you get from Turk an Excel sheet with each person's confirmation code and their actual worker ID, and you use that to pay them. Based on whatever criteria we want, we determine for each person whether they should be accepted or rejected, and how much bonus they should get if their work is accepted, and then you just upload that en masse back to Mechanical Turk, and it does everything.

So you do review each result?

Yeah, yeah — but not by hand, because that takes forever if you have 1,500 people.

We're using different voting algorithms for different kinds of coverage, getting several different people to take on a question to see what the average is. Do you have to filter out the noise in different ways for different experiments?

So I'm not sure exactly that I understood your question, so let me say several different things and see if any of them work. One thing is that when you post a HIT, every worker is only allowed to complete that HIT once, so one thing that you don't have to worry about is the same person completing the same thing multiple times. And if you have different treatments, what we usually do is post one HIT and give them a link, and that link is a redirect that sends them to one of a bunch of different treatments in an automatic way: each time someone comes in, it sends them to the next one in the sequence. And Amazon goes to pretty great lengths to minimize multiple identities. It's not completely impossible for one person to have multiple identities, but they try to make it hard, because for their own purposes, they want a stable worker reputation system, so that employers feel confident that workers can't scam them. And whether you're collecting data or more generally — and I'll say this in more detail a little later — a very important thing is to make sure that people are paying attention and that people get it, because lots of people don't. So an important part of any of these studies is to build in comprehension questions and attention-checking questions, to screen out the people that are not on board.
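The en-masse accept/reject-and-bonus step he describes can then be a short loop over your decisions. Here is a sketch, again with boto; the decision list, amounts, and reject reason are placeholders for whatever criteria a particular experiment uses.

    from boto.mturk.connection import MTurkConnection
    from boto.mturk.price import Price

    conn = MTurkConnection(aws_access_key_id="...",
                           aws_secret_access_key="...")

    # One entry per assignment, built from the matched survey data.
    # The accept flags and bonus amounts here are purely illustrative.
    decisions = [
        {"assignment_id": "A1...", "worker_id": "W1...", "accept": True,  "bonus": 0.80},
        {"assignment_id": "A2...", "worker_id": "W2...", "accept": False, "bonus": 0.00},
    ]

    for d in decisions:
        if d["accept"]:
            conn.approve_assignment(d["assignment_id"])
            if d["bonus"] > 0:
                # The performance-dependent part: pay the bonus on top
                # of the flat base payment.
                conn.grant_bonus(d["worker_id"], d["assignment_id"],
                                 Price(d["bonus"]), reason="Experiment bonus")
        else:
            conn.reject_assignment(d["assignment_id"],
                                   feedback="Failed comprehension check")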
So you might ask, what are the motivations of people doing jobs on Mechanical Turk? This is a figure that John put together for our paper. In it, he gave them three different options to best classify why they do jobs: one is to make money, one is to learn new skills, and one is to have fun. It's broken down by location, so the width is the fraction of people from the different locations and the height is the fraction of people with the different motivations. And what's, from my point of view, awesome about this plot is that both in India and in the US, exactly the same fraction of people — which is almost all of the people — are mostly motivated by the desire to make money. Which means, from an incentive point of view, that's good: you want people that take the incentives seriously. So even though these are small amounts of money, people are doing it to make money. And in preparation for this, I launched a HIT yesterday that was just: why do you do jobs on Mechanical Turk? People were saying things like, I'm a college student and a single mother, and even though it's not much money, it's a way to supplement my income. Or people would say that it's a way to efficiently harness downtime, or that it's better than watching TV — I'm not making that much money, but better to do this than to do nothing at all.

One quick question: your average reservation wages must be based on some assumption about the rate at which people complete tasks, right? You just gave the example of someone who's a college student. Is there a possibility that some people perform a task ten times faster than the average or something like that? So someone whose reservation wage you estimate at less than $1.40 an hour is actually making $14 an hour for some people. Is that a possibility or not? Do you know how long it takes?

Well, you do know how long it takes, so...

Can people do multiple tasks at once, for example? Can they start multiple HITs and just be running them in different windows and so on?

That's a pretty good question. I don't know if you can be running multiple HITs at once, but we definitely keep track of how long each person spends on each screen in any of the surveys that they're doing. Although you don't know things like, are they watching TV, you know?

Yeah, that's true. Because this seems an extraordinarily low wage — even for people who are doing it in downtime in the US — to be taking $1.40 an hour. It must imply that something weird is going on.

Yeah, so another thing that John found, for example: he did a thing where he would progressively decrease the amount that people were getting paid each time they did an additional task, and it wasn't simply that the less you paid them, the less they worked. Instead, they had goals, like, I want to work until I make a dollar and then I'm going to stop, or, I'm going to work until I make a dollar fifty. And even when that sort of stopped making sense, because the change in how much money they were making was non-monotonic, they would still go for those kinds of targets. And also, some of the tasks are not that bad. If I run a task that takes you maybe four minutes and you get $1.20 or something like that, when you think about the actual hourly wage that comes out of that, it's not terrible.

No, but then it doesn't square with the reservation wage.

Yeah — so I'm saying the reservation wage calculations are coarse, in that between those two different tasks I mentioned, the estimates were an order of magnitude apart. So it's got to depend a lot on the details.
And also, I think that there are some tasks that people find more intrinsically enjoyable than others, so the reservation wage is certainly going to be mediated by the type of task. And I think people — at least the people that do these surveys — tend to find the social science experiment kinds of things much more interesting than transcribing text or something like that. I guess it's because I come from an experimental economics tradition, but I like it because it means that you have a way of controlling the size of the incentives. I like it because if that weren't true, experimental economists wouldn't take my experiments seriously.

Or more to the point, if you have a lot of people using Mechanical Turk and telling you that they're doing it for the lulz, it's possible that in an experimental context they're going to give you absurd or outrageous answers and you can't believe them. Going in and actually getting the evidence that most people are doing this for money — and therefore have a real disincentive against screwing around — addresses the accuracy question. The big thing that people seem to have discovered with this is that if you don't ask questions and penalize people, you just get people who sort of sit there and click. But if you start asking these accuracy questions, then you force people to take it fairly seriously. And the weird thing is this notion that they'll take it seriously even though it's a surprisingly low amount of money. If they're just having fun with it, there's very little guarantee that they're going to take it seriously, which makes it much harder to do.

Right, and that's what I mean by controlling the incentives: if they care about money, then you can really ask what's going on.

Aaron Koblin, who's an artist who does a lot of work on Mechanical Turk, did an experiment where he took a $100 bill, divided it into — I guess it was 10,000 sections — and then had each person draw just a little piece of the bill, and then reassembled the image. And the graph was fascinating on this point, which is that lots and lots of people logged in from the US and did exactly one HIT, and they were doing it for fun, whereas most of the dollar bill was created by people in Egypt and China who were logging in over and over again, doing it as a job. So he actually ended up with a histogram, not just of people doing it for this reason or that reason, but where by country he could graph what the different motivations were. And exactly as Ethan said, the people who were logging in just to have fun were the people who weren't doing the task — they were adding graffiti or what have you, too.

Right, right. I think again it depends on the type of task. If you're having someone fill out a survey, there's not much fun in that. I haven't seen much of that, and I think it's because of the type of task.

Is there a breakdown of the types of tasks and their popularity — like how much of the Turk tasks are just surveys, one survey after another, and how much of them are image labeling or other types of productive tasks?

It's a really good question. There's this guy at NYU Stern, Panos, who has collected basically every HIT ever posted, forever. It's this ridiculously large data set, and I think he's done some interesting things with it.
So I don't know the breakdown, but because every HIT that someone posts is publicly visible if you're a worker, you can collect data on all of it and find out about that sort of thing.

I guess I have two questions, related to each other. One is, do you have any rough estimate as to — you mentioned attention-checking questions — how much of the data you collect you have to throw away because people weren't paying attention?

A lot. So there are a couple of different kinds of questions. One thing that we do is put this question at the end: there's a big box that says "comments," and above it there's a chunk of text that says, we like to know different things about our participants; in particular, we like to know if you're paying attention, so if you actually read this, leave the difficulty indicator blank and in the comment box just write "I read the instructions." Something like maybe 30% of people fail that. And there's a really hilarious thing, which is that people are really bad at writing "I read the instructions" even when they did read the instructions — people write "I read the directions," "I have read the instructions," and so on. That's how I started screening for attention, but now mostly what I do is explain the payoff structure of the game that I'm going to run, ask some comprehension questions about the payoff structure, and then throw out the people that get the comprehension questions wrong, because that suggests they don't understand the game. And there it depends on how good my instructions are, basically. For simpler games, again, maybe 25 to 30% of people fail; for slightly more complicated games, or slightly less clear instructions, it's closer to half the people. But the thing is, you don't have to pay those people if you don't want to — you just reject those submissions.

Just on the comprehension rates — that reminds me of these little tricks from when I was a student: you'd write something in the middle of an answer to see if the professor was actually reading the exam. Or, on the flip side, I have my big syllabus, and students would keep asking me questions that are answered in the syllabus, so I made an FAQ, and they still asked me the questions. So then I gave them a little quiz in the second week about the content of the syllabus, including "how much is this quiz worth" — it wasn't worth anything — and 30% would have actually been a really good error rate.

So again, all of this is relative: what are comprehension rates if you do this in the real lab? It's a good question. I don't really know.
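The screening itself is trivial once the comprehension answers are in your data file. A sketch, with hypothetical file and column names:

    import pandas as pd

    df = pd.read_csv("experiment_data.csv")  # hypothetical merged data file

    # comp1_correct / comp2_correct code whether each payoff-structure
    # comprehension question was answered correctly; keep only people
    # who got both right, as described above.
    passed = df[(df["comp1_correct"] == 1) & (df["comp2_correct"] == 1)]

    print("kept %d of %d participants (%.0f%% failed comprehension)"
          % (len(passed), len(df), 100 * (1 - len(passed) / len(df))))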
Also, a lot of the experiments that I run in the lab here are repeated games rather than one-shot games, and I should say that that's a major limitation of Mechanical Turk right now. It's really easy to do one-shot experiments, but it's much harder to do experiments where you need feedback and repeated interactions between people, because (a) you need to get them there at the same time, and (b) you need some kind of more sophisticated piece of software for matching them up. People can do that: Siddharth Suri and Duncan Watts had this paper where they built a particular piece of software to run a repeated public goods game, and they did it, and it was great, and everything worked — but it's more intense. But also, in those repeated games there's less of a problem, because people will often learn what's going on if they do it multiple times, so they have to be paying some baseline amount of attention.

So let me try to get through a little bit more of this to give you guys a flavor of what's going on. This is education level, US versus non-US. You see most people, both in the US and non-US, have a BA, and then from the non-US you have up to 30% of people that at least self-report having a graduate degree, versus around 10% in the US. The difference between the US and the non-US is basically that all of these people shift from having attended but not graduated college in the US to having graduate degrees outside of the US.

Is that frequency, like a histogram?

Yeah — each one sums to one; the yellow sums to one and the blue sums to one.

It's not much of an argument for getting a bachelor's degree. This is self-reported, right? Do you think there might be bias, since this is pretty counterintuitive?

I mean, it's not that counterintuitive to me, because the more education you have, the more likely you are to be on Mechanical Turk. They have to know about Turk and have regular access to it, and there's a huge bias in that: it's going to skew you toward people who are comparatively wealthy and who have a bachelor's degree.

So, income. Again, this is self-report, so you take it with some grains of salt, but the way this is set up, it's people making less than $15,000 a year, then 15 to 25, 25 to 35, 35 to 50, 50 to 65, et cetera. Most people from the non-US are making less than $15,000 a year, which isn't really a very informative thing, because you'd want it normalized by cost of living. But within the US, still, the most common is people that are not making very much money, although there's a reasonable spread up into people making significantly more. And I think on a lot of these different dimensions, the Turk workers are much more representative than college undergrads, who would basically all be in that one bin.

That's a good question. I don't really know. From some anecdotal evidence — just asking people why they do the work — some people have said that they do Turk jobs while they're at work.

You'd see it in the time series by day: a bunch of tapping in the evening — little short bits, and then in bursts.

Yeah, yeah. There's a site, it's like MturkTracker.com or something. He's basically turned it into kind of an index of the entire market on jobs and payments, indexed over time, and you can make lovely graphs.

Yep. And another thing about the workflow of a job when you post it on Turk is that you get many more takers when you first post it, because it's at the top of the list, and then it floats down. And so that's a reason that it's really important, when you launch a HIT, to have planned ahead of time all of the different treatments that you want to run, so you have random assignment across treatments through the whole duration of the HIT. If you add some extra treatments halfway through, you're going to have a lot of non-random assortment, because the people who accept the HIT later in its lifetime are presumably different from the people who accept it early.
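That next-one-in-the-sequence redirect can literally be a counter behind a URL. A minimal sketch using Flask — the treatment URLs are placeholders, and a production version would need to persist the counter across restarts and concurrent requests:

    from itertools import cycle
    from flask import Flask, redirect

    app = Flask(__name__)

    # Round-robin over all treatments for the whole life of the HIT, so
    # late arrivals are spread across conditions the same way early
    # arrivals are.
    treatments = cycle([
        "https://example.com/survey?cond=neutral",
        "https://example.com/survey?cond=religious",
    ])

    @app.route("/assign")
    def assign():
        # Send each incoming worker to the next treatment in the sequence.
        return redirect(next(treatments))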
So I ran a HIT yesterday asking people, what was the favorite HIT that you ever did on Turk? These are some things that people really like: a task where you look at different advertisements and classify them along some dimension; writing a paragraph about a scary situation you have been in — I think that might have been one of my studies, I'm not sure; typing phrases into a search engine and recording the number of hits that come up; playing a cooperative game with other workers where you had to explore a map and work together. There was a subtheme, actually, of people saying that their favorite HITs were the ones that pay a fair, not third-world, hourly wage. And one guy said that he was rating different transcriptions and he loved it because the work was so interesting and stimulating. So there is a heterogeneity of motivations. This gives you just a little bit of flavor of what people are up to there.

So this is sort of obvious at this point, I think, but what's great about it: it's fast, it's cheap, and it doesn't take much of your time. It's incentive-compatible, although with small incentives, which I'll talk about in a minute. It gives you a really easy way of getting cross-cultural data — although, again, with some strange bias, because the people that do jobs on Turk in the US are probably a different slice of the population than the people that do jobs on Turk in India — but it gives you some way of getting people from a lot of different countries without getting up out of your chair. And it also has great potential for field studies where people don't know that they're in experiments. This is what John Horton calls the experimenter-as-employer paradigm. The idea of field studies is that instead of saying, hey, you're in this experiment, do this thing, you set up some natural task that people don't think of as an experiment, and then you make different manipulations. Because most of what goes on on Turk is people doing actual work, you can post things that just look like regular jobs but manipulate all different kinds of things. I'll talk a little bit about a couple of cool field studies in a minute.

So there's the complaint: isn't it a biased sample? My first response is, what about college undergrads, who are what 99% of social science research is done on? So that can't be a non-starter, because that's the state of play. But it is an issue, obviously. I think with the Turk people you have much more age variation, SES variation, education variation, and geographic variation than you get with undergrads in the lab. They're a bit less WEIRD — Joe Henrich is this anthropologist who has this acronym: Western, educated, I forget what the I is, rich, democratic countries — that's where all the studies are done, and they're not representative.

Industrialized, isn't it?

Yeah, that sounds right. And so I think those are good things.
And then another point is that if you want to estimate a level — like if you want to say what the level of generosity in people is — then it's important that you have a representative sample, because otherwise you're not going to have a representative estimate. But if what you're looking at is comparing the changes between different treatments, it's less of a problem. It's not no problem, because you could have interaction effects where different treatments have different effects on people depending on who they are. But you're less worried about having a nationally representative sample or whatever.

Something else that I should point out, though, is that what you give up to some extent when you're running things on Turk is control over exactly what people are doing. When they're in the lab, you can look at them. OK, I've occasionally had some person that had figured out how to get Internet Explorer open and was watching music videos while he was running the experiment, but that doesn't happen very often, and you have much more control. And here, like you were saying, they could be doing five different things at once; they could be watching TV; they could be doing whatever. So that's an issue. And this is where we rely on some replication studies to try to show that in a lot of domains, they look pretty good.

The second thing is, isn't there a bias where the Indians prefer to do one type of task and the Americans prefer another, or, for instance, males prefer one type? Because if that's the case, then the advantage of having this more heterogeneous population is undermined.

I'm sure that there are some preferences like that, but I think they're relatively weak, as evidenced by the fact that across all the different things I've run, I always get about 50-50 gender and about 50-50 US-Indian. So I'm sure that there's some of that going on, but it's certainly not to the extent that it makes the sample as unrepresentative as undergrads. And I love undergrads, because I do most of my studies with them — so I'm not Joe Henrich here — I'm just saying that I don't think that that's a problem with Turk.

So, okay, another issue that economists in particular raise is: aren't the stakes too low to be meaningful? What's a dollar stake? Who cares about that? This is some work I did with Ofra Amir and Kobi Gal that we haven't published yet, where we compared dollar stakes with no stakes. In the dollar-stakes condition, you get a $0.40 show-up fee, and then you have a dollar — this is a dictator game — to split between you and some other random worker. And in the no-stakes case, you get $0.40 for accepting the HIT, and then you have 100 points that you can assign between you and some other guy, and the points aren't worth anything.

What about people logging on and doing similar kinds of work, but for nonprofits — where the framing of the work is a different kind of thing? I wonder if you see differences in motivation or in the kind of participants.

Yeah, it's a great question. I was interviewed for some story about a couple of those sites on Canadian broadcasting maybe a month or two ago, and I got an email from one of the people saying, hey, if you want to do studies, we should talk. So I'm exploring that.
Okay, so if we want to know how the behavior in this split-the-dollar game looks compared to other studies: this is from a meta-analysis published by Christoph Engel this year, where he has 616 different dictator game experiments run in the lab, and this is a histogram of the average donation amount as a fraction of the stake across all of these studies. What you see is that people are on average giving 35%, and there are some studies way down here in the tail, but this is where most of them are.

And this is in the meaningless — the no-stakes — condition?

No, this is neither. These are not mine; these are 616 studies that other people have run in real labs.

And they're giving 35%, or taking?

Giving 35% and keeping 65, or whatever. So then I want to ask, how do my no-stakes and dollar-stakes conditions compare to this? With no stakes, the average donation was 45%, which — I mean, it's not totally off the end, but it's definitely out on the high end, and something that would make me a bit nervous. And then I ask, okay, what about the dollar stakes? And it's exactly in the middle of everything that you would see. So, at least to me, this seems like pretty much a home run for dollar stakes in the dictator game. With no stakes, it's what they call cheap talk. The reason you want stakes in a game like this is that if there's no money, it's easy to say, oh yeah, I'll give the units away, whatever, because it doesn't cost you anything, whereas if it's real money, then you have to really have a preference for something to give it away.

Yeah, so I think basically what he found was mild evidence for an effect of stakes, but not much. There have been a few studies that specifically look at stakes, like $10 versus $100, or that go to a third-world country where the stake is three months' wages, and things like that. And they find it makes a little bit of difference, but not much.

So then we did another one. This is the trust game. It's a slightly more complicated game, because in the dictator game there's no real motivation — it's a really simple game; it's not even a game in the technical sense, because your payoff is just whatever you want it to be. In the trust game you have two players. Player one has some money — in our case, it was 40 cents — and chooses how much to give to player two. Anything that he transfers to player two gets tripled, and then player two chooses how much to give back. So this is one of these social dilemmas: a rational, self-interested player two will always keep everything, and so a rational, self-interested player one will never give anything, and then you miss out on the great benefits that could come from trusting each other. But people don't really behave that way.

Again, there's a meta-analysis, from Johnson and Mislin in 2010, where they aggregate 143 trust studies from real labs, and this is the distribution of the fraction sent by player one. Here's our case where they're just playing for units, not real money: 0.58, which is right in the middle of where it should be. And if you ask what happens if you do dollar stakes, it's exactly the same.
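As an aside, the payoff structure of that trust game is simple enough to write down. Here is a sketch using the talk's numbers — 40-cent endowment, tripled transfer — and ignoring show-up fees; the specific send and return values in the example are purely illustrative.

    def trust_game_payoffs(endowment, sent, return_fraction):
        # Player 1 sends some amount; it triples in transit; player 2
        # then returns a fraction of the tripled amount.
        tripled = 3 * sent
        returned = return_fraction * tripled
        player1 = endowment - sent + returned
        player2 = tripled - returned
        return player1, player2

    # Player 1 sends half of the $0.40, player 2 returns 45% (roughly
    # the average return fraction reported later in the talk):
    print(trust_game_payoffs(0.40, 0.20, 0.45))  # roughly (0.47, 0.33)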
So it's pretty interesting that stakes make a difference in the dictator game, which is this motivationally extremely barren, simple experiment — and in general it's been shown to be really easy to push things around in the dictator game in all different kinds of ways — while in this even slightly more motivationally complex game, whether there are stakes or not, you get results exactly consistent with what happens with stakes in the lab.

Obviously the question is, you're not showing us any of the distributions, right? You're just showing us the means. Do the distributions look similar as well?

Yeah, that's true — I don't have that plot. You definitely see in the dictator game that the main change, which is totally consistent with what happens in the lab, is that when you go from zero to a dollar, you get a big increase in the people that give nothing: you see that shift of the mode from fifty-percent-ish to zero. For the trust game, it doesn't make any difference at all — not even in the distribution. That matches another result from the mid-90s, where they did the dictator game and the ultimatum game in the lab with $10 or no stakes, and they found that in the ultimatum game it didn't make any difference in the averages and increased the variance a little bit; we didn't even find any significant effect on the variance. And it's a similar thing if you look at the amount returned by player two: without stakes it's 0.45, which is right where it should be, and with the dollar stakes it's 0.47 — so no difference, and right in line with what's going on in the lab.

So does that say that it's not actually a key result that stakes matter?

I don't know. I think it's saying that in a lot of these social dilemma games, it's unclear how much stakes really matter. There's a lot of psychological investment amongst experimental economists, including myself, in the importance of stakes. But all these experiments with stakes, and my thing with a dollar or with nothing, are all looking about the same.

So you don't think a dollar is just too low?

That's the question: what would happen if you went to $10? I'm not saying that if you go to $10 or $100 it won't affect anything. What I am saying is that zero stakes or dollar stakes aren't too low in the sense that they match what's going on in the lab with $10 stakes. So maybe $10 stakes are too low in the lab, but that seems unlikely given these cross-cultural studies where they go and use three months' wages as the stakes.

So now some more replications. We replicated the Kahneman-Tversky Asian disease framing result, where framing the same outcomes as gains or losses causes a preference reversal. This is a classic setup where two different groups of people are presented with one or the other version. There's going to be some flu outbreak, and you have two different treatment options: with option A, 200 people are saved, whereas with option B there's a one-third chance that 600 are saved and a two-thirds chance that zero are saved. One set of people gets asked, which do you prefer, A or B? The other half get something that is numerically exactly identical, but phrased in terms of deaths instead of saves: 400 are dead, versus a one-third chance that zero are dead and a two-thirds chance that 600 are dead. And you get a big preference reversal: in our data, 69% of people choose option A in the gain frame, and only 40% choose option A in the loss frame. So the way you frame things — the language — matters.

And how does that compare to the typical finding?

It's super statistically significant, and it's on the same order of magnitude as what they originally had. And I know at least three or four other people who have replicated this — it's the first thing anybody ever thought of to replicate, because it's what everybody loves the most.
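For a sense of what "super statistically significant" means at Turk sample sizes, here is the standard chi-square check on that 69% versus 40% reversal. The counts assume a hypothetical 400 people per frame, since the exact ns aren't given in the talk.

    from scipy.stats import chi2_contingency

    # Rows: gain frame, loss frame; columns: chose A, chose B.
    # 69% vs. 40% choosing option A, assuming ~400 people per frame.
    table = [[276, 124],
             [160, 240]]

    chi2, p, dof, expected = chi2_contingency(table)
    print(chi2, p)  # chi-square around 67; p astronomically small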
Another one is priming: giving information that doesn't have anything to do with the monetary payoffs — with what actually matters — will change people's behavior. Here we did an experiment where we had people either read a religious passage — Jesus talking about the importance of charitable giving, the how-hard-it-is-for-rich-men-to-enter-the-kingdom-of-heaven sort of thing — or a passage describing three kinds of fish. And then we had them play a one-shot anonymous prisoner's dilemma. We find that the prime works, but it depends on who you are. Here you have people that self-report not believing in God versus people that report believing in God, and white is the neutral prime and gray is the religious prime. Amongst the people that believe, getting the religious prime almost doubles the amount of cooperation, but amongst the people that don't believe in God, you get a substantial decrease in the amount of cooperation. Which I think is pretty fun. So priming works.

Then, wages and labor supply. We did an experiment where we first have people transcribe a paragraph and get paid 30 cents, and then we give them the chance to do a second task for some amount of money that varies across subjects. Here we show the amount that we offer for the second task against the fraction of people that accept it. At 25 cents, basically everyone takes it; at 15 cents, basically everyone takes it; at five cents, still a lot of people, but significantly fewer; and at one cent, there are still half the people taking it. This is the one where John calculated the reservation wage of 14 cents per hour, based on the decisions going on here.

To what extent are these tasks field experiments, and to what extent are they overt experiments?

So the first two that I showed are overt experiments, and this one is a field experiment.

Understanding that economics and psychology differ — how do you think the increased ease of collecting data changes the nature of the profession? It sounds like you can do thousands of these in a year. Does it diminish the value of research when you can just pump out studies like this, or quite the opposite — is it like sequencing DNA, and we'll accumulate tons of knowledge?

Yeah, so I think my baseline feeling is that it's totally awesome. It's a democratization, an opening of the door: anybody that has an idea — you don't need money, you don't need a lab — can just run experiments and see what happens. But the flip side of that is that you also have an amazing potential for file-drawer problems, the idea being that positive results get published and negative results go in the file drawer.
A p-value of 0.05 means you accept as statistical proof something that has a five-in-a-hundred chance of having just happened randomly. But that means if you take some study where nothing at all is going on and you run it a thousand times, a lot of those runs are going to come back with a positive association just due to random chance. So you get into this really serious problem — and it's still a problem in the real laboratory; it's just way, way easier here. This is something that some kind of standards need to get set up for. And, fundamentally, all scientific publishing is based on trust, because — whatever — you don't need to run it a thousand times; you can just make up the data to be what you want it to be. So there's that. But I could imagine something where journals would require you to report the number of times that you ran the experiment. It's still based on trust, but it adds another layer.

So psychologists are going to have to publish four times as much?

That's a pretty interesting question. Honestly, I don't think so, because in my personal experience of the workflow of being an academic, the thing that takes time is writing the paper, not doing the experiment, you know? Once we get the Turkers to write our papers for us, then we're really in business — the two-dollar HIT that I put up for experiment design has gotten lots of responses, actually. But no, I don't think so, because that's not the major stumbling block. I think it just makes things so much nicer. I had a paper that was a theoretical paper, and I submitted it and got back reviews where the reviewers sort of didn't like it, because it felt like the theory wasn't connected to empirical things, and there was an obvious experiment to do. I got the reviews back on a Thursday, and on Friday I said to myself, okay, let me design an experiment on Turk to address this. I designed it, launched it on Saturday, and by Monday I had 600 observations. So I think, in a good way, it increases the speed of the research cycle.

So, some other questions — yep. I think you mentioned this, but is it so anonymous that, even going back through Amazon, you cannot recontact people?

Right — so you can contact them; you can send them messages via their worker ID. But you don't know who they are in real life. It's not like you have their email address.

So there is some possibility of doing follow-up?

Yeah, you can do follow-ups, right. It's anonymous in the IRB sense — you can't go to their house and bother them or something like that — but it has follow-up capability.

You said that Turk is not good for studies that require feedback. I'm not sure what you meant by that.

Yeah, so you could do a chess-by-mail sort of thing, where someone makes a decision, someone else makes a decision, and then you send them a message saying, this has happened, what do you want to do? But that's very different from the kind of repeated-game experiments that we're used to running in the lab, where it's instantaneous back and forth. It's the difference between chess by mail and the Internet Chess Club, where you're just playing with each other in real time.
Can you take the group of people that you used for experiment A and use the same group for experiment B?

You can send them messages and say, hey, remember when you did that great experiment A for me? Well, we have experiment B now, and we'd love you to participate in it. But I'm not wildly optimistic about the return percentage, based on some experiments that I've done.

I'll skip some of this stuff — I just want to show a couple of cool field studies that people did. John has this awesome paper called Employer Expectations, Peer Effects and Productivity. What he does, among other things, is he has subjects come in, they get an image, and they give some labels to this image. Then they get shown work that another worker did labeling the same image, and they get to choose whether that other worker's work should be accepted or rejected; they also have a dollar bonus that they get to choose how to split between the two of them. And what I love about it is that this is a totally contextualized dictator game. There's always this complaint about altruism in these lab experiments — that it's a totally artificial setting, it's manna from heaven, they just give you money, so it's easy to give away, or whatever. And here we have a setting where they don't know they're in an experiment: they're doing a task, they're earning this money, the other person is earning the money also, and then they get to choose how to split it. And they will oftentimes give a lot of money to the other person, and it depends on the quality of the other person's work. They'll be more selfish toward people that do low-quality work — but only if they themselves did high-quality work. So it's like high-quality workers punish low-quality workers. And because you can do a million experiments, you can rule out that high-quality people are just more intrinsically inclined to punish or something like that, because he has this other treatment where, while people are working, he has some pop-up thing keep popping up and distracting them, so it decreases the quality of their work — he can exogenously force people to do low-quality work — and he finds that in that setting, they don't punish low-quality work by others. So this is a really nice, totally externally valid, contextualized game theory experiment, basically.

And another thing with a real-world framing — which gets a little bit at the microvolunteering question, I think — is that Dana Chandler has this paper where they give subjects medical images and they're supposed to find and circle things in the picture, point them out or whatever. Half of them are told that this is helping cure cancer, and the other half aren't told anything: just do it, and we pay you for it.
In other words, it's either "this is helping cure cancer and we'll pay you for it" or just "do this and we'll pay you for it." And they found that US subjects were more likely to accept the HIT when they were given this meaningful frame for it to sit in, although for the Indians it didn't have any effect either way. That could be a case where the monetary incentive looms larger for the Indians, although our stakes study doesn't really find that. I should have said: in the study where we compared dollar stakes with no stakes, there was no difference for the Americans — that is, the Americans are always more generous than the Indians in those games, but whether it was dollar stakes or no stakes didn't affect that at all. So it seems like stakes are not a compelling explanation for that difference.

Maybe you can get data on which workers looked at it but didn't take it?

Yeah, because you can see the number of people that — well, I don't know how they did it, but the way that I do it is, when people click on a link, it says, here's this task, what do you want to do? Then you click on it, and it takes you out to an external site, and the first page just says, we have 20 questions or whatever, click OK. They click OK, the next page explains the instructions, and then they can go on. So you can see how many people get to that first page and then just quit. Like I said, I don't know the details of how they did this one.

So then, just a little bit of conclusion. That was a lot of time about why it's great, and then there are the pitfalls. One thing — most of this we talked about already, I guess — is that a lot of them don't pay attention, so these attention checkers are really important. A lot of them don't understand the instructions, and I think part of that is that I'm used to writing instructions for Harvard undergrads. When you're writing instructions here, you need to put a lot of time and effort into making them as clear as possible, so that non-native English speakers, and people that are not paying that much attention, can understand them. And the comprehension questions are super important.

Another big issue, which is not really an issue in the lab here, is non-random attrition — which is to say, when you assign people to different conditions, you want to make sure that people are not dropping out differentially. You could imagine, for example, that if you have two conditions — one where they just do something, and one where they have to look at some horrific pictures of dead bodies and then do it — a lot of people that start the dead-body condition are going to say, I don't want to do this, and just leave. And then you no longer have random assignment to conditions, because you have systematic variation across conditions: the only people that complete the dead-body treatment are the ones that really are into seeing pictures of dead bodies, or whatever, you know.
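One cheap diagnostic for that non-random-attrition problem is to log everyone who starts, not just everyone who finishes, and compare completion rates by condition. A sketch, with hypothetical file and column names:

    import pandas as pd

    df = pd.read_csv("starts_log.csv")  # one row per person who clicked in

    # 'completed' is 0/1; big gaps in completion rate across conditions
    # mean the finished sample is no longer randomly assigned, exactly
    # the dead-body-condition problem described above.
    print(df.groupby("condition")["completed"].mean())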
Another thing — this is just a very specific issue — is that you can block workers, and we thought that this would be a great way to handle a repost. We were doing another experiment, and we didn't want people from the first experiment participating in it, so we said, oh good, we'll just block all of the workers that participated in the first thing. We blocked, I don't know, 2,000 workers or something like that, and we got about 5,000 really angry emails, because Amazon sends them a message saying, you must have been doing bad work, because somebody blocked you — and if they get three blocks, they can get kicked off of Mechanical Turk. So it's really bad. Don't block people. And then there's this ethical question of how it feels to run a sweatshop.

I'm wondering — this is maybe an oddball, left-field question — to what extent does this stuff catching on crowd out the willingness of people to contribute to other projects for free?

I think that they're really different.

Like, how many Amazon Turkers are also contributors to Wikipedia? My theory is, I suspect, not many.

I have data on that, actually. In the last experiment we ran, with 1,300 or 1,500 people or whatever, one of the questions we asked was how regularly they contribute to Wikipedia. I haven't looked at it, but I can do so shortly — I could probably do it live in about one minute.

But I just wanted to conclude, before I run out of time, with all the different people involved here at Harvard. John introduced me to Turk, and Richard introduced me to John, so those are the important ones. Shao Shi was an undergrad who did a lot of the initial development of studies for us. Ofra Amir is a graduate student at Ben-Gurion University with Kobi Gal, who used to be here at Harvard and has been running the stakes studies. Yochai, who was at Berkman, is really interested in doing a lot of these things. I'm running some stuff with Nicholas Christakis and Sam Arbesman on cooperation games, and also with Thomas Pfeiffer, trying to do prediction markets and things like that on Turk. And I should say also that I'm giving another talk this afternoon, from 2:30 to 4:00, at the experimental economics seminar, which is completely different from this — it has nothing to do with this — on leniency and forgiveness in a world of errors, why that is a self-interested and good strategy, and how that's what people do in experiments, and they do well by it.

That means you're giving, on average, my talk today...

What's my reservation talk number? I'm not getting any wages for giving these talks, so I'm really in bad shape here.

Yeah — so I didn't mean to cut it off. I guess we have maybe a few more minutes; I just wanted to make sure that I got to that. Any more questions?

You put up "anonymous," and you've mentioned anonymity several times, in the sense that this really is one-shot and there will be no down-the-line consequences. Are there differences between bringing someone into a room and telling them the person in the next room is anonymous, versus logging someone in on Turk?
It's easier to believe that you're anonymous on Turk, because you really are physically separated from the other person in a way that two undergraduates in adjacent rooms may not feel they are.

It's a good question — I think it's a good empirical question. The problem is that it's hard to answer, because there are a lot of things that vary between those two settings. So I don't really know. I could imagine it going either way: there's the logic that you gave, but then there's the opposite logic, which is that the Turkers say, well, but Amazon knows who I am, and they can contact me again in the future if they want. So I don't have a good sense.

You'd have to bring the experiments to 4chan or something.

So I'm thinking about all the different answers everyone seems to give for the reservation wage — you said yours seemed to be high, and yet there are other experiments with low ones, 14 cents, I think. It seems that part of it is that surely there are people who know how to exploit Amazon Turk and have become experts at it: finding the high-paying tasks and just doing those — the ones where they can say, I'm going to make money on this — and ignoring all the ones that pay some minimal amount. And if you have that kind of behavior, it seems like conducting one-off surveys, where you ask people what they would accept as a meaningful payment, is going to get a biased response. That could explain why you got a 14-cent reservation wage for the study where the options were one cent, five cents — something really small: the people who have this rule of not even looking at HITs with a low payment will just ignore it.

Yeah, although I think the thing that was sort of shocking in that one was that the original task they signed up for was 30 cents for a chunk of text, which is a reasonable wage. So that isn't a non-random-attrition kind of issue, right? You have everyone do that first 30-cent task, so you're getting everyone that is interested in transcribing text, and then from there you say, of those people, here's our next offer — do you want to take it or not? And even of those people, 50% were taking the one-cent transcription. I don't really know what to make of it. But I'm certainly sure that it's true that there are different types of workers, and that there are these — in some sense — really successful Turk workers who have figured out how to maximize the amount of money they get. In my question of what was your favorite HIT, one class of favorites that some people talked about was, my favorite HITs are the ones where you can do lots of them really quickly. But then other people said, my favorite ones are where you get to play a game on a map cooperatively with other people, and things like that. So again, there's this heterogeneity of motivations.

Do you think you could get people to pay you to do it? I noticed lots of Europeans did this for fun. That's pretty interesting.

I mean, I think there's a logistical problem — I don't think Amazon has it set up to let them contribute into your account. But that would be a super interesting question. It would have to be pretty fun, I bet.

Thanks a lot.