Okay, let's see. Thank you. So I thought what I could do is get a sense for who's here, so I don't talk too low level or too high level. So how many people have done computer science? Okay, wow. Because data science people often don't come from computer science, you know; they're usually reformed physicists and everything else. What about stats people? Okay. Physics? Okay, so that gives me a good idea. What about social sciences? Anybody here doing social sciences? Okay, wow. So feel free to ask questions. I'm going to start talking about things, and just interrupt me and ask questions when something's not clear. I'm going to start by depressing you guys first because, you know, everybody's talked about how data is great, so I'll start with depressing stories and then we'll try to cheer people up. Depressing story number one: there is a problem with lead poisoning in the US and other countries right now. Most of you probably know that a long time ago, until the 70s, the paint that was used had lead in it. Many other things had lead additives too. The problem is that when you paint walls with lead paint, it's usually fine for a while. But then after 10, 20, 30 years, the paint starts falling off the walls and lead particles are exposed in the air. They fall on the ground, which is fine for us adults; it doesn't do anything to us. But kids, when they're crawling around, pick it up and put it in their mouths. Lead goes into their bloodstream, and the body basically treats lead as calcium, so the bones absorb it, and the damage that happens because of that is irreversible. The damage lead does is pretty horrible: lower IQ, hearing loss, impaired attention, memory loss, motor control issues. And that happens because once lead goes into your bones, later on, instead of calcium coming out, lead starts coming out. So it cannot be fixed. There's no way to reverse the damage once it's done.
So every home built before roughly 1977 has lead paint in it, unless it's been rehabbed. Today, the way this problem is dealt with is surprising. What public health departments do today is this: most kids have to get a blood test at some age, typically before going to school. If they find high levels of lead in a kid's blood, the public health department gets a report. They go into the home and check for lead, and if it turns out there's lead, they ask the homeowner to fix it. That's wonderful for the kid who's going to live there next. Horrible for the kid who just got diagnosed, because nothing can be done. So the policy that's used today is to use kids as sensors to detect lead, because that's the most efficient way of detecting it. It's close to 100% accurate: whenever a kid is poisoned with lead, there is almost certainly lead paint in the home. They don't waste any resources looking for lead where it may not exist. But it's a horrible policy. That's the best they've been doing so far. I'll come back to that. So, depressing story number one. Depressing story number two: there's a large number of first-time pregnant teens all over the world, and typically this is a problem. There's a non-profit, the Nurse-Family Partnership, and they've been around for twenty-something years working with teenage first-time mothers who are going to give birth but don't quite know how to deal with being pregnant. They don't have a support structure; it's the first time they're pregnant. Often what happens is they end up quitting their jobs if they had one, they end up dropping out of high school, they potentially end up having a preterm birth. They don't know enough about immunization or the other things you have to deal with with kids. So, pretty horrible outcomes for a lot of first-time teen, young, single mothers.
So the Nurse-Family Partnership works across the country with these mothers, and what they do is pair up a nurse with a mother so the nurse can work with her and help with the economic outcomes as well as the health outcomes, both for the kid and for the mom. They start working during the pregnancy and continue up to age two for the kid. The problem they're struggling with, as they're growing, is that they have a really hard time trying to scale as more nurses come on board. There are too many risks these mothers are facing. How do I figure out which risk to focus on with each mother? How do I personalize the care, and how do I scale as I grow? And it's not just a Nurse-Family Partnership issue; there's a huge component of this in the ACA, Obamacare, and they don't quite have a good handle on how to deal with these problems. Third story. Let's skip that for now. So, this stat sounds like it came from a developing country: 30% of kids drop out of high school. That is the U.S. It's pretty similar to India; India is about a 35% dropout rate. And it's a pretty large number. And that's just the dropout rate; there are other things that happen. Kids take seven years to graduate, and that's often as bad in terms of outcomes as dropping out of high school. Then there is the under-matching problem: kids who are graduating and are qualified, they could go to a good school, but they don't apply to college, or they apply to a college that is far below what the student has the aptitude for. So, for example, they'll often end up going to a community college with a 10%, 15% graduation rate when they could have gone to a top-tier college. And there are hundreds of thousands of those kids around the country who either drop out, or finish late, or undermatch.
And right now, there's not much schools are doing to target their efforts. They have lots of programs that try to keep kids in school, but the programs are too generic, and they don't really have the resources or the tools to target their attention at the kids who need the intervention. The reason I'm giving these examples is that there's a common theme here: we're dealing with social problems that are very prevalent, that we're all facing, but very few people are working on them. And very few people are really looking at how we solve them in collaboration with the organizations that have these problems. Because, you know, I'm at the University of Chicago and I can't solve these problems by myself. I don't understand what the real problems are, and I don't have the data. And even if I had all the data and understood the problems, I couldn't go off and fix the lead poisoning problem or the school dropout problem. So what I want to spend some time doing over the next half an hour, 40 minutes, is give you some examples of these problems so you get a sense for what kinds of issues are out there that we think we can use data to help solve. And then, how do we create these collaborations with the organizations that have the problems, so we can help them solve the problems and make it easier for them? And then, how do we get students and other people involved in working towards that model? So let me give you a few examples of how you can at least attempt to solve some of these problems. The lead poisoning problem, if you take it as a computer science, machine learning, data science problem, is fairly straightforward to think about, at least to formulate. Solving it is hard. But the formulation is: right now, we've got all these different kids, and they get tested after they get lead poisoning.
And we've got all these different homes that get inspected after a kid is poisoned. So you can imagine, if you've done any machine learning: well, we can predict which kids are going to get poisoned, and we can predict which homes are likely to have lead hazards, and we can go and test those kids earlier and do preventative inspections. That part is kind of obvious, but the challenge is that the public health organizations don't have that understanding. They don't quite think about things in preventative ways. They often think: we've got this many resources to test these homes, and if I test a thousand homes, my performance is based on how many homes with lead I find. The more I find, the more my metrics go up, right? So if I'm doing preventative work, then I'm wasting some effort; I'm wasting time checking homes that don't have lead hazards. In Chicago, what happened was the public health department came to us about a year ago and said: well, here's a problem we're facing. We know we can do better. We know we can do some preventative work, but we have no idea what to do. So we worked with them. The good thing is that in a lot of these problems, there's a lot of data that already exists inside these organizations, mostly for compliance purposes, right? So they gave us access to 20 years of lead inspections, all the homes they've tested in Chicago. For every home that was inspected, we had the address of the home and the inspection results: whether there was a hazard or not, whether it was fixed or not. And then they had 20 years of blood test data for every kid in Chicago over that period, where you know who the kid is, when they were tested, and what the blood lead level, the BLL, was.
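To make that setup concrete, joining the two datasets, inspections and blood tests, into one training table might look like the minimal sketch below. The column names and toy rows are entirely hypothetical; the real Chicago schema isn't described in the talk.

```python
import pandas as pd

# Hypothetical inspection records: one row per inspected home.
inspections = pd.DataFrame({
    "address": ["12 Elm St", "34 Oak Ave"],
    "inspection_date": pd.to_datetime(["2001-05-01", "2003-07-15"]),
    "hazard_found": [True, False],
})

# Hypothetical blood test records: one row per child test.
blood_tests = pd.DataFrame({
    "child_id": [1, 2, 3],
    "address": ["12 Elm St", "34 Oak Ave", "56 Pine Rd"],
    "test_date": pd.to_datetime(["2002-03-10", "2004-01-20", "2005-06-30"]),
    "bll": [9.0, 3.0, 2.0],  # blood lead level, ug/dL
})

# Label: did the child's BLL reach the CDC reference value of 5?
blood_tests["elevated"] = blood_tests["bll"] >= 5

# Join each child's tests to any inspection history at the same address,
# keeping children whose home was never inspected (left join).
training = blood_tests.merge(inspections, on="address", how="left")
training["hazard_found"] = training["hazard_found"].fillna(False)
print(training[["child_id", "elevated", "hazard_found"]])
```

A real pipeline would also have to handle moves between addresses and the timing of tests versus inspections, which this sketch ignores.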
So given those two things, you can imagine you could take all the data and build a model that predicts which kid is likely to get lead poisoning. And it turns out when we do that, you get pretty good results. The graph on the left shows the distribution of scores against the BLL. The CDC sets the BLL thresholds: five is the cutoff, above five is bad, below five is not great, but it's okay. And then if you look at the ROC curve, it turns out you can do a lot better than random; this is the random line. Though what they're doing today, even a random policy might not be a horrible thing, because it would at least be preventative, right? Today the policy is to act afterwards, so you're not doing any prevention at all. Even randomly doing something is not a bad idea, but we think we can do a lot better. And we found a few things that were interesting. One is that when kids are born, they have pretty low blood lead levels; they're typically not born with lead. As they get older, between one and two, the levels start diverging. One to two is when they start crawling and start putting lead in their mouths. After two, the lead levels have completely diverged and nothing changes. And what we found is that, looking at this, we can actually predict at month three, month six, before age one, which kids are going to end up on the high trajectory and which kids on the low one. So the policy change you want to make is this: right now the policy is to test them before they go to school, at age five. Just predict instead, and shift that test to month six, month seven, month eight. That requires no extra resources and actually prevents these things from happening. So, based on these results... that was sort of the easy part.
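A minimal sketch of the kind of model and ROC evaluation being described, on entirely synthetic data. The features, their distributions, and the label rule below are made up for illustration; the talk doesn't specify the real feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in features (all hypothetical):
home_age = rng.uniform(0, 100, n)      # years since construction
prior_hazard = rng.integers(0, 2, n)   # any past hazard found at this address
bll_month3 = rng.gamma(2.0, 1.0, n)    # an early blood lead reading

# Synthetic label loosely tied to the features so the model has signal.
risk = 0.02 * home_age + 1.5 * prior_hazard + 0.5 * bll_month3
y = (risk + rng.normal(0, 1, n) > 4).astype(int)
X = np.column_stack([home_age, prior_hazard, bll_month3])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# AUC of 0.5 corresponds to the "random line" on the ROC plot.
print("AUC:", roc_auc_score(y_te, scores))
```

The point of the ROC curve in this setting is exactly the comparison to that diagonal: anything above it beats the random (and the test-after-poisoning) baseline.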
The hardest part in this was getting the data from them and creating the collaboration. Then, over two or three months, we were able to do some pretty quick work that showed them this. But the next question was: how do I take this result? Because this result by itself is good to talk about, but still nothing has changed. And the nice thing was what they were able to do with it, three different things actually. One thing they're doing now is using these models to figure out which homes to inspect. They have certain resources; they can inspect 100 homes a week. Which homes should they inspect? We can now target homes. One part is predicting which homes are going to have lead. The second is predicting which homes are likely to have kids or pregnant women, because just because a home has lead doesn't mean there's somebody in there at risk of lead poisoning. If there's no kid in there, eh, you can skip that home for now; it's a lower priority home. Ideally, every home would be fixed, but if there's no kid today, it doesn't mean there's no kid tomorrow or next week or next month. So you have to take historical data and predict which homes are likely to have kids moving in, or people getting pregnant. Once you've done that, you have a prioritized list for them to inspect homes and do preventative work. The second way this is being used is that it's being implemented into the electronic medical record system in hospitals. So when a pregnant mother comes in and gets a checkup, a flag goes up: this child could be at risk of lead poisoning. That way the public health department has enough time to go in, do a check, and do the remediation before the kid is born. The third thing they're working on is doing more targeted outreach. They've got some budget to do inspections in places that are not necessarily high risk, but the people have to ask for it.
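The prioritization step described above, combining a hazard model with an occupancy model to rank homes under a weekly inspection budget, could be sketched like this. The addresses and probability values are invented for illustration.

```python
# Each home carries two model outputs: P(lead hazard present) and
# P(a child or pregnant woman lives, or will live, there).
homes = [
    {"address": "12 Elm St", "p_hazard": 0.9, "p_child": 0.1},
    {"address": "34 Oak Ave", "p_hazard": 0.6, "p_child": 0.8},
    {"address": "56 Pine Rd", "p_hazard": 0.4, "p_child": 0.9},
]

def priority(home):
    # Expected risk: both a hazard and an exposed child must be present,
    # so a leaded home with no child ranks low.
    return home["p_hazard"] * home["p_child"]

# With capacity for k inspections a week, take the top k by priority.
k = 2
to_inspect = sorted(homes, key=priority, reverse=True)[:k]
print([h["address"] for h in to_inspect])  # ['34 Oak Ave', '56 Pine Rd']
```

Note how the home with the highest hazard probability (12 Elm St) still ranks last, because no child is likely to be exposed there; that's the "skip that home for now" logic from the talk.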
You can't just show up to a home and say, I'd like to do an inspection. So they use these models to go and target people and say: hey, you might be at risk, and we can do a free inspection if you want; here's how horrible the results are if you don't do this. So they're using these models to do preventative outreach so they can do those inspections. So that's one example. I'll talk about a few more things. The other example is the one about first-time pregnant teen mothers. When the Nurse-Family Partnership team works with them... I'm skipping ahead. There are a few different things they're looking at. One, they're looking at making sure the person stays with the program, because if she drops out, they can't do anything to help her. And depending on the risk of dropping out, they want to treat mothers differently. The second thing they're looking at is a bunch of other risk factors. So this is their dropout curve: everybody starts off enrolled, and people start dropping off, for neutral reasons, good reasons, bad reasons, and about 40% of the people stay enrolled in the program. So they have a 60% dropout rate, and typically about 25 to 30% of the people who leave do so for bad reasons, for undesirable reasons. Which moms are at risk of dropping out for bad reasons, so I can pay extra attention to them up front? Instead of spreading my effort evenly, I make sure they have a job, that they stay in school; I really focus on keeping them in the program. Or, if I know they're going to drop out and I won't be able to talk to them or contact them anymore, can I start getting extra information from them now, getting names or phone numbers of relatives or parents or neighbors? That can only happen in the window while they're still enrolled. Okay. So I can start doing that. So what we did working with them, this was an interesting story.
When we started working with the Nurse-Family Partnership... most nonprofits, the way they use data, typically, is to justify funding. If you talk to a typical nonprofit person, and this is actually somebody from there saying it: somebody gives you money, you generate a report with a bunch of numbers, here's what we did with your money. So when they first started working with us, what they were interested in was a report for Congress, to allocate funding to these types of programs through the Obamacare program. And they said: can you help us evaluate our impact so we can write this report? Sure, we can do that, and we helped them with it. Once we'd done that, we said: how about we help you improve what you're actually doing, rather than just justifying the money? They were a little bit skeptical. But after we'd done the work with them, what we produced was basically this type of system: when a client comes in, the nurse gets a risk profile for the mother. It tells the nurse the risk level for each outcome they care about, because their job is to improve all of them; this is a small list of those outcomes. But their job is to focus on all of them, and what they told us is: it's very overwhelming to look at all of these different risk factors. I can't treat all of them. If you can help me figure out which ones this mother is most at risk of, I can really start focusing on those. So for this mom, I really want to focus on dropout, but then also on school outcomes. She's going to be fine with her job; she's going to continue doing that. Whereas over here, the main thing for now: the CDC guidance says it's good, so they follow that, and the other things you don't have to worry about. And when we were working with them on this, what they said was: there are two really nice things about this.
One is that it really helps their nurses, especially the new nurses, focus their attention. The second thing it helps with: they have, and most nonprofits have, this problem where the people in the field hate collecting data because they don't get the point. I have to help these moms. I'm collecting data for you. I'm wasting my time writing things down when I should be helping the moms; that's my job. Especially when they see that the data is only being used to justify funding, right? This report goes out, and nothing happens. When we started producing this, they came back and said: oh, this is great, because now we can tell them why they're collecting this data. It's helping them do their jobs better and improve the health of the mothers they're responsible for. There's an extra incentive to collect the right data and make sure everything is there, because they're seeing a very direct benefit from the data they collect. So that's a really nice side effect, and it wasn't my idea; I would never have been able to come up with that. We were talking to the Nurse-Family Partnership team, and this was coming from them. It was a really interesting thing to see. So this is an example of predicting the dropout, and how early and how accurately we can do it. We can predict really well at the beginning of the program who's going to drop out, and we can predict really well towards the end of the program. The middle is the part where the predictions are not very good, and we're still working on this project right now to figure out how to improve that. There are really good early indicators of which people drop out, and really good indicators towards the end. It's right after birth that things shift a little, and you don't get data often enough; the frequency of the data starts going down because mothers start dropping visits and don't really show up.
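The risk-profile system described above, one model per outcome, with the outcomes ranked per client so the nurse knows where to focus first, might be sketched as follows. The outcome names, features, and labels are all synthetic placeholders, not the program's real variables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical outcomes the program might track.
outcomes = ["program_dropout", "leaves_school", "loses_job", "preterm_birth"]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))  # synthetic client features

# Train one classifier per outcome (labels are synthetic, tied to one
# feature each so every model has some signal).
models = {}
for i, name in enumerate(outcomes):
    y = (X[:, i % 6] + rng.normal(0, 1, 500) > 0.5).astype(int)
    models[name] = LogisticRegression().fit(X, y)

def risk_profile(client):
    """Score one client against every outcome model, highest risk first."""
    scores = {name: m.predict_proba(client.reshape(1, -1))[0, 1]
              for name, m in models.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

profile = risk_profile(rng.normal(size=6))
print("top risk for this client:", profile[0][0])
```

The ranking is the part the nurses asked for: rather than a flat list of risk factors, the profile says which outcome to work on first for this particular mother.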
So we're still working on trying to figure out how to improve the middle part, because that's often where a lot of the flux is happening. The last depressing story was about the 30% dropout rate. The way that project initially started for us: when we started the Data Science for Social Good program, I got a random email from the guy who runs the research and evaluation group in the Mesa public school district in Arizona. His email was basically: we heard you're doing this program, we'd love to see if we can get any help. We've got really smart kids in our school district who we think are really good, but they... Sorry, I'm off the slides, of course. Slides are optional; I'm not really using them. They're just to keep you awake, so when you wake up you know what slide I'm on. So, they have really good students, but a lot of them end up going to the local community college that has a 12.5% graduation rate. And they said: well, we think data can really help. We've heard about people using data to do this. We've been collecting data, and we have a copy of the software called SPSS, but we don't really know what to do with it. Can you guys help? Sure. So we started working with them, and we helped them identify which students are at risk of under-matching. That led to some work with a couple of other school districts. The one I'm going to talk about a little more is the Montgomery County school district, which is one of the larger school districts; it's in Maryland. They had a dropout problem. A lot of their kids were dropping out. Can we figure out how to identify them? Because, they said, we have a ton of intervention programs we run, but right now the intervention programs are just untargeted. We try to cover as many kids as possible so we don't miss anybody. But that means it's self-selected: people show up, and mostly the kids who are not going to drop out show up to these programs.
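The targeting problem here, ranking kids by predicted risk and measuring precision in the top slice you can actually act on, can be sketched like this. The features and labels are synthetic; as described later in the talk, the real predictors were grades, attendance, and demographics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 5000

# Synthetic stand-ins for two of the predictors mentioned in the talk.
attendance = rng.uniform(0.5, 1.0, n)   # fraction of days attended
gpa = rng.uniform(0.0, 4.0, n)
X = np.column_stack([attendance, gpa])

# Synthetic dropout label, loosely driven by low attendance and low GPA.
y = ((1 - attendance) * 4 + (4 - gpa) + rng.normal(0, 1, n) > 4).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
scores = model.predict_proba(X)[:, 1]

def precision_at_top(scores, y, frac=0.10):
    """Of the kids in the top `frac` of predicted risk, the fraction
    who truly drop out, i.e. precision in the actionable slice."""
    k = int(len(scores) * frac)
    top = np.argsort(scores)[::-1][:k]  # indices of the k riskiest kids
    return y[top].mean()

print("precision@10%:", precision_at_top(scores, y))
```

This is the metric that matters when intervention slots are scarce: overall accuracy is almost irrelevant if you can only reach the top 10 percent of the list.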
So that's really not the right thing. What we did with them was look at the existing system. They were very data-driven, which meant they had a team of people who would look at a lot of data, look at all the kids who are not graduating, and then come up with a set of rules: a traditional rule-based model. So they came up with a rule-based model, and they had a criteria. Today, the most advanced school districts do dropout prediction with a binary criteria: either you fit the criteria or you don't, so these are the kids who are going to drop out and these are the kids who are not. And the problem with that: one, we all know the problems with rule-based systems. They're brittle, they're hard to build, they're hard to maintain, and they don't give you a ranking. You have a binary yes or no; either you do something with all of them or you don't. So what we did was, again, get data from them about their kids: their test scores, their grades, their attendance. Typically, dropout is a function of attendance, grades, and demographics, so those are the three predictors. The idea was: can we predict, as early as possible, how likely somebody is to drop out or not finish high school in time? And again, this is a pretty large school district, so you care less about how correct you are overall. You care about: I can only take action on X percent of the kids; I'm not going to be able to reach all of them. If I can only take action on the top 10 percent of the kids, how many of those kids are going to drop out? What's my precision in the top 10 percent of my predictions? And when you look at that graph, the blue line is predicting, in sixth grade, dropout from high school. The rule-based model they had built was ballpark 40 percent precision. So if they had 10,000 kids and they acted on the top 1,000, only 400 of
them were actually going to drop out. That was their existing efficiency. What we were able to do, with pretty off-the-shelf algorithms, random forests, taking all that data, was easily double that precision. For them that was surprising, because they thought they were doing a really good job. When they started working with us they were a little bit skeptical; they told us towards the end, well, my boss told us to work with you guys, but we didn't think you were going to really do any better, because we've been doing this for a long time. What we tried to convince them of was: it's not that we know something you don't; it's not magic. You guys have done a lot of the hard work of collecting the data and putting it all together, and you have a lot of intuition. Put that in as input and let the computer find these things. So we took a lot of the same features in their model and just tried a different approach. The other thing we tried to do was not just predict whether they're going to drop out, but the urgency of the action: when they're going to drop out. Because if somebody is going to drop out in three years, there's only so much you can do right now, and you might not want to prioritize them over the kids who are going to drop out in the next month, the next three months, six months. So one aspect was the binary prediction; the other was predicting the urgency of the action, when that event was going to happen. Based on this work, one of the things we're doing now is working with about eight or nine different school districts around the country to see if there's something we can generalize. Is there a generalizable model that can be used to predict risk outcomes and behaviors in kids: not finishing school, not going to college, other risky behaviors? Can we do something that works across the board and that can
also be useful for smaller school districts that don't have a lot of kids? If you've only got 50 kids in your high school class, you're never going to have enough data to build these models for yourself. Is there anything we can do that transfers from other, larger school districts? Is there something inherent about the kids that we can transfer over that helps with predictions? We don't know; it's too early for us to say. But the idea is: one, can we do something general on the model side; two, can we do something general in the modeling pipeline? Can we build these things so that we can tell schools: if you put your data in this form, the system will take it from there, start building these models, and show you results that you can use to start improving things? Because they're never going to have the resources internally to do this. So how do we take the approaches we're taking and make them scalable enough for schools to run themselves? And then there are more results looking at early predictions; they look pretty reasonable in terms of what you would expect, and I can point you to some papers we have on these results. Abstracting a little bit: as we work with these organizations, with government agencies, with non-profit partners, we're realizing there are three reasons why they haven't been doing this work before. One reason is they just don't have access to people who can do this work. A lot of us, if you take people in universities, often don't have the right incentives to do applied work that has an impact. We're incentivized to teach and to get tenure, and none of those involve, at least in computer science, having any impact on society, which is unfortunate. And in industry you don't really have the right incentives either: you can either go work at Google or Facebook, or you can do something useful, and often it's the former that people end up
doing. So one problem is not a lack of people who care, but a lack of people who are actually doing the work. Everybody inherently cares, I think; it's just that turning that care into something tangible is often really hard, for various reasons, and we don't make it easy to make that connection. The second problem is that even if we had the people, even if all of us said, you know, we want to do this, and you go to an organization and say, I want to help you do better and I can help you with data, often they'll say: yeah, I have some homes you can paint, and there's some yard work you can do, and you can go read to kids. Not that those things aren't valuable, but we can have a much bigger impact with the kind of work we can do. Often they don't quite know how to take advantage of us, and one of the reasons is that they don't have use cases; they don't have stories they've heard from their peers about this work. When they hear examples, they hear examples from Netflix and Amazon and Microsoft and Google and Facebook, and they think: well, that's not us; that's not for us, that's for them. Then they might hear examples from my Obama campaign days, but they still say: well, you guys had a billion dollars; you can do that, we can't. So there's a lack of examples they can relate to that would make them excited about this. And the third problem, which is connected to part of the first, is that there are no tools out there today that are customized for their needs. If you're a large phone company, AT&T or T-Mobile, and you want to build a system to predict who's going to leave you as a customer, there are a billion tools out there for churn prediction. It's already prepackaged: you put in your data, it builds a model. It may not be great, but it's good enough. If you're a retailer and want to do demand prediction or pricing, there's a tool out there. But if
you're a non-profit or a government, you have SPSS. That's what you start with, or you start with raw Python and R. You start from such a basic level, and you don't have the resources internally to start solving problems, because everything is too generic for you. So even if you have the people and good intentions, you don't have a starting point; it's just too hard to really get there. So the way we've been structuring the work we're doing, at least at the University of Chicago, is trying to see: can we get rid of some of those barriers? Can we, one, train people to work on these problems? A large part of it is not that people don't know how to work on problems; a large part is that they're not exposed to these problems. Even me: until I was doing this work, I couldn't really tell you what the big public health challenges in the world are today, or the big challenges in education or economic development. I could tell you how to make Google search better; I could tell you how LinkedIn could do better and Twitter could do better, because I have experience as a consumer. We have ideas about the things we use. But most of us don't consume social services; most of us are not at risk of these things, so we don't really have an understanding of these problems. So the first part was really: how do we make people more aware of these problems, and make the link that the skills we have are actually useful in solving them? It's not unskilled work; the work we do can actually have an impact on these problems. I'll talk a little bit more about the program we run in a couple of minutes. The second thing we started doing is looking at how we create collaborative projects that give not just the people we work with examples, but their peers too. How do we give their peers examples: here's a real project that was done by a real partner who's a peer of yours. And how can
we use that as a case study to get their peers really excited about working on these things? Because that's often the hardest thing: there are a lot of people who are interested in helping solve the problem, but getting access to the problems, to the resources that help us understand the problem, and actually doing something with that, takes a lot of effort. And the third part was: as we work on these projects, can we create reusable open source software that people can use beyond us? That's the kind of work we've been trying to do over the last couple of years. One example of a project we started is called the Data Science for Social Good summer program. The summer program was initially designed basically for people like me, from a computer science background, who care a lot about data. At some point I cared about computer science primarily, and then data second, and impact was something in the back of my mind, not something I was going out of my way for. So: how can we take people with good intentions, who are interested, and make it easy for them to do this work? This was a program we started in Chicago two years ago, and there were three goals. One was to take mostly students, grad students coming from different areas, and first have them understand how to solve problems, but then expose them to social problems so they hopefully continue to work on these. The second goal was to work with governments and non-profits and help train them a little, expose them to how to use data to solve a problem. And the third, which is probably the more important goal, and still a long-term one, is to really build a community of people and organizations who are working together on this. It's not about the University of Chicago; it's not about the handful of people we
have doing this program there but it's how do we see the larger group of people and how do we help them do this because individually we can't really get this done and so some numbers from the past couple of years we've had a little bit of structure of the program but we had 36 people our first year, we had 48 the second year and it runs in the summer 12 weeks and we get students from bunch of different universities typically all the computer science and stats and public policy one of the main things that's interesting here that was different for me personally was trying to bring in the set of people and the top right right so particularly if you look at the computer science community which is sort of where I'm from machine learning side we care a lot about prediction once we're predicted we watch it happen so it's really great we admire our predictions being correct but we can't change anything we stop at the prediction level and we watch that happen then there's the social science community which if I have to caricature it's I care about behavior change I just want to do group A, group B test my ideas and see which one works better and apply the idea to everyone so that's sort of the opposite of the computer science side where I'm not predicting anything I have really good ideas and I have theories but I could put the two together and actually be much more effective and what we've tried to do is bring these people together because we think if you really want to solve a problem that's a real social problem you need to have all those different people working together you need to have computer scientists social scientists and public policy people because things have to get implemented as policy it can't just be people working and building software so we'll be trying to build both an understanding of everybody has an understanding of what each group does so that we can really create people who understand all the different aspects they're not going to be experts in all 
of them, but they are exposed to what other people do and they work together on a problem with them. What we end up doing is bringing students in from different universities and putting them in teams, and each team has a project in partnership with a government agency or a nonprofit. These projects span education, public health, energy, sustainability, economic development, and community development. Some of the projects are in Chicago, some are international; we do projects with governments and nonprofits. The idea is to have enough of a portfolio that everybody coming in is really excited about some issue they care about and wants to work on it, because the goal is training and we want to make sure people are really excited. So we get projects from everywhere.
Here are some of the projects. I've already talked about the project with the public health department and the project with school districts, so I'll give you some others. There was a project we did with Enroll America, the nonprofit that is enrolling people into the Affordable Care Act. They initially did things smartly: rather than going out, calling everybody, and saying "enroll in Obamacare," they built a model that predicted which people are not enrolled in insurance, so they had a list of people with their probability of not being enrolled and could start calling those people. It turns out that's a good start, but just because somebody is not enrolled in insurance doesn't mean they can be persuaded to enroll. You could call the people who are most likely uninsured; that's the prediction part, I can predict they're not enrolled. But if I can't change their behavior, calling them is a waste of effort, at least right now. Or if the tactic I have is calling them and telling them how good it is to enroll and that it could be cheaper, they might not be receptive to it. So what we worked on with them was predicting which people are persuadable. The approach we took was very similar to what we did in the 2012 election campaign: the idea there was, I can predict whether you're going to vote for Obama or not, but then I need to predict whether you're persuadable, so I can try to persuade you. The way you do that is you run an experiment: you do a pre-test, you actually go in and persuade people, then you do a post-test, and you build a model that predicts which kinds of people are likely to increase their probability of enrolling and which are likely to decrease it. The ones you predict are most likely to increase their probability are the most persuadable, so you can predict how likely somebody is to be persuaded by a given contact with a particular tactic. The second thing we helped them with: let's say you can predict they're not insured, and you can predict they're persuadable, but if you're just making phone calls, what if they're simply not reachable by phone? You're wasting your effort again, because they're not going to pick up; they're not at home in the daytime. So the next thing we worked on was predicting contactability: how reachable people are through different channels. They only had two channels, phone and in person, but you can imagine doing it for email, TV, Facebook, Twitter. Looking again at the data they had about phone calls, whether people picked up and whether they responded, you can build a model that predicts who is likely to pick up the phone in the daytime, because that's the only time you're supposed to call people on their landlines, which nobody really has anymore. The idea was to go from the initial models to adding these pieces, so you can design a program that says: I've got 100 people making 5,000 calls today; who should I call to best use the resources I have to increase the number of people enrolled in Obamacare?
Another project was with the government of Mexico, dealing with maternal mortality. The UN has these Millennium Development Goals, one of which is about maternal mortality rates, and Mexico is one of the countries that hasn't reached the targets it was given. So they came to us and said: we've been trying to reduce this rate, but it's not going down; we've tried, and we really have no idea why. This was one of the more exploratory projects we took. We want a balanced portfolio, and we don't want too many vague, open-ended projects because they may not fit a 12-week summer scope, but this is one of the ones we took. We got birth data, mortality data, hospital data, and clinical record data, and we looked for potential indicators of high maternal mortality in different areas. It turned out there were a few: one was the kinds of hospitals women were going to and the kind of insurance they had; the conjunction of the location, the hospital, and the insurance plan was a very big predictor. With observational data we couldn't come up with any causal theories, but we could come up with hypotheses about what could be an indicator, and then use that. One of the things they're doing now (that's where I'm going after this) is a pilot with UNICEF: they're giving cellphones to these women and running an SMS program that alerts them to go to the hospital for certain things at the time they're supposed to go. That came out of looking at the data and saying: these are the moms who need to be enrolled in this program and
then this is what we think we want to tell them, and then running an experiment with a few different treatments to see whether those hypotheses were actually correct, which we couldn't do with just observational data. I'll have more results once the pilot starts in the summer; it runs for about six months. Let me show you a quick video that describes some of the other projects, with other people describing them, so you don't get bored listening to me.
"For me, the most exciting thing to see over these last 12 weeks has been the gradual realization all the fellows have had that they actually can have an impact on the world." "We've been working with the Office of the President of Mexico to figure out new strategies and key actionable policies it can implement to reduce maternal deaths in Mexico." "It literally has effects on people's lives and will help save lives." "We worked with Nurse-Family Partnership to help them identify clients who drop out early, so that they can help young mothers and children." "The sophistication of the modeling will allow us to develop tools that nurses can use in the field. It's been extremely valuable; I can't put a price on what this is worth to us." "We worked with the Chicago Department of Public Health, using predictive analytics to identify and remove lead hazards in Chicago homes before children are ever exposed." "We worked with Montgomery County Public Schools and developed a system to identify students who are at risk, as early and as accurately as possible." "We worked with the World Bank Group, developing methods for detecting collusion and fraud in contract bids that can help guide future investigations." "We worked with the Chicago Alliance to End Homelessness to help those in need find stable housing." "We worked with a village park district here in Illinois to develop tools for homeowners to get insights on their energy use." "We worked with Health Leads, providing actionable insights to help more low-income patients get the social services they need." "We worked with the Harris School of Public Policy to develop tools for automatically identifying earmarks in congressional bills and to make public a historical database of earmarks." "We worked with Skills for Chicagoland's Future, using data from CareerBuilder to help reduce unemployment in Chicago." "We worked with Enroll America and Get Covered Illinois to develop new techniques to find uninsured Americans and sign them up under the Affordable Care Act." "I'm from Nigeria in West Africa, and a lot of what I was interested in was using data science skills to tackle policy problems we have in Nigeria." "Something I'm passionate about is access to information, and after this fellowship I'll be working at the Wikimedia Foundation." "This fellowship helped me better understand the needs of nonprofits and the ways they can leverage their data to improve their social service programs." "This is just the beginning, for us and for the data science world."
So let me talk a little about the structure of the program, partly to give people ideas of how we're doing it, but also to get feedback as we learn: how can we do it better, and what have other people done that would be useful to learn from? We try to have fellows leave with a set of skills, not as experts but at least exposed to them, ranging from computer science and programming, to stats and machine learning, to dealing with data, to how to run experiments. In the computer science world, experimental design means running a simulation on your computer overnight with offline data; very few people in computer science actually run real experiments. I can criticize that because I'm part of that community: in grad school I would never run any experiment; an experiment would have been something running on a computer overnight. Then databases: how do you deal with data. And then the almost most critical parts are these two pieces. One is problem formulation, because problems never get handed to you in a form where you can say "this is an optimization problem" or "this is a regression problem"; we spend a lot of time teaching people how to take a problem a nonprofit or a government agency has and turn it into something you can actually solve, and that takes a while. The other is talking to people about what the solution is: how do you use it, what are the limitations, what are the caveats, and how do you do this well. Our summer is spent working on projects and teaching these skills, some through the projects but a lot through workshops and tutorials, so that people leave with at least an idea of how to do some of this work.
The other thing we focus on is the research side. We're finding there's a lot at this intersection that hasn't been looked at. Until the past couple of years, these two communities have mostly been doing very similar things in some areas, but very separately, and there are things you can do to bring them together. A lot of machine learning research ideas come out of looking at social science problems, and vice versa. So there's a lot of new work happening, not necessarily as part of the summer but beyond it, expanding into work that can lead to new tools for social scientists, and new problems and approaches for machine learning people interested in solving real large-scale problems. I can talk more about that later if people have questions, but that's basically what I had, and I'm happy to take more questions
but if people are interested, for this summer it's already too late. There will be a next summer, though: we're going to have a 2016 program, so if you're interested in the fellowship, we'll be running it again. And here's the plug: we're also always looking for postdocs and people who build software to come work with us, full time, part time, whatever capacity people are interested in. So I'm happy to take questions and talk more about any of the things I've covered, more projects, the program, anything else. Yes, you had a question? Sorry, behind you first.
Yes, it's a marketing video, it's a marketing video. So, different projects have different levels of sustainability, and just like everything else it's a function of funding. What the nonprofit gets, at a minimum, is that they have a specific problem we help partially solve. None of these problems are solved; three months is just not going to solve the problem, and that's not the goal either. That's why, when I listed the goals of the fellowship, nowhere was there a goal that says we will solve these problems and deliver finished things to people. If I wanted to do that, I wouldn't do it at the University of Chicago; I would have some sort of external organization that hires full-time people to do it. That goal needs to be met, but this program is not for that. The nonprofit is initially getting a better sense for what they could do; they're getting a partial solution to a larger problem they face; they're connected with other nonprofits who they now know share their interests; they're getting a peer network; they're getting connected to this community. One of the things we do over the summer is a lot of happy hours where we invite local nonprofits, governments, and the tech community to mingle and get to know each other, and they end up hiring a lot of people from that community, because they never knew they existed and didn't know how to reach them, and vice versa. For a small number of these projects we continue: with the public health department we're actually working right now on implementing things; we're building stuff into the City of Chicago's infrastructure. So some projects are sustainable, some we hand over and they continue the work, and some get dropped at the end of the summer; those we really treat as training projects, both for the organization and for us. It's a spectrum: on one end it's implementation, on the other end we give them something they learn from, improve, and go off to other things. That's a limitation we're trying to address: how do we work with other people to transition these projects? What I think is that the initial part is hard, it's high risk. Once you've proven this is a solvable problem and you've done a prototype, hopefully other people can take it over from there with a better business case: we've done this part, now scale it. We're trying to figure out who to partner with for that kind of transition.
No, they're getting paid. We're competing with the companies I talked about, so fellows can go there or come here. We can't compete on the amount of money, but we absolutely do pay them.
On the Kaggle comparison: I think that's a really good question, and each of the models has its pros and cons. Start with the Kaggle model. The Kaggle model hasn't worked for Kaggle, which is why they're not doing well, not as a nonprofit but just as a company. None of the things produced in Kaggle competitions have gone anywhere, because none of
the organizations release complete data. A lot of these problems involve data about individuals, very private data that can't be released in the open, and that limits the applicability of a solution built only on data that's publicly available. So I think the Kaggle model as an idea, a scalable crowdsourced model for doing these things, is really good, but what it focuses on is the modeling part. The data is nicely structured, it's clean, it's all there; the problem has been formulated; the evaluation metrics have been defined. All the hard work has been done, and what's left is the for-loop over each modeling approach. So the Kaggle model makes a lot of sense; the Kaggle implementation, I think, is not a model that will work. The scoping and the formulation, figuring out what the problem even is, that's the bottleneck. It's not that we can't run for-loops over data to predict something; we all know how to do that, and that's what Kaggle competitions are. To be fair, people really do spend a lot of time, and that's why the same people end up winning; they've got really robust for-loops, so I'm being unfair in minimizing their contribution.
I think the DataKind model is interesting, and they're also trying to figure out what the right model is. They started with the DataDive model, and they realized the sustainability doesn't happen: the weekend ends, and then what? What was great about the initial DataDive model is that it got communities together; it got people in a city to meet each other so they could continue the work afterwards. But the thing produced over the weekend didn't really go anywhere, which is why they're now hiring an in-house team of data scientists to do dedicated work with these nonprofits rather than relying on the weekends alone. They're still doing the DataDives to create those communities, and I think that's great because it gets more people involved. The pros are more people involved, more nonprofits involved, more awareness; the con is sustainability. Same with our model: the pros are that a lot of people get involved, students and nonprofits, and we do things outside the summer as well; but the con, again, is that sustainability is a big problem, and that's something I don't have a good answer for right now. Ideally I think we need all of them. We need students involved and people out in the workforce doing this; some people should be dedicated to it, some should do it in their free time, some should do fellowship stints of three months, six months, a year. All of this needs to happen, and it has to be sustainable; corporations' CSR money should maybe go towards skilled work, as opposed to the unskilled work most of it goes to right now. So that long answer is to say: I don't know. I think all of them are required. Everybody's trying things; there's another organization called Bayes Impact, which we were talking about earlier, doing similar work with a different model. At this point there are so few people doing this that the more it happens, the sooner we'll figure out the right model. Right now it just needs more people getting involved, and the goal is to give people ways to get involved. Nobody has a perfect model that lets anybody get involved easily, with low overhead, and produces something sustainable that moves forward. Nobody has that right now.
On how fellows get matched to projects: it's partially a mess. We select fellows first, and right now is the time; we're almost done, with 39 fellows who have said yes, and we're waiting for a couple more to get back to us. What we then do is we
send out a list of projects we have scoped for this summer, we ask people for their top 3, 4, 5 projects they would like to do, and we do a matching. It's a big, messy optimization where we're trying to make sure everybody gets something they're excited about, but also that each project has what it needs: if a project needs NLP, at least one person on the team should know it; a team shouldn't be all computer scientists, they shouldn't all be from the same school, and there should be the right mix of seniority. We have all these criteria, and we try to find an assignment people are happy with. The good thing is that all of this happens in the same space, everybody's together, so people working on one team might find another project they're really excited about and work on it in their spare time; people start collaborations with others they meet and keep working with them. Right now a lot of people from the past couple of years are working with each other on other projects, in their local communities, with other nonprofits, on research. So the projects are important, but it's really those other things that end up dominating.
On data: one thing that's common across most of the projects is pretty sensitive data; most of these are not public data sets. If it's school data, it's covered by FERPA, the federal education privacy law; if it's public health data, it's HIPAA. They all involve personal data about people, and that's something we're very particular about, because if what we produce can't be used by the partner, there's no point in us doing the work. So we require them to give us access to all the data they have, which means we keep the data on servers rather than laptops, and ask fellows not to download it. In terms of size, I would say about 25% of the projects are ones where we do need large scale. With Enroll America, covering every person in the country, we had terabytes of data. For some other work, with CareerBuilder, we were looking at all their resumes and all the job postings, trying to find the gaps; we ran MapReduce jobs over the text data to do the extractions and annotations in parallel, and then the comparisons and record linkage. Then we've got the homeless alliance in Chicago, which has a hundred facilities and data about 20,000 people; that's a few tens of gigs, maybe 5 or 10. So we've got small data sets you could maybe handle on a powerful laptop, most projects need servers, and then two to four projects a year need larger machines. Everything we do is typically on AWS; we just go run things there. One, it's more secure; two, if you need to scale and do MapReduce we can do that easily; we've got database infrastructure there, everything set up. But overall the scale is on the smaller side of what I've done in general; we're not dealing with terabytes regularly, and most projects are in the tens of gigs, because that's where the nonprofit world is right now. A typical organization doesn't generate that much data.
The next question was about whether a three-month project can really influence decision makers in nonprofits and eventually make funding available. Yeah, I think that's a good point. That is certainly happening; it's always slower than you would like, but faster than it has been. For example, with the public health
department: we went to them, we talked to them, and their biggest problem was HIV and STIs; that's the biggest public health problem urban areas are facing right now. But that team wasn't that interested in working with us: "I don't know what you guys can do; we're fine, we don't need it, there's no problem." The lead team, though, said "yes, we'll take your help." Lead was important, though not as important, but that team was more excited, so we were very opportunistic: it's a big enough problem that we could take it, even if we could only do so much in three months. Later, the HIV/STI team came back to us: "we heard you did this with the lead team; we would really love to work with you." And the public health department's whole orientation became "we want to do everything preventively; we've been doing this reactive thing." Their commissioner started talking about it. He ended up leaving the public health department for the Trinity hospital system, the third-largest hospital system, which is putting a billion dollars into community development, partly for good reasons and partly because if they can improve the health of their communities, it's cheaper to serve them; it makes perfect business sense. He came back to us and said: "I'm doing this community development work; can we work together, and can you identify other people around the country who would be interested in helping us do this data science work?" That's one anecdote, but that's the idea: we're very opportunistic; we don't go in with an agenda, we just want to help, giving them use cases and examples. One of the reasons for producing these videos is to show them the kind of breadth we cover without going deep: we're not taking over public health, because we don't have expertise in public health; we still need you, but we need to collaborate. So we're seeing that happen, and we're seeing it spread. With the school district work, we did it with one district, they connected us to another, that one connected a third, and now we have 10 or 12, and they're asking us to present at their national conference, where they're bringing in a bunch of their evaluation and research people. What we're trying to do is keep it very bottom-up, letting them see these things, and hopefully that will have some impact. Local government is where it's going to happen; federal will change, but it's not going to make an impact anytime soon. Local is where you'll see these things happen.
On handing over the models: what we've been trying to do as much as possible is build things so that we can still maintain some of the work, having them call our API, so we update the model and take in their data feed. That requires more sustained effort, but we know that if we just hand the model over to them, nothing happens. So we try a bit of both: we train them a little on how it works, so it's not a magic black box, but we continue to update things. I don't think it's going to be a handover; I think it's going to be continuous involvement, because they have more capacity to hire data people than modeling people. They might be able to hire a programmer who can do stuff with data, or an Excel person who can do stuff with data, but they won't be able to build and maintain the models. So one of the things we've been trying to do is get them to focus on getting their data right: collecting more data, more frequent data, better data. If they're focusing on that, we can help them with the second piece. Right now they're not ready for
us to hand over Python code that does something, so as much as we can, we explain to them what it is, but then treat it as: here's the input, here's the output, use it in a way you understand.
Yes, that's right. I thought you were going to go somewhere else, so I'll answer both your question and the question I thought you were going to ask; two for one. Starting with yours: the goal is to do something real, and in reality you are at time T and have to make a decision about tomorrow, but then tomorrow you have more data. So even when we're doing validation, the way we explain it to these organizations is that we're simulating this world. Just as you do cross-validation, in a temporal world you predict some distance into the future, ask what your model update time should be, and evaluate that way to figure out the optimal update time. The question I thought you were going to ask: there are a lot of commonalities across the projects I talked about. There's the retention problem, which in the corporate world is the churn problem: can I predict whether you're going to come back? There are resource-ranking problems, where I'm allocating resources and saying: I only have this much money, who do I go after? That applies to economic investment in places, to student retention, to other things. What we're trying to do is build open-source pipelines for the common problems nonprofits and governments face, and give them to people. We're working on them right now, and this summer we'll have the incoming students start using them; with the use cases we have, we'll see whether we can build things that at least satisfy those projects, and then expand from there, so these organizations have an idea: if you're dealing with this kind of problem, here's the data you need and its structure; put it in this form and we will give you these kinds of outputs. The intention is reusable components that predict different things. The next step is to think about how we help them with behavior change; that's less about software. We can suggest experiments and give them the methodology, but it's not about building software. The low-hanging fruit is places where the actions are fixed: that's the ranking problem. If I can just produce the right ranking and you have a fixed intervention, you go from the top down and stop when you're done. That's the low-hanging fruit right now. The next step is helping them understand how to run experiments: why a control group is actually useful, what the right way to do this is, and what you can do without fully experimental data.
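Two recurring technical ideas in the answers above, simulating deployment with time-based validation splits and ranking by predicted risk so a fixed budget of interventions goes to the top k, can be sketched in a few lines. This is an illustrative sketch, not the actual DSSG pipeline; the data, field names, and dates below are all made up for the example.

```python
from datetime import date

def temporal_split(rows, train_end):
    """Split rows into train/test by date, so the model never
    sees the future period it will be evaluated on."""
    train = [r for r in rows if r["date"] <= train_end]
    test = [r for r in rows if r["date"] > train_end]
    return train, test

def precision_at_k(scored_rows, k):
    """Fraction of true positives among the k highest-scored rows:
    the natural metric when interventions are budgeted (top-k ranking)."""
    ranked = sorted(scored_rows, key=lambda r: r["score"], reverse=True)
    return sum(r["label"] for r in ranked[:k]) / k

# Toy data: one record per month with a model score and true outcome.
rows = [
    {"date": date(2014, m, 1), "score": s, "label": y}
    for m, s, y in [(1, 0.9, 1), (2, 0.8, 0), (3, 0.4, 1),
                    (4, 0.7, 1), (5, 0.2, 0), (6, 0.6, 0)]
]

train, future = temporal_split(rows, date(2014, 3, 31))
print(len(train), len(future))       # 3 3
print(precision_at_k(future, 2))     # top two scores are 0.7 and 0.6 -> 0.5
```

Sliding `train_end` forward and repeating this split is what "simulating the world" means here: each evaluation mimics a real deployment date, and comparing results across different gaps between training and prediction gives an empirical answer to the model-update-time question.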