Live from Stanford University, it's theCUBE, covering the Women in Data Science Conference 2017. Hi, welcome back to theCUBE, I'm Lisa Martin and we are at the second annual Women in Data Science Conference at Stanford University. Such an inspiring day that we've had so far, and right now we're joined by Megan Price, Executive Director of the Human Rights Data Analysis Group. Megan, welcome to theCUBE. Thank you. It's so exciting to have you here. Megan, your background is in statistics, you have a PhD as a statistician. The Human Rights Data Analysis Group, HRDAG, is focused on statistical analysis of mass violence. Talk to us about sort of the merger of your biostatistician background with human rights. Was that something that you were always interested in? Sure, it was, and I have to say I was really lucky. I got my bachelor's and my master's in statistics from a very technical engineering school in Ohio where, honestly, a lot of people would sort of pat me on the head and say, that's nice that you're interested in human rights, you'll outgrow that. And fortunately I had one very thoughtful mentor who said to me, I really think public health school is the direction you should go in, and so I got my PhD in biostatistics from a public health school, and it was really there that I was exposed to people who kind of said, yeah, social justice, human rights, do that as a day job, get on it. And so it was really great that I was exposed to that as something I could move into as a career. Exposed to that, but also you had the confidence. You obviously had a mentor that was very influential, but it takes some courage and some guts to go, you know what, yeah, this is needed. It's true, yeah. So talk to us about HRDAG, we talked about it a little bit before we went live, the evolution. Share with our viewers how it's evolved to what it is today. Sure, so the organization, the name and the work started with work that my colleague, Dr.
Patrick Ball, started doing in El Salvador and in Guatemala in the 90s. At the time, he had formed a team to do the work at the American Association for the Advancement of Science. And so that was about 25 years ago. And then the work evolved and the team just kept moving to wherever the right home was to get that work done. And so in the early 2000s they moved out here to Palo Alto, just up the street, to Benetech, another technical non-profit, and they provided us a really nice home for our work for nine years. And then in 2013 the time had really come for Patrick and me to spin out HRDAG as its own non-profit organization. We're fiscally sponsored right now, but we're our own institution, which we're really excited about. So you mentioned some of the projects that Patrick was working on. What are some of the things that were really compelling to you specifically within human rights that are the catalyst for the work that you're doing today? Sure, I think that there are a lot of quantitative questions that get raised in looking at widespread patterns of violence and asking about accountability and responsibility for that violence. And to answer those questions you have to look at statistical patterns. And so you need to bring a deep understanding of the data that are available and the appropriate way to analyze them to answer those questions. From an accuracy perspective, I understand that that's incredibly vital, especially where these important issues are concerned. How does HRDAG mitigate inaccuracy issues with respect to data?
Yeah, well, we're always thinking about each of our projects as taking place in an adversarial environment, because we ultimately assume that at the end of the day our results are going to be subjected either to the kind of deep scrutiny that comes along with any socially and politically sensitive topic or to the kind of scrutiny that happens in a courtroom. And so that's really what motivates the level of rigor that we require in our work. And we maintain that by maintaining our relationships with mostly academicians who are really pushing these methods forward, and by staying on top of what is the most cutting-edge approach to this problem and how we can be as transparent as possible about the way the data were collected, the way they were analyzed, the way they were processed, and the limitations of those analyses, you know, the uncertainty present in any estimates that we put out. Can you give us an example of some of the data types and data sources that you're evaluating, say for the conflict in Syria? So in the case of Syria, we have relationships with four organizations that are all collecting information about victims who've been killed in the ongoing conflict in Syria. Those groups are the Syrian Center for Statistics and Research, the Syrian Network for Human Rights, the Damascus Center for Human Rights Studies, and the Violations Documentation Center. And those are all citizen-led groups that are maintaining networks and collecting that information to the best of their ability. And they share with us largely Excel spreadsheets that contain names of victims and then any other information they were able to collect about those victims. You mentioned university collaboration a minute ago from a methodology standpoint. Give us an insight into, you're getting data from these various sources, largely Excel, and we know that with Excel comes humans, and sometimes, oops. How are you working with universities to help evaluate the data?
Or what are some of the methodologies that they're recommending, given the data sources and the tools that you have? So there are really two stages that the data go through. The first one is within the groups themselves, who do that first layer of verification. And that is the human verification, prior to all the risks of data entry problems. And so they're doing the on-the-ground work of making sure that they've collected and confirmed that information. But then you're absolutely right. We get this data that's been hand-entered, with all of the risks and potential downsides of hand-entered data. And so primarily what we do is fairly conventional data processing and data cleaning, to check for things like outliers and contradictory information. And we'll do that using Python and using R. And then where our friends and colleagues in academia are really helping us out is that, because there are these multiple sources collecting names of individual victims, what we have is a record linkage problem. We have multiple records that refer to the same individual. And so we work a lot with our academic partners to stay on top of the latest ways to deduplicate databases that might have multiple entries that refer to the same person. And so that's been really great lately. Okay, what are some of the methods that you've used in Syria to quantify mass violence, and what have some of the outcomes been to date? So we rely primarily on methods from record linkage, and that gets us to what we know and can observe. And then from there we need to build and estimate what we don't know and what we can't observe, because inevitably in conflict violence some of that violence is hidden. Some of those victims have not been identified, or their stories have not been told yet. And it's our job as data scientists to use the tools at our disposal to estimate how much we don't know. And so for that step we use a class of statistical tools called multiple systems estimation.
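The record linkage step described above can be sketched in just a few lines. This is a deliberately minimal illustration, not HRDAG's actual pipeline: the field names, example records, and 0.75 similarity threshold are all invented, and production deduplication uses far more sophisticated trained models.

```python
from difflib import SequenceMatcher

# Toy records from two hypothetical documentation groups; the field names,
# values, and similarity threshold are all invented for illustration.
records = [
    {"source": "A", "name": "Ahmad Khalil",  "date": "2013-05-01", "location": "Aleppo"},
    {"source": "B", "name": "Ahmed Khaleel", "date": "2013-05-01", "location": "Aleppo"},
    {"source": "B", "name": "Sara Haddad",   "date": "2013-05-03", "location": "Homs"},
]

def names_match(a, b, threshold=0.75):
    """Fuzzy comparison of two name strings, tolerating spelling variation."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def deduplicate(recs):
    """Greedy linkage: a record joins a cluster when dates agree and names are similar."""
    clusters = []
    for rec in recs:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's first record
            if rec["date"] == rep["date"] and names_match(rec["name"], rep["name"]):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

clusters = deduplicate(records)
print(len(clusters))  # 2 -- the two spellings of the same victim collapse into one cluster
```

The point of the sketch is the shape of the problem: two lists that each contain "the same" victim under different spellings must be collapsed before any counting, because double-counted records inflate every downstream estimate.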
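The intuition behind multiple systems estimation can be sketched with its simplest two-list special case, the Lincoln-Petersen estimator. Real MSE work combines more than two lists using log-linear models and reports uncertainty; the counts below are invented purely for illustration.

```python
# Two-list capture-recapture (Lincoln-Petersen), the simplest special case of
# multiple systems estimation. All counts are invented for illustration.
n_a = 400    # victims documented by list A
n_b = 300    # victims documented by list B
n_ab = 120   # victims appearing on both lists (found via record linkage)

# If the two lists capture victims roughly independently, the overlap rate
# n_ab / n_a estimates the probability of appearing on list B, so the total
# population N satisfies n_b / N ≈ n_ab / n_a, giving:
n_total = n_a * n_b / n_ab          # 1000.0 estimated total victims

documented = n_a + n_b - n_ab       # 580 victims observed at least once
hidden = n_total - documented       # 420.0 estimated undocumented victims
print(n_total, documented, hidden)
```

The overlap between lists is what makes the unobserved estimable: the less the lists agree, the larger the hidden population implied by the model.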
Essentially, multiple systems estimation builds on the patterns in the data as they're collected by these multiple sources to model what the underlying population must have been in order to generate what we were able to see. And so that's been the primary analysis we've done in Syria. And what we found from that analysis is that, as valuable and important as the documented data are, they are often overwhelmed. For example, when violence peaks, it may be too dangerous, and it may be impossible, to accurately record how many people have been killed. And so we need a statistical model that can help us identify when the data we observe seem to plateau but our estimates tell us, no, in fact that was a very violent period. And then we can dig in with field experts and interpret, well, is that a time when we know that territorial control was in contention? Or is that a time when we know that there were clashes between certain groups? And so then we can infer further from that about responsibility for violence. So applying some additional attributes, things that are contributing to this, what are some of the differences that you think this has made so far? What I hope this has done so far is simply to raise awareness about the scale of the violence that's happening in Syria. And what I hope ultimately is that it helps to attribute accountability to those who are responsible for this violence. You've also got some projects going on in Guatemala. Can you share a little bit about that? We do, yeah, we have a couple of projects in Guatemala. The one that I've worked on most closely is looking at the historic archive of the National Police in Guatemala. And that's actually the project that I started working on when I joined HRDAG. Guatemala suffered an internal armed conflict from 1960 to 1996. And during that time period, many witnesses came forward and said that the National Police force participated in the violence.
But at the time that the UN, the United Nations, brokered the peace treaties, they weren't able to find any documentary evidence of the role the police played. And then in 2005, quite by accident, this archive, this cache of the police force's bureaucratic documents, was discovered. And so we've been studying it since then. And it's been this really fascinating problem: you have this building full of millions and millions and millions of pieces of paper that are not really organized in any way, and how do you go about studying that? And so we partnered with other experts from the American Statistical Association to design a random sample of the archive so that we could learn about it as quickly as possible. What are some of the learnings that you've discovered so far? Well, what we've discovered so far is just the sheer magnitude of the archive, and in particular the number of documents that were generated during the conflict. And then the other thing that we have discovered is the communication flow, the pattern of documents being sent to and from leadership within the National Police force. And specifically, Patrick Ball testified about that communication flow to help establish command responsibility for the former chief of police for a kidnapping that occurred in 1984. Wow, incredibly impactful work. But you've also got some things on the domestic front. Share with us a little bit about what you're working on stateside. We do, yeah. In the past year, we've started our first US-based project, which we're really excited about. And it's looking at the algorithms that are being used both in predictive policing and in criminal justice risk assessment. So decisions like whether or not someone should get bail, or pre-trial hearings, things like that. And we've been working with partners, primarily lawyers, to help assess sort of how those algorithms are working and what's the underlying data that's being fed into those algorithms.
And what are the ways in which that data are biased, such that the algorithms are replicating the bias that exists in the data. Tell me about how that conversation goes, as a statistician with a lawyer who is a business person. What sort of educating do you need to do about the impact that this data can make and how imperative it is that it be accurate? Yeah, well, those conversations are really interesting because there's so much education going in both directions. We are helping them to turn their substantive question into an analytical question, and sort of develop it in a way that we can do an analysis to get at that question. But then they're also helping us to understand the way in which this information needs to be conveyed so that it holds up in court and so that it establishes some sort of precedent, so that they can make policy change. Well, it makes me think of sort of the topic, or the skill, of communication. A number of our guests this morning on the program, and those that we've heard speaking today, talk about sort of the traditional data scientist skills, half hacker, someone with statistical and mathematical skills. But now we're really looking at somebody who also has to have other behavioral skills: being able to be creative, to interpret the data, but also to communicate it. I'd love to get your perspective as you've seen data science evolve in your own career. How have you maybe trained your team on the importance of communicating this information so that it has value and it has impact? Absolutely, no, I think creativity and communication are probably the two most important skills for a data scientist to have these days. And that's definitely something that, on our team, it's always a painful process, but every time we give a talk, if we're fortunate enough that it's been videoed, we always have to go back and watch that. And I recommend to my teammates to do it quietly at home, alone, maybe with their preferred beverage.
But that's the way that you learn and you discover, oh, I could have said that differently, or I could have said that another way, or I could have thought about a different way to present that, because I do think that that's absolutely vital. I'm just curious what your perspective is from a curriculum standpoint. We've got a lot of students here, we've got some professors here. Is that something that you would recommend as part of, if you look back to your education, would you think, you know what, being able to understand statistics is one thing, I need to be able to communicate it? Was that something that was part of your curriculum, or something that you think, you know what, that's a vital component of this? It's absolutely a vital component. It was not part of my formal curriculum, but it was something that I got out of graduate school, because I was very lucky that I got to teach essentially statistics 101 to introductory public health students. So they were graduate students, but a lot of them maybe hadn't had a math class in a decade and were fairly math-phobic. And so I really- Sounds like me. So we could, you know, hold hands and get through it together. Oh good. Exactly. And I really feel like that was what improved my communication skills: the experience with those students, thinking about how to convey the information to that class, and going in day after day and designing that curriculum and really thinking about how to teach that class is really how I learned my communication skills. Oh, that's fair. That real-world experience, though, there's nothing that beats that. What are some of the things that have excited you about participating in WiDS this year? Oh my gosh, it is so much fun to be in an audience, and to speak to an audience, that is so predominantly female. I mean, of course that's not something that we get to do very often. Right. And so young.
I mean, this audience is really full of very energetic women who are ready to go tackle the world's problems, and it's very invigorating for me. It helps me to kind of go back and think, all right, how can we do more and do bigger and create more opportunities for these folks to fill? It's a very symbiotic relationship, I think. They learn so much from you and you're learning so much from them. It's really nice. You can feel it, right? You can feel it here. Absolutely. In this environment. Well, Megan, thank you so much for joining us on the program today. We wish you the best of luck with HRDAG and your impending new little girl. Appreciate that. Absolutely. And we thank you for watching theCUBE. Again, we're live at the Women in Data Science Conference at Stanford University, the second annual event. Stick around, we'll be right back.