Today I want to tell you about a project in the works to use data science to help verify elections. Although this project started in earnest only in October, we are going to have a publicly available version of it up and running at Verified Voting for the November election.

I first started thinking about this project, using data to help verify elections, back in 2005. At the time I had what seemed to me a pretty simple idea: take the number of ballots cast in each polling place, compare it to the number of voters who signed in at that polling place, and make sure that the number of ballots wasn't bigger than the number of sign-ins. That would be a check on whether the election was run correctly. Well, I thought it was a pretty simple idea, and it turned out to be a very complicated one. The math is simple, it's just subtraction, and the technology seems simple. But you have to have the power to get it done: the power to get the data, to get it in the right format, and to get it in a time frame in which it's useful.

Some thanks are in order. I would like to thank the National Science Foundation for funding, and in particular Jeremy Epstein for believing in the project. I would like to thank my team at Portland State, Eric Tsai and Raghu's from Gavarapu, and also the staff and faculty at the Hatfield School at Portland State University. I'd like to thank Verified Voting, and in particular Marian Schneider, for supporting the project and having the vision to make it a reality even on short notice. And finally, I'd like to thank the organizers of DEF CON for fulfilling a lifelong dream I never knew I had, namely to be active in the chat of my own presentation. So thanks to everyone.
Let's start with a real story from a real election. In 2018 there were congressional elections all over the country, including in North Carolina, and in the Ninth District there were some questions after election day. One piece of simple data analysis in particular had a big impact. It was a bar chart, and it looked like this. What you can see in this chart is that on the absentee ballots, something was very different in Bladen County from all the other counties.

Now, anomalies are quite common in elections, and there are all kinds of explanations. Anomalies never tell you what happened; they only tell you where you might look. And it's perfectly possible, just from looking at the graph, that there's a reasonable explanation. For example, maybe the Republican candidate grew up in Bladen County, lived in Bladen County, was a contributing citizen there for a long time, had made a lot of relationships, and had done better there for real, concrete, idiosyncratic reasons. But as people who knew the district looked at it more and more, they couldn't find an explanation like that. They couldn't find an explanation that made sense in terms of the differences they knew were there between the counties.

This data analysis was part of what raised the profile of the problem in this contest. The Democratic candidate pushed for an investigation, and when the investigation was done, they found that in fact someone had cheated on the absentee ballots in Bladen County. They found enough compromised ballots that the North Carolina Board of Elections decided not to certify the results and to have a do-over. Do-overs in elections are really rare. They are expensive, they're upsetting, and for a while the citizens of North Carolina's Ninth District did not have a representative in Congress. But it was the right thing to do, and that's
not a happy story while it was happening, but sometimes that's what verifying an election looks like.

There are two really important details about this story. One is that the investigation happened before the Board of Elections made its decision about certification. If there had been some kind of investigation after certification, it probably wouldn't have made a difference. That's because of the way elections are: at some point you really have to decide who won and move on. Someone takes office, someone takes power. So the critical time for elections is before the results are certified.

The second thing about this story is that the investigation happened, and was taken seriously by the Board of Elections, because a candidate insisted on it. The losing candidate, in fact, insisted on it. As technologists wanting to use data to help verify elections, that's a point we can't afford to forget, because it's the candidates who have the power and the standing to sue in court or to press a Board of Elections to do an investigation. So if we want to have a high impact, we want to somehow dovetail our interest in free and fair elections with the candidates' self-interest, with the candidates' interest in winning.

I've done a lot of both partisan and non-partisan work in elections. When you're taking the non-partisan side, it's really tempting to think of partisans as somehow evil, or compromised, or not as morally grand as non-partisans. But in fact in America, and I would guess all over the world, the real hard work of holding elections and election administration accountable is done by the people who have the most skin in the game. It's done by the partisans. It's done by the people who are competing to win the election. So consider the most powerful thing we can do as technologists in applying technology to this particular issue: how do you make sure that investigations that should happen do happen?
We do it most effectively when we keep in mind the question of how we can help candidates, and in fact losing candidates, insist on investigations when those investigations are justified.

What we've built, and what will be live online for the November election (it will actually be available beforehand for folks to play around with), is a system that allows a candidate, or really anybody, to look at visualizations of election results before certification. It allows them to look at anomalies like the one in Bladen County, if they exist.

Some people might say that this is going to open up a big can of worms, because there are lots of anomalies in elections. Elections, in my experience, are full of anomalies, and in my experience most of them are completely legitimate. So are we encouraging mass hysteria? Here's how I view it. The people with the most power to push for investigations really are the candidates. Candidates tend to be in the candidate bubble, and I can tell you from personal experience: you always think you're going to win, and you always think you did win. But the staff of the campaign, and the people who back the candidate morally, politically, emotionally and with their dollars, tend to have a more reasonable view of things, and candidates don't often insist for long on investigations into problems that aren't really problems.
I really trust candidates to do that, and I am also willing to accept a few candidates who maybe won't. I still think it's worthwhile, because if the anomaly is there for a real reason, an investigation should turn that up. My goal is to increase confidence in the election results after the election, and part of building that confidence is the idea that if there were investigations that should have happened, they happened.

Here's how the system works. You choose a state; let's say you choose North Carolina. Then you choose a contest type: you could look at presidential, congressional, or state house contests. Once you've done that, you pick a particular contest or a group of contests. If you had chosen the contest type congressional, you could pick all congressional contests or any particular congressional contest.

What you get then is some bar charts that the system has chosen to present to you. The system is presenting anomalies of interest. If you ran this on the North Carolina 2018 data, the system would show you the Bladen County anomaly as one of the most anomalous and most interesting charts. If you look at the top chart, it's in a slightly different order: Bladen is on the left because it is the anomaly. And if you compare it to the original chart that was published by the political science professor and campaign consultant in North Carolina right after the election, you see that it's the same chart. It's absentee-by-mail ballots accepted, by county, and something anomalous happened in Bladen County.

So that's a case where the anomaly really makes a difference in the outcome of the contest. And you can see that it's an important anomaly if you look underneath the bar chart.
You can see it says votes at stake 110, margin 900. This is a measure of how much impact this anomaly has on the margin. So if indeed it's an anomaly that's due to something that could be undone, then that tells you how valuable it might be to the losing candidate, who in this case was Dan McCready.

Another anomaly that the system finds for North Carolina in 2018 is the chart on the bottom. Here the anomaly is that there seem to be no provisional votes in Randolph County, whereas there were provisional votes in all the other counties. This is for the Sixth Congressional District. It's not that important in terms of the overall votes at stake versus the margin, because you can see that the votes at stake were just about 25 and the margin was about 37,000.

So what's going on here? Well, it turned out that there were provisional votes in Randolph County in this contest. The North Carolina Board of Elections had them, they were part of the official results, and they were part of what you would see if you went to look them up online through the usual interface. They just hadn't made it into the data file that the Board of Elections provided for download. I want to point out that while the Bladen County anomaly was known before we did our work, and you have every reason to question whether I reverse-engineered the system so it would pop out, this anomaly no one had noticed before, as far as I know. When we pointed it out to the Board of Elections, they corrected the data file and posted a new one. So this shows that anomaly detection can help boards of elections with their process of continuous improvement. That's a second application of this work.

So that's the bar chart part of the system, where the system chooses anomalies to show you. There's another part of the system, which allows you just to play around comparing various counts in scatter plots. So, for example, we're still in 2018.
This is Florida. If you compare total votes cast in the United States Senate contest to total votes cast in the contest for governor of Florida, by county, you get the following chart. The interesting thing is that all the counties line up, but one of them is a little bit off, and that county is Broward County. In that county there was an undervote for US Senate. When people investigated, what they found was that the Broward County ballot was poorly designed. It had a design flaw that made it pretty easy for people to inadvertently miss the US Senate contest, and it resulted in a significant undervote. You can see that on this chart.

I want to end by showing you my absolute favorite scatter plot of all time. This is from Philadelphia's 65th Ward, by precinct, in 2011: my election contest against Marge Tartaglione, the famous Marge Tartaglione. You can see that there is a real outlier. I mean, that's a serious outlier, something like a hundred votes' worth. So if I were Marge Tartaglione, and if the margin had been small, which it wasn't, I would want to know what had happened in that precinct. Even if I were just a person who cared about free and fair elections, I might want to know what happened in that precinct. And if someone had come to me and said, "You know what, we saw this outlier and we're going to investigate it, and I hope you have an explanation," I would have said yes, I have an explanation. That is the precinct where my daughter stood outside all day and said, "Vote for my mom, vote for my mom, vote for my mom."

This points to a third application of this system, which is to political science: what really makes a difference in voter behavior?
One way to figure that out is to look at outliers and to get explanations for outliers. There's always something interesting behind an outlier.

Let me say a little bit about the anomaly detection algorithm. When I built the first working prototype of the system, I used the simplest anomaly detection algorithm that came to hand. Say you're taking a particular contest, by county, for some certain set of ballots. For each county, if you have three candidates, you have a vector in three-space: the vote totals for those three candidates. Or you can look at the percentage splits and have that be your vector in three-space. In any case, if you take the vector for each county, you can think of the counties as points in three-space. Then there's a very simple outlier detection measure called the z-score. You calculate for each point how far it is from all the other points, and then apply the z-score to those distances to figure out which point is an outlier. If you think about it in depth, the z-score is the wrong thing to use, because you don't have a normal distribution, and so on. But I threw it together to see if it would work. I ran it on the North Carolina data, and out came the Bladen County anomaly as one of the most anomalous slices in North Carolina. So that's a pretty good proof of concept.

Now we are doing something more sophisticated, and we're certainly interested in trying a variety of different things. The tension here is between two choices. One is to use percentages, because that levels the playing field between counties. Counties are of widely varying size in pretty much every state, maybe not geographically, but in terms of population, and the same is often true of precincts or any other subdivision you might use. So the argument for using percentages is that it takes away that variability, which isn't what you want to
look at anyway. On the other hand, what really matters in the end is how many votes are involved, so that feels like an argument for using vote totals and not percentages.

The way we're splitting the difference is that we use percentages to identify the outlying point. Then, if we have a point that is an outlier, we use the vote totals to calculate how many votes are at stake. Right now that means: what would the change in the margin be if you altered that outlier count to fit in better with the other counts? And what percentage of the margin is that change? In other words, what's the ratio of the votes at stake to the margin? If that ratio is one, that means that if this outlier is not a true, good vote count, and if it were changed, it might change the outcome of the contest. So that's clearly important. If that ratio is one percent, then that outlier might be interesting, but not from the point of view of whether it would change the outcome of the contest. From the point of view of the losing candidate, an anomaly that is one percent of the margin is not interesting at all.
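The two steps just described, percentages to find the outlier and vote totals to size it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code. The county names and counts are made up, and the function names are hypothetical. The z-score is applied to each county's mean distance from the other counties, in the spirit of the first prototype rather than the more sophisticated current method. And "fit in better with the other counts" is interpreted here as giving the outlier county the same vote split as all the other counties pooled together.

```python
import math
import statistics

def shares(votes):
    """Turn a list of vote counts into fractions of their total.
    Assumes the total is not zero."""
    total = sum(votes)
    return [v / total for v in votes]

def outlier_scores(counts):
    """Z-score of each county's mean distance from the other counties,
    measured on vote shares so that county size does not matter."""
    points = {name: shares(v) for name, v in counts.items()}
    mean_dist = {
        name: statistics.mean(
            [math.dist(p, q) for other, q in points.items() if other != name]
        )
        for name, p in points.items()
    }
    mu = statistics.mean(list(mean_dist.values()))
    sigma = statistics.pstdev(list(mean_dist.values()))
    return {n: (d - mu) / sigma if sigma else 0.0 for n, d in mean_dist.items()}

def top_two(counts):
    """Indices of the winner and runner-up, plus the contest-wide margin."""
    totals = [sum(col) for col in zip(*counts.values())]
    order = sorted(range(len(totals)), key=totals.__getitem__, reverse=True)
    winner, runner_up = order[0], order[1]
    return winner, runner_up, totals[winner] - totals[runner_up]

def votes_at_stake(counts, outlier):
    """Assumed definition: how much the margin would shrink if the outlier
    county had split its votes the way the other counties did."""
    winner, runner_up, margin = top_two(counts)
    others = [v for name, v in counts.items() if name != outlier]
    typical = shares([sum(col) for col in zip(*others)])
    actual = counts[outlier]
    actual_gap = actual[winner] - actual[runner_up]
    expected_gap = sum(actual) * (typical[winner] - typical[runner_up])
    stake = actual_gap - expected_gap
    return stake, stake / margin

# Made-up absentee counts per county: [candidate 0, candidate 1].
sample = {"A": [100, 100], "B": [210, 190], "C": [95, 105], "D": [160, 40]}

scores = outlier_scores(sample)
flagged = max(scores, key=scores.get)           # county with the highest z-score
stake, ratio = votes_at_stake(sample, flagged)  # votes at stake, share of margin
print(flagged, round(stake, 1), round(ratio, 2))
```

On this made-up data the county with the lopsided split is the one flagged, and the ratio of votes at stake to margin comes out at about 0.9, which is the kind of anomaly a losing candidate would care about.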
So we use this percentage of margin as a way of scoring which anomalies are likely to be of most interest.

Let me say a word about what data will actually be available. Back in 2005, when I was thinking about how easy it would be to compare ballots cast to the number of voters checked in at the polling place, one of the most naive things was that I had no idea how hard it was to get election data, never mind to get it in time, before certification. I had to actually sue the Commonwealth of Pennsylvania to get voter file data. I had to threaten to sue the Board of Elections in Philadelphia to get election results at the precinct level in electronic form. And eventually I had to run for the Board of Elections myself, which, by the way, I highly recommend, even though serving was simultaneously the best and worst experience of my life. It was very difficult, but it was important and very satisfying. The more technologists we have in office, in particular in offices that have some say over the conduct of elections, the better elections are going to be in this country.

So what data is available, what format it's in, and how quickly you can get it: all of this varies wildly all over this country. Some states, like North Carolina and Virginia, do an awesome job of putting the data out there, and other states don't. We will collect as much as we can, as fast as we can. That's a place where we can also use help. If people want to be part of the collecting, that's awesome.

While we're on the subject of collecting the data, go visit the code repository on GitHub.
You'll see that a good amount of it, most of it in fact, is dedicated to munging the data: taking the data from whatever format it arrives in and putting it into a common data format, so that the analysis algorithms we have will be applicable to all of the data we have. That's something I'm really proud of in the project already. It's never easy to take data from different formats and put it into a common format, and this makes the process as straightforward as possible.

So we are looking for people to contribute to the collection effort, but we are also looking for people to contribute to building the system. I should also thank the National Science Foundation, which funded the building of the back end. It's the back end that you'll see on GitHub, and you should feel free to take it and use it however you like.

Well, if you've stuck with this talk this long, then maybe you're even interested in contributing to the project. We would love that. We would love to have people build visualizations and analysis on top of the standardized data that we can now provide. We are looking for someone, or several someones, to build tools that will pull data from APIs into our common data format, and also tools that will take the data we have ourselves and put it out in the common data format that has been developed by the National Institute of Standards and Technology. We're also looking for people to help with documentation, just making sure our documentation is clear.

We're looking for people who are interested in merging in other data sets. There is a lot of potential here to do analysis based not only on election results but also on other election-related data, and even on data that maybe doesn't seem at first glance to be election related. I always think of the weather. It's one of these folk truths in elections that weather affects who comes out to vote. There will be data about COVID; there will
be data about the different ways that different jurisdictions have handled vote by mail. There's going to be all kinds of interesting data, and we really would love to find people who want to merge in some of that data and build analysis on top of that combination.

And finally, if you have experience building a successful open source community, it would be terrific to get your input and your help in building this as a long-lived open source community. We're focused now on 2020, but there's a lot of potential for growth here.
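For anyone thinking about contributing to the data collection side, here is a minimal sketch of what the munging step involves: reading results files that arrive with different column layouts and emitting records with one common set of fields. Everything in it is an assumption made for illustration. The layout names, the column names, the field names and the sample rows are all hypothetical; this is not the project's actual common data format, and it is not the NIST format either.

```python
import csv
import io

# Hypothetical mapping from each source's column names to common field names.
COLUMN_MAPS = {
    "layout_one": {
        "County": "county",
        "Contest Name": "contest",
        "Choice": "candidate",
        "Vote Type": "vote_type",
        "Votes": "count",
    },
    "layout_two": {
        "jurisdiction": "county",
        "race": "contest",
        "cand": "candidate",
        "mode": "vote_type",
        "total": "count",
    },
}

def to_common(raw_text, layout):
    """Parse CSV text in the given (hypothetical) layout into common records."""
    mapping = COLUMN_MAPS[layout]
    records = []
    for raw in csv.DictReader(io.StringIO(raw_text)):
        record = {common: raw[source].strip() for source, common in mapping.items()}
        # Counts sometimes arrive with thousands separators; store them as integers.
        record["count"] = int(record["count"].replace(",", ""))
        records.append(record)
    return records

# Two made-up files that describe the same result in different layouts.
file_one = (
    "County,Contest Name,Choice,Vote Type,Votes\n"
    'Example County,US House District 1,Candidate X,absentee by mail,"1,234"\n'
)
file_two = (
    "jurisdiction,race,cand,mode,total\n"
    "Example County,US House District 1,Candidate X,absentee by mail,1234\n"
)

records_one = to_common(file_one, "layout_one")
records_two = to_common(file_two, "layout_two")
print(records_one == records_two)
```

Once every source lands in the same shape, the same analysis code can run on all of it, which is the point of doing the munging up front.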