Hi, I'm Philip Stark. Thank you for coming to this virtual talk. I'm going to talk about testing ballot marking devices. This is joint work with an undergraduate student, Ron Shi, who visited Berkeley last year and did some research with me. Why do we need to test ballot marking devices? Well, they can print votes that differ from what the voters saw on the screen or heard through the audio interface. The idea of voter verifiability is not really refined enough to capture the security properties that we need in a voting system. In particular, the ability to catch an error, spoil the ballot, and request another opportunity to vote isn't enough to make ballot marking devices safe voting technology. For example, recent research by Bernhard et al. showed that only about 7% of voters notice errors that ballot marking devices have introduced into the printout. In effect, the security properties of paper are undermined by using ballot marking devices to mark the paper. There's a problem with the BMD security model in general: it makes voters responsible not only for their own errors, but also for the overall security of the system, without giving them the tools they need to do that job. In particular, there's no way for a voter to present any other party, including an election official, with evidence that a BMD misbehaved. So if a voter complains to a local election official, there's no way for the official to know whether the complaint reflects an actual malfunction, a voter error, or a cry of wolf intended to undermine trust in the election. As a result, error or malfeasance could change a lot of votes without raising any kind of alarm. Proponents of BMDs claim that they have a number of benefits, such as preventing overvotes, warning about undervotes, and eliminating the possibility of ambiguous marks. I think there are problems with those arguments. In particular, they assume that ballot marking devices function correctly.
And there are many recent examples of failures on a wide scale, including in the state of Georgia, in Northampton County, Pennsylvania, and in Los Angeles. Precinct-count optical scan systems can also protect against overvotes and undervotes; in fact, that's required under VVSG 1.0. So how can we figure out whether ballot marking devices actually worked adequately in a given election? We need to know that whatever errors occurred were not numerous enough to change the outcome of any contest in the election. Three different approaches have been proposed for testing ballot marking devices. One is pre-election logic and accuracy testing, where you look at a machine before election day, run some test patterns through it, and verify that it prints the right thing. Another approach is passive testing, where you look at something like the spoiled ballot rate and try to detect anomalously high rates of spoiled ballots as a possible signal that the machines are misbehaving and voters are catching it. And the third approach is parallel or live testing, where testers, periodically throughout election day or the early voting period, mark some ballots but don't cast them, and verify that what's marked on the printout matches their intent. And the point of our research is to show that none of these can, in fact, work in practice. Now, how much testing do we really need to do? That depends on how big a problem would make a material difference. And I have argued for a long time that a sensible threshold for materiality is enough errors to change the reported winner of one or more contests. That is, we'd like high confidence that whatever errors occurred, they didn't alter who won. Many contests in the US are decided by less than 1%, even statewide contests. For example, in the 2016 presidential election, the statewide margins in Michigan, New Hampshire, Pennsylvania, and Wisconsin were all under 1%, with Michigan as low as 0.22%.
So I'm going to frame this as a two-person adversarial game and think about what strategies are available to the two players. The evildoer is Mallory, who's trying to alter the outcome of one or more contests in an election. Mallory doesn't want to be detected: the point isn't to sow fear, uncertainty, and doubt; it's to get away with altering an election outcome. Mallory knows, in general, how the ballot marking devices will be tested, because that's a matter of public record; it's an action taken by the local election official. Mallory knows the state history of each machine. Mallory knows how voters are interacting with it: what votes have been cast earlier in the day, how long each voting session took, and so on. And Mallory has a good model of voter behavior, because Mallory can basically install spyware on voting machines and keep track of how voters interacted with those machines in previous elections and on into the future. In contrast, Pat is our tester. Pat is trying to make sure that any ballot marking device problem that alters one or more outcomes will be detected. Unlike Mallory, Pat has to obey the law and has to protect voter privacy. Pat doesn't know which contests Mallory will attack, nor the strategy Mallory will use to attack them. So this is a very asymmetric problem. All right, because the threshold for materiality depends on the number of votes it takes to alter an election outcome, it's important to keep track of how big, or how small, elections are in the United States. The median turnout in 2018, across the 3,017 US counties, was a little under 3,000 voters. There are fewer than 43,000 voters in more than two-thirds of US jurisdictions. And in 73% of states, more than 50% of counties have fewer than 30,000 active voters; that is, the median county turnout is 30,000 voters or fewer. In 92% of states, that number is 100,000.
That is, more than 50% of counties have fewer than 100,000 active voters. In 2019, only about 317 US cities had populations of 100,000 or more, out of more than 19,000 incorporated places. So if about 80% of the population is of voting age and turnout is about 55%, which is roughly what it's been historically, then contests for elected officials in something like 98% of incorporated places involve fewer than 44,000 voters. So we need to think about ways of testing in contests that involve fewer than 44,000 voters, and many contests will involve even fewer than 3,000 voters. The 2019 median population of US incorporated places is about 725, so about 50% of incorporated places have a turnout of fewer than 320 voters. All right. This is just giving an idea of how much of the country had a median 2018 turnout of less than 30,000 voters: most of the country by area. So what's Mallory's strategy space? How can Mallory figure out which transactions, or which votes, to try to alter? Mallory can pick based on a very large number of state variables in the ballot marking device: the time of day; how long the wait was between voters; how many people have voted on the machine already; how this particular voter interacts with the machine, including the selections, which contests the voter ignores, how many times the voter revises selections, how long the voter reviews selections, and whether the voter looks at every page of candidates in a contest; inactivity warnings; BMD settings such as font size and language; and whether the voter uses the audio interface or the sip-and-puff interface. All of these things are available to Mallory in trying to hack the election. Now here are some examples of just how many different possible voting transactions there are. I'm giving two columns of numbers; the more realistic column is pretty realistic for the United States.
Many jurisdictions have ballots that contain 20 or more contests, but we're going to use three as a lower bound. And similarly, you can look at the different variables Mallory could use to target things: the number of candidates per contest, languages, time of day, number of people who have voted, time per selection, the settings the voter uses (the contrast and saturation of the screen, font size, audio use, tempo, volume), and so on. So, conservatively, there are on the order of six million different combinations of settings that have some reasonable probability of being used. More realistically, there are something over 10 to the 47th possible voting transactions, a truly staggering number. There's no way to probe even a microscopic fraction of those using testing, whether pre-election logic and accuracy testing or live testing. So what can Pat do? That's what Mallory can do; what about Pat? Pat can monitor voter behavior in a non-invasive, non-privacy-invading way; in particular, Pat can look at spoiled ballot rates. And Pat can try to catch a malfunction by using the BMD before, during, or after an election, that is, by doing logic and accuracy testing, live testing, or a post-mortem. So Pat really does have to test at random in some way. If Pat tests in a way that's predictable, such as once an hour, or pulls only one machine aside and tests it, or tests only some combinations of votes or interacts with the machine only in some particular way, then, because Mallory knows what Pat's strategy is, Mallory can just avoid changing those transactions and hide. Similarly, Pat can't just set aside machines on election day for live testing: Pat needs to test the machines that are actually in use, or Mallory could detect that a machine is being used in a way that is not typical of voters. And moreover, because there are so many possible combinations of settings and ways of transacting a vote, uniform random sampling is doomed.
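To get a feel for the combinatorics, here is a back-of-the-envelope sketch in Python. The factor counts below are illustrative assumptions of mine, not the figures from the talk's slide; the point is only that the multiplication principle makes the transaction space explode even with crude buckets.

```python
from math import prod

# Illustrative (assumed) counts of distinguishable values for each
# state variable Mallory can observe. These are NOT the talk's exact
# factors; they just demonstrate the multiplication principle.
conservative = {
    "selections (5 choices ^ 3 contests)": 5 ** 3,
    "language": 5,
    "hour of day": 13,
    "voters so far (bucketed)": 100,
    "font size": 3,
    "audio interface used": 2,
}

total = prod(conservative.values())
print(f"{total:,} distinct transaction types")  # prints "4,875,000 distinct transaction types"
```

With 20 contests and finer-grained timing and settings variables, the same product rockets past 10^40, which is why exhaustive testing, whether pre-election or live, is hopeless.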
You really do need to sample more often from those transactions that voters use more often, in order to have a reasonable chance of sampling at least once from any set of transactions that contains enough votes to alter the outcome of one or more contests in the election. So ideal sampling would mimic voter behavior: it would sample from what voters actually do. So we're going to look at exactly that. Suppose we could mimic voters perfectly: how many transactions would we actually need to use as tests in order to have a good chance of detecting outcome-changing errors or alterations? It's important to note that in a jurisdiction-wide contest, changing the votes on 1% of transactions can typically change the margin by 2%, and if there are undervotes, you can change it by even more than that. And if the contest is only on a fraction of the ballots cast in the election, then changing even that small a percentage of transactions can change the margin by a much larger percentage. For instance, if you have a contest that only one in 10 voters is eligible to vote in, and the undervote rate is 30%, then changing the votes on 1% of transactions could change the margin in that contest by 29%. That's a lot of leverage. So, passive testing relies on voters noticing errors and spoiling their ballots. Now, in order to know how large a spoilage rate is enough to sound an alarm, we have to have a good idea of how often voters spoil ballots when the machines are functioning correctly. And then we have to know how often they will notice errors if errors happen, and whether they will report those errors, thereby requesting a new ballot and triggering an alarm. The problem is that that kind of training data is unlikely to be available, in part because you can't step in the same election twice.
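The leverage arithmetic in that down-ballot example can be checked directly. This is a minimal sketch; the function and its parameter names are mine, not the talk's:

```python
def margin_shift(frac_transactions_altered: float,
                 frac_ballots_with_contest: float = 1.0,
                 undervote_rate: float = 0.0) -> float:
    """Fraction by which the contest margin moves when the given fraction
    of *all* voting transactions has one vote flipped in this contest.

    Flipping one vote moves the margin by 2 votes, and the margin is
    measured relative to the votes actually cast in the contest.
    """
    votes_in_contest = frac_ballots_with_contest * (1 - undervote_rate)
    return 2 * frac_transactions_altered / votes_in_contest

# Jurisdiction-wide contest, no undervotes: 1% of transactions -> 2% of margin.
print(round(margin_shift(0.01), 3))              # 0.02
# Contest on 1 in 10 ballots, 30% undervote: 1% -> ~29% of the margin.
print(round(margin_shift(0.01, 0.10, 0.30), 3))  # 0.286
```

The factor of 2 reflects that each flipped vote both subtracts from one candidate and adds to the other, which is why 1% of transactions jurisdiction-wide moves the margin by 2%.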
There are all kinds of differences from election to election that are likely to change the spoiled ballot rate, including the complexity of the ballot, the ballot layout, the complexity of the social choice functions, and so on. Now, how do we set a threshold for when to sound an alarm if we're using passive testing? It's going to depend in part on the number of transactions Mallory alters, which votes are affected, which contests are affected, and so on. And Pat is not going to know any of these things. Pat needs to test in a way that is sensitive enough to detect a change to the outcome of any contest whatsoever. Now let's make some really optimistic assumptions, work through the numbers, and figure out just how much testing Pat would have to do, or how many voters would have to be voting in a particular contest, so that a change in the spoiled ballot rate would be noticed. Let's assume in particular that spoiled ballots follow a Poisson distribution with a known rate if there's no hacking and a different known rate if there is hacking. Now, there's no reason to assume that, except that it's a common model for such things. This is really just a thought experiment; it's not intended to be a realistic model of how voters detect errors and spoil ballots. We're going to look at contest margins of 1% to 5% and false positive and false negative rates of 5% and 1%. A false positive is saying that there's a problem when there really isn't one. A false negative is failing to notice that there's a problem when in fact one or more outcomes have been altered. Now here's how it plays out. This table is for a 5% rate of false negatives and false positives. Going across the top is the base rate of spoiled ballots when things are clean. And as you go from row to row, you're looking at the rate of errors in the printouts that would be required to reverse a margin of size 1%, 2%, 3%, up to 5%.
The detection rate we're assuming is either 7% or 25%; 7% is consistent with what Bernhard et al. found in their study of actual voters, in an experiment that wasn't an actual election. So, to achieve 5% false positive and false negative rates, you'd need on the order of half a million ballots or more, for a realistic rate of voters detecting errors and spoiling their ballots, to protect against altering a contest with a margin of 1%. That number goes down as the margin gets wider, but as I've already argued, a very large number of contests, including important ones, are decided by 1% or less. If we impose a more stringent threshold, requiring only a 1% rate of false negatives and false positives, then we would need on the order of a million voters or more in the contest to be able to detect an alteration of enough ballots to reverse a margin of 1%. So let's think about how big this number, half a million or a million, is in the context of actual elections. I'm going to use California as an example. Forty-one of California's 58 counties had fewer than 100,000 voters in the 2018 midterm election, so passive testing would not have worked for any of those. Thirty-three had fewer than 100,000 voters in the 2016 presidential election; so again, passive testing would not have given you an acceptably low false positive and false negative rate, even under these optimistic assumptions that everything follows a Poisson distribution with a known rate. So passive testing couldn't have protected contests with margins of 3% or smaller in those jurisdictions with 100,000 or fewer voters. Many California counties' turnout is so small that there would be no way to detect problems through spoiled ballot rates without an unacceptably high rate of false alarms; we would be invalidating elections left and right. Okay. So that analysis assumed that votes were being changed more or less at random, that every voter had some chance of having his or her votes altered.
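These orders of magnitude can be reproduced with the standard normal approximation for distinguishing two Poisson rates. The base spoilage rate of 0.5% below is my assumption for illustration; the 7% detection rate and the fact that reversing a margin requires flipping half that margin's worth of transactions come from the talk.

```python
from statistics import NormalDist

def ballots_needed(margin: float, detect_rate: float,
                   base_spoil_rate: float = 0.005,  # assumed base rate of spoiled ballots
                   alpha: float = 0.05, beta: float = 0.05) -> int:
    """Approximate number of ballots needed so that the bump in the
    spoiled-ballot rate caused by an outcome-changing attack is
    distinguishable from the base rate, treating spoiled-ballot counts
    as Poisson and using a normal approximation.
    """
    altered = margin / 2  # fraction of transactions flipped to reverse the margin
    r0 = base_spoil_rate
    r1 = base_spoil_rate + detect_rate * altered  # rate if the attack happens
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(1 - beta)
    n = ((z_a * r0 ** 0.5 + z_b * r1 ** 0.5) / (r1 - r0)) ** 2
    return round(n)

# 1% margin, 7% detection rate: roughly half a million ballots at 5%/5% error rates,
# and roughly a million at 1%/1% error rates.
print(ballots_needed(margin=0.01, detect_rate=0.07))
print(ballots_needed(margin=0.01, detect_rate=0.07, alpha=0.01, beta=0.01))
```

The sample-size formula is the usual one for testing a rate r0 against an alternative r1 with type I error alpha and type II error beta; the output lands in the half-million and million ranges the talk quotes.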
But in fact, Mallory has access to information that can be used to target the attack against voters who are less likely to notice problems, or less likely to spoil a ballot if there is a problem. In particular, Mallory could target voters with visual impairments or voters who are blind. Current ballot marking devices don't provide such voters any technology to check whether the printout actually matches the voter's intentions, that is, what the voter was told through the audio output or shown on the screen in a larger font size. So if 2% of voters have a visual impairment that would prevent them from checking the printout directly themselves, then Mallory could change the outcomes of jurisdiction-wide contests with margins of up to 4% without increasing the spoiled ballot rate at all, because those voters would have no opportunity to notice a problem in the printout. Or consider voters with motor impairments or limited dexterity that makes it difficult to handle a piece of paper. Some ballot marking devices have an accessibility feature that allows such a voter to cast the ballot without actually handling it. In some cases, those features don't even print the ballot until the voter has said, in effect, "cast this for me after you print it." Because that doesn't give the voter any opportunity to look at the piece of paper, the ballot marking device can cheat with impunity on ballots like that. So if enough voters use this auto-cast feature, relatively wide margins can be altered without any possibility of detection. Or consider voters who use languages other than English: if a voter is looking at the ballot on screen in one language, but it's printed out in English, the voter might be less likely to check the printout if the voter is not comfortable in English.
Moreover, if a voter who is clearly a native speaker of another language reports a problem with a ballot, it might be that some poll workers would be less likely to believe that the voter actually detected a problem with the device, and more likely to believe that the voter made a mistake. There are all kinds of things Mallory can monitor to target the attack: how much attention the voter is paying, whether the voter is in a hurry, whether the voter is reviewing selections, and so on. All right. There are other problems with passive testing. Among them, it becomes a really easy way to mount a fear, uncertainty, and doubt (FUD) attack: simply ask voters to spoil their ballots more often, casting doubt on the election. All right, so let's look at some oracle bounds now. Suppose that instead of relying on voters to run the tests, we try something like parallel testing, active testing, or logic and accuracy testing, where we input test patterns and look at what comes out. Suppose we had perfect knowledge of voter behavior; that is, suppose Pat could pick voters at random, look over their shoulders while they vote, and see whether the printout matched what the voter was presented on the screen or in the audio output. That's a best-case scenario that doesn't involve Pat having to figure out the distribution of voter transactions with the BMD. Yet even under those circumstances, Pat has to look over the shoulders of a fair number of voters to have a reasonable chance of noticing a problem in the election.
So for instance, if Mallory alters 15 transactions in a contest that has a little under 3,000 voters, which was the 2018 median jurisdiction turnout, that could change the margin of a contest by 1% or more, but Pat would need to look over the shoulders of at least 540 voters, about 18% of the capacity of the machine. That would involve testing each ballot marking device several times an hour; once an hour would not be enough. If you're limited to testing once an hour for 13 hours a day, then to have a 95% chance of catching a problem, you'd have to have over 6,500 voters in the contest, which is almost triple the median jurisdiction turnout in the US, and 20 times the median number of active voters in incorporated areas. In reality, Pat can't shoulder surf. Pat needs to make a model of voter behavior, and that model has to be calibrated to data. That would require monitoring voters in extreme detail, in all of the respects I mentioned before relating to voting transactions. That would compromise voter privacy completely, and would probably be illegal. Nonetheless, let's imagine that Pat had the budget to run an infinite number of tests using a particular model. How many voters would Pat need to observe in order to get a model accurate enough to detect an alteration of some fraction of the votes? To have 99% confidence of detecting a change to half a percent of the votes, even if Pat could conduct an infinite number of tests, modeling behavior well enough to detect things at that confidence level would involve monitoring three and three-quarter million voters in excruciating detail. And of course, their behavior in one election might not match their behavior in another election, and many of the jurisdictions you would need to monitor aren't big enough: they simply don't have that many voters.
As you look at larger margins and lower confidence levels, that bound goes down, but even for 95% confidence of detecting alterations to 5% of the votes, you'd have to observe more than a million voters. And these are very, very conservative bounds. If Pat were limited to, say, 2,000 tests, Pat would need an even more accurate model, and therefore would have to observe even more voters. So the message here is that you really have to monitor at least a million voters in excruciating detail to have a reasonable chance of picking up outcome-changing errors. All right, so the situation is really not very good. In fact, it's worse than this. Even if you were able to do that much testing, if you find a problem, the only remedy is a new election: you have no idea which transactions were altered or what the right outcome should be. Margins aren't known before the testing happens; if it turns out that a margin is smaller than what you allocated testing for while the election was going on, there's no way to go back and fix that. The tests themselves have uncertainty, and that means you really need to factor that in when deciding who really won. Is there really strong enough evidence that someone won if there's only a 95% chance you would have seen a problem that big? Moreover, this is going to require new systems, extra hardware, additional staff, and additional training. It's a very expensive proposition, even if it could be mounted at all. Our conclusion is also, as I mentioned before, that while BMDs are widely touted as helping some groups of voters, in fact BMDs pose an ideal vector for disenfranchising those very same groups of voters, because they interact with the machines in specific ways.
In short, there doesn't really seem to be a way to rescue the trustworthiness of elections if most of the votes, or a substantial fraction of the votes, are cast on ballot marking devices. Prudent election administration would minimize the number of voters who use ballot marking devices, reserving them for the voters for whom their accessibility features are a genuine help. Thank you very much for your attention.