Good morning, everyone. I think I'm going to get started just a touch early. My name is Tim Mackey. I'm a principal security strategist within the Cybersecurity Research Center at Synopsys, which sits in one of the three divisions of the company. Today we're going to be talking about privacy, consent, trust, and security within modern application development. This is a shortened version of a talk I gave at Black Hat a couple of weeks back, and it has also been tuned towards an open source scenario. The key takeaway I would love everyone to leave with is that there's a lot we can do as practitioners to improve the overall security of the data being processed if we start thinking about what it means to collect and process data. Now, I'm not a lawyer, I'm a techie guy, so my lawyers like me to say things like: please don't take everything in here as legal advice, because, well, a 30-minute session probably isn't the place for that. I'm going to give a whole bunch of examples of things we could be doing better and what the true implications are.

So, from a starting perspective, being a security target today is a hugely big deal. In a three-week period from May into June, we had a disclosure from a company called AMCA, which was breached. They happened to be the payment collections company behind LabCorp and Quest Diagnostics, which do the lab work and blood work your doctor might prescribe. We had records from Customs and Border Protection: faces and other information they had collected on travelers. And we had a whole bunch of information from a dating site that believed encryption was really just, well, "if I hash it and XOR it, we're good, right?"

Every year in July, for the last eight or ten years, IBM, in conjunction with the Ponemon Institute, puts out a report that says: here's what a data breach actually cost, based on the prior year's data. So this is what 2018's data looked like, reported this year. The average cost of a breach went up from $4.2 million to $8.19 million. Customer turnover went up quite dramatically. On the plus side - and this is a huge plus - for the first time in the last four years, the length of time it takes to identify and contain a breach has actually gone down. We're now only at 245 days, unless of course we're in healthcare, in which case we're at 412.

All of this frames the attack landscape we effectively live within. And as we have more and more data, and a more and more data-driven environment, regulators are saying: wait, hang on a minute, you guys have got it wrong. Most people in corporate application development have at this point come across something called GDPR, which is a European regulation - you may even see at our booth a little sign that says how we collect data, which is one of its requirements. Canada has their version, called PIPEDA. Australia has the Notifiable Data Breaches Act. India is in the midst of figuring out what theirs will be; there's a draft Personal Data Protection Bill from 2018. Now, while the preceding ones are very centered on what to do with the data being managed, the Indian one has an additional provision around data sovereignty. It basically says the authoritative data source for all Indian residents must be maintained in India. That's a bit of a change.
California has the California Consumer Privacy Act, which nominally comes online in January, but enforcement has been delayed through a series of amendments to maybe June, maybe July - I'm still a little bit squishy on that. And because all of this data is really sensitive, and we see large organizations like your Googles and your Facebooks and your Twitters having access to all of it, it's not just the regulations themselves, it's what happens next. Yesterday, a group of state attorneys general decided they're going to start probing, in a more serious way, what data management is actually happening within big tech. They're going to take a look at it not necessarily through the technical lens we would all love them to use, but through the lens of "I'm a politician and I'm a lawyer," and that might not produce exactly what we expect.

From an open source perspective, one of the interesting things that happened was the Equifax data breach. The attack vector was an unpatched instance of Apache Struts, and it brought a lot of focus on: well, what is open source? Who is the vendor of open source? Where did I get this from? Gee whiz, they don't seem to be doing security all that well, so maybe I need to go to a different vendor - what different vendor can I even go to for this open source? People asked a lot of questions that just didn't make a whole lot of sense, and in one part of our business we actually deal with that kind of thing downstream.

I like to be a student of other people's failures. I like to learn from them so I don't make the same mistakes, and one thing I encourage everyone in this room to do is to read the Senate report on the Equifax breach. I refer to it lovingly as 72 pages of absolute awesomeness. It was about a two-year investigation, and it found, as you would expect, that the majority of the issues associated with the Equifax breach had absolutely nothing to do with the technology per se - they were really a people-and-process breakdown. Off the top of my head, I believe there were 8,500 unpatched vulnerabilities within their software infrastructure. CVE information was disseminated via a 400-person distribution list, where all new CVEs would just kind of go in at the top - and given the volume of CVEs we see, last year there were a little over 17,000 new ones. If you're getting 17,000 emails from a distribution list and you're anything like me, you've built a mail rule that says "ignore most of this." It turns out the owner of the system that was the actual breach vector wasn't on that 400-person mailing list, and when he asked about it, his boss said: no, you really don't want to be on that list, trust me. And so there was a long gap. One of the interesting things from an identification and remediation perspective is that it actually took Equifax less than 90 days to identify and contain the breach. So against that 245-day number, Equifax did way better than the average.

Now, when we look at how data is managed, there are a few truisms, and rule number one is: you can't possibly secure data you don't know you're processing. That requires a couple of definitions, and there are a few notes in here about the legal nuances - again, I don't play a lawyer on TV, I just read stuff. One of the questions I actually get asked is: why don't we have a version of GDPR in the US?
Why do we have a patchwork of each state having its own version of what data protection really means? I like to point out that we do have some national standards around this. Healthcare information, for example, under HIPAA: anyone who's ever gone to a doctor, hospital, or clinician, or had blood drawn, has gotten the litany of papers that say here's what's going to happen. Most people don't read them, because they wouldn't understand half of what's in there anyway, but those rules basically define what data means, and here's their definition. Normal people tend to define data as any information about myself which is provided to somebody else in order to get some kind of service - that's data. Within my development team, because we're from the Northeast, we modify things a little bit and pronounce it our own way, so if you hear me say it differently, that's where that's coming from.

From a data privacy perspective, we need a definition as well, and ISO very nicely defined it, back in 2009 and again in 2011, as a set of shared values governing the privacy protection - so when you're defining privacy, you're using the word privacy in your definition. So, yeah. There is no common legal definition across the board. Using the Canadian example, privacy is not actually defined, but there's a set of protections around the right to life, liberty, and security of the person, and in the US we have similar things in some of the constitutional amendments that give us the beginnings of what privacy might look like, but they're not necessarily defined within a legal construct. Now, of course, if I'm a normal person, I say privacy is an expectation I have that whatever data I just gave you is going to be properly protected. So what are you going to do about that?

I'm going to give an example - I can tell the whole story behind this, but I'm just going to highlight the little yellow bits. We've all gotten emails intended for somebody else that we didn't expect to receive, and I got one from Wells Fargo. The person opened an account and typoed the email address, and it says: you requested an update to your email, this link will expire in 90 days. I ignored that, figured there must be something behind it, whatever. A couple of days later I got: oh, wait a minute, you've added this person as a Zelle recipient, here's their phone number, and they're now going to receive this. Oh, okay, maybe I should try and do something. Then a day later: wait a minute, here are all these transactions. So I just received a whole bunch of information that was sent to me on the assumption that that email address was legit. Try calling them and explaining the situation and seeing if they can resolve it. That's a fun conversation.

Another variation on this: Plum Benefits is a company that does benefits management for smaller organizations. In this instance, one of their users decided they were going to book a hotel - in Baltimore, by the airport, on these dates, here's their name. So what do I have here that I can use to go and impersonate them? What do I now know about them that could facilitate a different type of attack? This was a one-off; it turns out it was legit, and they were surprised that I received it. Try having that conversation - that it wasn't me who mistyped my own email address.

Truism number two: if your users don't know what you're doing with their data, you increase the reputational risk to your organization if something goes wrong. And this gets us to consent. And consent is a hugely tricky construct.
GDPR defines it as all of that, and I'm not going to read it for you - the slides are up, though actually I think I need to push an update, since I did a couple of tweaks in here. But it basically boils down to: before I consent to anything, I need to know what you're doing with my data. Who are you going to share it with? How long is it going to be around? And, oh, by the way, I can change my mind later. Just because I consented right now doesn't mean that consent is in perpetuity.

Similarly, trust is a really complicated concept, and there's no legal framework around it. I went back and forth with our internal attorneys about what we could even call this from a legal perspective, and couldn't come up with anything. The PR industry defines reputational risk as the risk of loss resulting from damage to an organization's reputation. The AMA defines it, from a consumer perspective, as: brand equity is based on consumer attitudes about positive brand attributes and favorable consequences of brand use. Would you trust a brand that just leaked 100 million, or a billion, records? They're in the news every day. People are growing numb to these things. The common definition? I don't entrust my data to just anyone. If the provider is one whose brand I trust, or one that solves a real problem for me, I'm more likely to give them data - so it's not an absolute. But again, I reserve the right to change my mind.

In the Black Hat version of this, it was a top ten, but we're open-sourcey, so I can go a little more detailed here. Top ten questions. Does the person you're collecting the data from know it's being collected? From an open source perspective this is huge, because the components could be used anywhere, in any context. Is there clarity around why the data is being collected? If it's being sent to a third party, is the user clear on who that might be? Because if that third party gets breached, I need a way to know that I should maybe be doing something different. Is there an opt-in or opt-out context around this? Under GDPR, opt-out isn't consent; opt-in is. Who internally could have access to the data? Maybe I'm logging every attribute that comes in - that was kind of the MoviePass example from yesterday, where a whole bunch of Splunk data was up in an unsecured S3 bucket and it included all of your card data. Oops. Does it require any specialized processing - encryption, hashing? In New York State there's a draft bill called the SHIELD Act, and it says all sensitive data must be encrypted. It doesn't specify how, or what "sensitive" might be - so I figure MD4 is probably good enough. How would you know if somebody accessed the data in the first place? Have you informed the consumer how long you're going to have it for? I can't remember the name of the organization now, but three weeks ago there was an organization that had recordings of customer support calls going back to 2015. Why? Oh, for the purposes of training. The person on the call probably isn't an employee there anymore, because it's a call center - so what's the purpose? If regulations allow the user to delete or correct the data, what's the process, and how would the user know? What data is being transferred as part of a phone-home mechanism? We love phone-home mechanisms in tech these days - do people know that's happening? And if a web service is used, how is that transmitted data handled? Where does it go downstream?
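Since those questions map pretty directly onto data you can actually record, here's a minimal sketch of what capturing consent alongside collected data might look like. Everything in it - the field names, the retention window, the example values - is an illustrative assumption on my part, not a schema mandated by GDPR or any other regulation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical consent record: every field here is an illustrative assumption,
# not a prescribed schema from GDPR, CCPA, or anything else.
@dataclass
class ConsentRecord:
    subject_id: str                          # pseudonymous ID for the person giving consent
    purpose: str                             # why the data is being collected
    third_parties: List[str]                 # who it may be shared with
    opt_in: bool                             # under GDPR, opt-out is not consent
    granted_at: datetime
    retention: timedelta                     # how long we said we'd keep the data
    withdrawn_at: Optional[datetime] = None  # the user can change their mind later

    def is_valid(self, now: datetime) -> bool:
        """Consent counts only if it was an explicit opt-in, hasn't been
        withdrawn, and the stated retention window hasn't expired."""
        if not self.opt_in or self.withdrawn_at is not None:
            return False
        return now <= self.granted_at + self.retention

# Example: a device-telemetry opt-in that expires after one year.
consent = ConsentRecord(
    subject_id="user-1234",
    purpose="device telemetry for reliability metrics",
    third_parties=["example-analytics-provider"],
    opt_in=True,
    granted_at=datetime(2019, 9, 1),
    retention=timedelta(days=365),
)
print(consent.is_valid(datetime(2019, 12, 1)))  # True
```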
These are really, really key questions that everyone can be asking about every piece of software we're developing.

GDPR had a reputation of, well, we're just going to solve data breach problems. But in January, Google found out it's not quite so simple. The onboarding experience for a new Android device was brought to the French regulator, CNIL, because it included a perceived obligation to go and create a Google account, plus a bunch of other things that were framed as "hey, you can opt out." Google found, much to its great dismay, that despite having a European presence in Ireland, no data processing or data security decisions were actually being made in that location - so the French regulators had jurisdiction. French regulators annoyingly prefer to do things in French, and when you're an American company, French law doesn't always make the same amount of sense as you would expect. So Google complained. But in the end, Google has now assigned what's known as a data protection officer to its Dublin operations, and there was an entire reorganization in December around how those operations were going to work.

Fast forward to two weeks ago, and German regulators in Hamburg decided it would be a fantastically brilliant idea to invoke Article 66 of GDPR, which says there's an egregious issue going on and we need to stop it now - the moral equivalent of an injunction. They were upset that Google Assistant recordings were having some portion of the data reviewed by humans, and that was never disclosed. So they said: we're going to shut all this down. As it turns out, Apple was doing the same, and they shut it down; Amazon was doing the same, and they shut it down - or at least deferred it. It's a 90-day period where they've got to figure out what the right answer is. But that's a change in application and product behavior, and it could have serious implications for what the future might look like for data.

Number three: when a data incident occurs, the only data that can ever possibly be exfiltrated - that is, sent out and stolen - is the data you retained in the first place. So if you don't have a need for the data - say there's no regulatory requirement to keep it for longer than a couple of days or a couple of weeks - why are you keeping it? That's just an open invitation to something getting out there that shouldn't; I'll show a small sketch of what a retention rule can look like in a moment. And it can have serious reputational implications when, say, the company you're working for happens to get bought, as Marriott found out after acquiring Starwood and discovering there was a whole lot of passport data that had been being breached over a multi-year period starting in 2014. So you can have goodwill impairments and SEC filing requirements. You can have customer churn - a whole lot of Marriott customers said, yeah, you know what, no thank you, I'm going someplace else. There's the potential for bankruptcy reorganization: AMCA, the company behind the Quest Diagnostics breach, lost the customers behind roughly 90% of their revenue in the span of about a week, filed for Chapter 11 bankruptcy protection, and has two customers left, and they're trying to figure out what this actually means. Obviously there are regulatory fines, potential impact on the supply chain, and increased cost of customer acquisition - do I really want to trust this brand? What does it mean to reacquire customers? And we see that throughout brand management, independent of cybersecurity issues.
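Picking up that retention point - if there's no requirement to keep it, don't - here's a minimal sketch of what a scheduled purge might look like. The support_calls table name and the 30-day window are made-up illustrations; the real window has to come from your regulatory and business requirements.

```python
import sqlite3
from datetime import datetime, timedelta

# Illustrative retention window - in practice this comes from your regulatory
# and business requirements, not a hard-coded constant.
RETENTION = timedelta(days=30)

def purge_expired(conn: sqlite3.Connection, now: datetime) -> int:
    """Delete records older than the retention window from a hypothetical
    'support_calls' table. Data that was never retained can't be exfiltrated."""
    cutoff = (now - RETENTION).isoformat()
    cur = conn.execute("DELETE FROM support_calls WHERE recorded_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount

# Minimal demonstration with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE support_calls (id INTEGER PRIMARY KEY, recorded_at TEXT)")
conn.execute("INSERT INTO support_calls (recorded_at) VALUES ('2015-06-01T00:00:00')")
conn.execute("INSERT INTO support_calls (recorded_at) VALUES ('2019-09-01T00:00:00')")
print(purge_expired(conn, datetime(2019, 9, 10)))  # 1 - the 2015 recording is gone
```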
Number four: as applications evolve, the original decisions around data collection become opaque. The moral equivalent of this is: woohoo, I've got data - what can I do with this data? As long as I've got it, I'm going to find a way to do something interesting with it, and it's probably not something the people who gave it to you originally thought about.

That gets us to the idea of anonymizing data. In July, Imperial College London put out a report where they were able to take some anonymized data sets and basically say: yeah, you know what, this person who suffers from this illness, who happens to be receiving care and drives to this location in a Vauxhall Astra - well, here's their name - because they could marry all of that data together and figure out who drove a Vauxhall Astra and lived in that location. A couple of days ago in Australia, the Office of the Victorian Information Commissioner looked at a few million records that had been released from the Myki travel system - the smart card used to get onto trams and buses and so forth. The released information was anonymized: I'm not going to include the card number, I'm going to put a UUID behind it, and now I can connect the dots and say, well, this person got on here and got off there. So what's the average trip duration? There's lots of research value in this type of information - up until the point where somebody says, you know, it looks like these people are always getting on here and getting off there, and it looks like on the way home they do this. So isn't that the minister of something in the Australian government? Aren't these people so-and-so? We don't know with 100% certainty, but still. This is what can happen when anonymized data gets married with other publicly accessible information - say, an overlay on a Google map - and suddenly you can pin things and ask what's going on. So when releasing data sets - and personally I'm in favor of any kind of research we can do on this - we need to be very, very cognizant of the fact that the data is going to be married to something else eventually. Maybe not today, maybe not next week, but eventually.

Number five: given access to data, people will find a way to use it, and potentially misuse it. There are lots of examples of this, but the real challenge is that today an application is really a mashup of things, and web service APIs really change the picture. I fly Delta Air Lines a lot. On board, they have the ability to sell you stuff, and you swipe your card. Behind that was a company called [24]7.ai, which was breached a year and a half ago. I had never heard of them up until that point, but I got a nice little letter from Delta saying, I'm sorry, but your credit card appears to have been used as part of the... it's like, what? So the supply chain of web services becomes a really interesting scenario, and you have to ask these kinds of hard questions.

And managing consent can be really complicated. You've got voice assistants. I set up my voice assistant; I give consent. My girlfriend's in the room - what was her consent to potentially being recorded if the thing is always on? If there's a software update and I perform the update, but my colleague originally gave consent, is that a legitimate scenario? If I'm on a mobile device and there's a jurisdiction associated with my consent, is the jurisdiction my mobile device was connecting from the legitimate one?
What if I'm on Wi-Fi versus 4G or 5G? What if I change my mind? How do I identify myself in a way that will allow me to find out what you've got? If I'm an innocent third party and there's an Alexa on the table, how do I identify myself to the vendor to say, I want all the data associated with me? These are hard problems that we still haven't solved today, which is why we need to understand, when we're collecting data, why we're collecting it, and make certain we don't keep it any longer than we need to.

So I'm going to show how this could evolve in a very simple scenario. I want to design a nice new shiny IoT device, because that's kind of all the rage. That statement has a lot of implications. I have an IoT device: a hunk of hardware, going to be bolted to the wall, needs to be cheap, do exactly what I want it to do - and did I mention cheap? So I'm going to choose whatever I can to get the cost down. Now it has to be configured, so I'm probably going to have a mobile app, I'm going to scan a QR code, and there's going to be a Bluetooth Low Energy connection to the device, which will then inherit my Wi-Fi connection. The device probably needs to communicate with some centralized service, so maybe there's a TLS stack, maybe there's a protocol on it. So now I've got some constraints around this device: maybe a TLS stack, definitely a Wi-Fi stack, a Bluetooth stack, TCP involved. That's a lot of overhead, if you've ever looked at those protocols, for this little cheap device to handle. What are the concessions that are going to happen?

Okay, so I've got my MQTT broker - an MQTT broker in this example - I've got my analysis engine, I've got a database, and I'm probably going to microservice and containerize the thing as best I can. I want a web UI, so I've got some HTML5, probably all done up in Node or React, and obviously I've got a mobile interface as well. All of these things up in the cloud I can update in a DevOps-y way very easily, but my IoT device needs some mechanism too, so I'm going to have over-the-air updates. I've now put all of these constraints on the system, and they have design implications, security implications, and privacy implications.

So from the outset, I need to make certain I'm setting the platform requirements. From a design-goal perspective, that means selecting a toolchain that is going to be the best for this particular device. I may have different CPUs, different memory configurations, different interface chips I can use. This is the perfect time to build one of each: get a reference board, and then fuzz those protocol stacks to see which is the most stable and is going to give me the most room for my application. That room for my application is going to let me better manage the privacy of what's in there, and device instability can itself be an entry point. We've seen example after example of devices that were unstable, where that instability becomes the ultimate attack vector into some organization, home or business. Similarly, the development frameworks are going to have a role to play, because a lot of architectures with very distributed processing are going to mean some level of data transfer. So how am I securing the data in flight? How am I securing it at the other end?
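For the "securing data in flight" piece of that IoT sketch, here's roughly what enforcing TLS on the device-to-broker MQTT connection could look like, assuming the device side runs Python with the paho-mqtt 1.x client. The broker hostname, port, topic, and certificate path are all made-up illustration values, not anything from the talk.

```python
import json
import paho.mqtt.client as mqtt

# All hostnames, ports, topic names, and certificate paths below are
# illustrative assumptions.
BROKER = "broker.example.com"
PORT = 8883  # MQTT over TLS, rather than plaintext 1883

client = mqtt.Client(client_id="shiny-iot-device-001")

# Verify the broker's certificate against a pinned CA; add certfile/keyfile
# here if the broker requires mutual TLS for device authentication.
client.tls_set(ca_certs="ca.pem")

client.connect(BROKER, PORT)

# Send only what the service actually needs - no raw identifiers "just in case".
payload = json.dumps({"temperature_c": 21.5, "firmware": "1.0.3"})
client.publish("devices/shiny-iot-device-001/telemetry", payload, qos=1)

client.disconnect()
```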
If I've gone to a cloud-based service, how much of the infrastructure ownership I'm transferring to that service is actually something I need to worry about from a regulatory perspective? If I'm taking credit card information, is the underlying infrastructure something that can handle credit card information under PCI guidelines? These are the kinds of questions to be asking on the architecture side of the decision-making process.

Obviously, as I'm doing my development, I want to make certain that continuous assessments are happening, because if my developers don't have the right level of security training, they probably also don't have the right level of data management training to ask the right questions. So what can we do to up-level the skill set within the development teams themselves, so they can start to ask those hard questions and be part of the solution? Similarly, when we're building all of this - I'm going to assume everybody's using some form of CI at this point - what is the test coverage? Can I actually identify the data flow within the application, to the degree where I can say: this was in encrypted form at this point, but everything before here is tainted and should be unavailable to me? (I'll show a tiny sketch of that idea in a minute.) And can I have some form of centralized process where I can ask: how good are we getting? We all know that version one is probably not as secure as we would want it to be, but as long as we're capturing what versions 1.1, 1.2, and so forth look like, and we're moving in the right direction, we're baking security in from the outset, not trying to bolt it on later. We're asking the hard questions when they're inexpensive to ask, at the outset.

And then lastly, as we release this - and personally, from my perspective, everything before the time we actually release software is an academic exercise; it really doesn't matter how secure or insecure things are up until the point we ship - we need to make certain that at the point we ship, we understand what the governance rules are wherever this device is going to be shipped. There might be, say, six months of development before that V1 of our nice shiny IoT device can be out in the world, and somebody might have come up with a new regulation in the meantime, and now we have to go and ask ourselves some hard questions.

And that pretty much gets me to the key takeaways. From an open source perspective, I want to look at the contributor side of the equation. We want to ensure that everyone contributing code that touches data, in any project, is able to question why that data was collected in the first place, why that data is part of that workflow, and is able to communicate that out - because that component is going to end up in some place you didn't expect it to be, and I can say with all honesty, it will live for a lot longer than you expected it to live. One part of our business actually goes and looks at commercial software; we found a vulnerability in a version of FreeBSD from before FreeBSD was really a thing, and it was still in production last year, in 2018. So we want to make certain reviewers can identify what sensitive data is involved and where things are going. We want to make certain that we're disclosing - in our readme, our docs, our wiki, whatever - what our expectations around data processing and data collection are.
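Here's the tiny sketch of that data-flow idea: tagging sensitive values so a test, or a runtime guard, can catch them reaching a sink like logging in raw form. It's purely illustrative - real taint tracking comes from dedicated analysis tooling, and the class and function names here are mine.

```python
class Sensitive(str):
    """Marks values that must not leave the process unprotected. Purely
    illustrative - real data-flow analysis comes from dedicated tooling,
    not a wrapper class."""
    pass

def log_event(message: str) -> None:
    # A CI test (or this runtime guard) can assert that tainted values
    # never reach a sink such as logging in their raw form.
    if isinstance(message, Sensitive):
        raise ValueError("refusing to log sensitive data in plain text")
    print(message)

def redact(value: Sensitive) -> str:
    # Once redacted (or encrypted/hashed), the value is no longer tainted,
    # because slicing a str subclass returns a plain str.
    return "[redacted:" + value[:2] + "...]"

card_number = Sensitive("4111111111111111")
log_event(redact(card_number))   # fine: prints "[redacted:41...]"
try:
    log_event(card_number)       # raw sensitive value at a sink - rejected
except ValueError as err:
    print(err)
```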
And we need to move away from relying on self-documenting code, because the people who are going to make those kinds of determinations are not the people who are going to be able to read the code. If we assume they can, it's kind of game over from that point. We need to make certain that governance is a cooperative action between project leadership and the development teams. No more merging on a "looks good to me" or a "+1" if there's data involved. Ask the questions. What tests were performed? Do you know whether or not this is going to be encrypted correctly? Is the default plain text, or is the default something that is securely hashed by today's standards? If we need to change our hashing rules or our encryption rules, why are we making those changes? Are we making sure we're actually using a library and not rolling our own? We can't assume the component is, say, only going to live in the US, because the US is a patchwork of regulations. We can't assume the component is only going to live in Europe, because there's still a patchwork of regulations there too. One of our customers in financial services described it best: his application has to adhere to 378 separate global regulations. How can we make certain that the open source components we contribute to, and that we love, can actually satisfy that kind of environment? Those are the types of hard questions we should be asking, and we should be documenting all of those decisions, so that when somebody in, say, a highly regulated financial services world decides to consume something, they at least know that you thought about these problems. They might not agree with your assessment, but you thought about the problem beforehand, and now a conversation can happen.

And by definition, we're going to have legacy stuff out there. So we need a process where the project defines a policy around how it's going to manage data governance. We all have a CONTRIBUTING.md; we need some variation on that which talks about data collection and data processing. We need to make certain we're reviewing our default configurations, because the version 1.0 defaults might be insecure by today's standards, and it might be a simple case of tweaking those defaults to make things a lot better - knowing that somebody is probably going to trip over version 1.0 and say, woohoo, that's awesome, I really want that version, not realizing there is in fact a version two. And fundamentally, start thinking about the software lifecycle. In commercial software we'll have: this is an alpha, this is a beta, this is a release candidate, we've released, here's some lifespan, some updates - oh, wait a minute, we're going end of sale, end of maintenance, end of life. There's a lot of documentation around all of that, so customers know about these things. But if it's an open source project, most of the time it just gets abandoned on GitHub, and you see: oh, wait a minute, there hasn't been an update in three years. Is that because it works and there's not really anything more to do? Or is it because the person who was originally maintaining it decided there was something else cool to do and went off and did that, or life happened and now they're playing with their kids instead, or something? There's no way of knowing. So when you decide that, you know what, I'm done with this, update the readme, if nothing else, to say: I'm done with this. This is the last version.
If you fork this, it's kind of on you to go and do stuff, but I am done with it. And at that point, you don't end up with a whole bunch of issues where people are saying, why haven't you processed my pull request? I mean, seriously, I thought you liked me. You don't have those kinds of scenarios.

And so that's it for my talk. I've got a whole bunch of references for all of those lovely regulations up there. Thank you, everyone. I think we have about two minutes for questions, if there are any. Yes?

[Audience question]

So I think it would be unfair to say we have a solution per se, as much as this is a paradigm we should all be working towards. If we're looking at it from the perspective of consuming those types of devices, there's not a whole lot we can do short of starting to question the providers of those services, or the manufacturers of those devices, and saying: what do you have on me? How are you collecting it? What have you done with it? Give me all of the details. Regulations like GDPR have that obligation associated with them; not all regulations do. And even under GDPR there's no standardized process, and no standardized return format for the data, so you don't necessarily have 100% confidence that what you get back is going to be usable to you. Or maybe for those of us in this room it might be usable, but not to the lay person. Excellent question, though. Any others? Yes?

[Audience question]

So I have a couple of theories, and one of the simplest is that in a healthcare environment, the longer time to identify and contain is fundamentally a realization that doctors aren't tech people. As long as it's kind of working, that's good enough for them, because they'd much rather go and fix your leg, or whatever else is ailing you, than worry about patching things. I think that's fundamentally what it is. And now you effectively have an attack scenario where the malicious actors know there's a lot of value in healthcare information and the ability to secure it is not as strong at the local level. So hopefully we can do better, but it's through talks like this - shining a little bit of a light and having people start asking the questions - that we get there. And I believe I'm done for time. So thank you ever so much, everyone.