Before we talk about encryption, I want to think back about 15 years. Harry Potter and the Goblet of Fire was the big movie that year, so it's a while ago, but it's not a crazy amount of time ago. Most applications stored their passwords in plain text, crypt or MD5. Openwall released phpass to combat this, and some of the big projects picked it up at the time. The biggest answer on Stack Overflow for how to securely store passwords was this answer right here: MD5 is still safe, according to this. It hasn't aged particularly well, and the idea of today is to try and give advice that ages well, so we don't look back and think that was a terrible thing to do.
So, for people who didn't take that advice, these sites have subsequently been breached. The number in brackets is the number of passwords successfully recovered using this hashing method. With SHA-1, MySpace, LinkedIn and Yahoo pretty much got all of their passwords recovered, even though they were using SHA-1. MD5, again, similar sort of numbers: Purdue, Last.fm, eHarmony, almost all of the passwords got recovered. The ones that didn't get recovered were the ones that are actually quite secure, so if people were using 35-character passwords from their password manager, they're the ones that didn't get recovered. If you were using something like 16 characters, yours would have been recovered. And probably the most famous is the plain text one, RockYou, which is now included in Kali Linux by default, so that's a word list that you can just use and iterate through. So yeah, these are people who didn't use best practices even though they were available at the time.
So, at the time, even back when the oldest of these was breached, bcrypt was available; it came out in 1999. These are sites that used bcrypt, and how much of their database got reversed when it leaked. So Dropbox got leaked, and almost 22% of their users got their plain text passwords exposed.
SHA-256, which came out in 2001. Again, it was available at the time all of the other sites got breached, but no one used it, and it's similar numbers here, so in the low percentages and then about 35. Argon2: if you were storing passwords today and you were looking for a scheme, Argon2 is probably the one you should go with, but there's not enough information about leaks for that, because no one's had their database leaked.
The variability, so between the low ones, like the 3 and 1%, and the 21 and 34, is down to password reuse. People have used the old breaches where they've already reversed the passwords, and then just tried those same passwords with the same credentials, and it's just worked on the next one. So even though these schemes are much, much more secure, the higher numbers are purely down to the other sites that haven't done so well with securing their passwords.
So we think about the value of the data. If your database does leak, you think about the value of it, and the obvious ones are credit cards. If plain text credit card info gets leaked, it's quite obvious how people can use that for financial gain. Merchant services are actually cracking down on credit cards, so it's much harder to use credit cards as people would have used them 10 years ago, because you can't just use thousands of credit cards anymore; your merchant service will stop that.
Passwords are used more for logging into other services and trying to extract money from those. If you log into Amazon with someone's password, you used to be able to just change the shipping address and ship it to yourself, and using that password you get money out of it. The same thing happens with email addresses: people log into your email with the password and send a spam email to all of your friends. Because it comes from you, it's more trustworthy, so they sell their product, or however it is that they're getting their money out of it.
PII: this is quite a new one in how it's being used. People don't often think that personally identifiable information leaking is much of an issue, but I've got some examples which really show the value of PII being leaked.
So TalkTalk, does everyone know TalkTalk? It got leaked a few years ago, and it's probably one of the most talked about database leaks in UK history. There were two trust-based scams based on this in the early days. The first one, when it wasn't really publicly known, was that people would ring the customers from the database, because they'd have the phone number, the account number, all of their details. They'd ring the person and say they hadn't paid for their internet this month, and they needed to pay £30 now or risk being cut off. To prove that they were definitely from TalkTalk, they would give you your TalkTalk customer ID, because how else would they know that? So people did believe that their internet was going to be cut off, and paid the £30 over the phone. Lots and lots of £30s later, people lost a lot of money to it.
The second one was when the breach was known about. They would ring up the same people, or the same group of people, and say, sorry, I'm from TalkTalk, we've lost all your data, we're going to give you £100 compensation, and to do that we just need your bank details. That ends as predictably as it seems. People would give over their bank details trying to get this £100, and they don't get the £100, long story short.
The IRS is the American tax system. The tax system in America works slightly differently to the UK. In America you get one tax bill every year. Because you only get one bill, people pay monthly amounts through the year to try and estimate where they are. A lot of people aim to overpay, so they get a refund when the tax year ends rather than having a tax bill at the end of the year. And then people have to log into the IRS system to do this.
And to log into the IRS system, they've got a system where you tell them where you worked in a certain year, and what your address was in a certain year. They ask you questions that only you should know about yourself. That clearly isn't true when lots of companies are leaking your data. So attackers would log in the day the tax window opened to submit your tax return for you, say that you'd way overpaid the tax, and get the big check. Then they'd cash that early on. You've got another month or so before you have to submit your tax return, so you log on legitimately to do that, and by that time you realise that someone's already submitted your tax return for you. I think there was half a billion dollars lost by this, just by sending checks to people that they thought were other people. And again, that wouldn't have happened if everyone's personal data wasn't already out there.
Emails with passwords and phone numbers. This is quite new really; it started last year. I imagine everyone has received these emails. You'd get the email saying: here's your password, I've hacked all of your accounts, I've seen you've been on some interesting websites and I've got videos of that, and unless you pay this bitcoin we're going to release it to all your friends. They didn't really have anything, but because they had your password, and sometimes your phone number, on there, it made people believe that they genuinely were in their system. And again, that's just another use of personal data.
And competition wins. This is similar to the TalkTalk one. I received one of these in December last year. I got a text message. My data had obviously leaked. It knew where I was: the supermarket closest to me, the ASDA closest to me, said that I'd won their raffle this year just by buying something from ASDA.
And to claim it I needed to click on this link, fill in my details, and they'd send me the money. And it's the same thing as before: they just want your bank details. But it makes it look real by giving the real branch that you were near. Most people shop in their closest supermarket. The supermarket locations are all public information; you can log on and find out where all the supermarkets are. You know the address for the person because you've got their postcode, and you can just find out which one's nearest. So again, it's another use of personal data, where people don't realise that they're giving more information away and then lose the money.
And breaches really are getting bigger. Marriott lost 500 million records, which is huge. If Marriott was a continent, this is where it would sit in the continents by population. It really, really is a huge leak, that one, and it had lots and lots of info in it. Kim Zetter is a journalist. The tweet is saying that not encrypting passport numbers should qualify as criminal negligence. It probably should, really. You should be encrypting all of your data. If it's personal to a user and can be used in other ways, you should be trying to encrypt it.
This one actually happened yesterday. Someone found an unsecured database in China that was doing real-time tracking of people. It uses facial recognition, and because there's no authentication on it, you can pretty much log into this database from anywhere and find out exactly where people were, along with loads of their information. It really is quite scary how much data gets leaked. Because ship it now and secure it later, right?
And it's not just hackers that do bad stuff with information. Morrisons: one of the senior auditors leaked all of their staff data. I think they're now in prison for it, but there's still a huge piece of data that's now out there because someone leaked it.
Heathrow Airport: someone lost a memory stick which included all the travel details of people, including the Queen's travel itinerary. Again, it's crazy how often people will just lose data with no encryption on it. A third of councils, so imagine all of the data that your council has on you, it's huge, have reported lost or stolen data in the past five years.
A UK head teacher, I'm not sure how many people heard about this. There was a head teacher who moved schools, took all of their old school's data with them to the new school, put it all on the network, and then started using that data badly, really. He then got sacked for it, but a head teacher is sort of the equivalent of a CEO in most companies. So it's even senior people in companies that you can't really trust with data.
So does anyone know who this is? Troy Hunt, right? He runs Have I Been Pwned. He tends to send two types of emails. The first one is that your data has been leaked by some other company. The second one is slightly more scary: he's trying to verify a data breach. If you receive one of these, it's going to be a really bad day at work. He's received a dump of your data and he's just trying to verify that it actually is your data.
So security is hard. Schneier's law is that any person can invent a security system so clever that she or he cannot think how to break it, but that doesn't mean other people can't think how to break it.
And GDPR is a lot more focused on customer data. They suggest that all of your data is encrypted, and they even say that the loss of an encrypted storage medium which holds personal data is not necessarily considered a data breach. So if all these people before had had this data encrypted, losing it would not necessarily have been considered a data breach.
And if you do use encryption, then the authorities must positively consider the use of encryption in deciding whether a fine is imposed, or what amount of fine is imposed. So if you're looking for ways to tell people that you need to have encryption, this is a great one, because it reduces the liability of your company if your data is encrypted properly. Oh, and you shouldn't underestimate the importance of good key management. We'll get onto a bit more of this later, but if one of your servers gets breached, you lose either your key or your data, and then you should really rotate that key. We'll show you how that happens later on.
So what can we do to fix it? Dropbox had their passwords leaked, and now they have probably one of the best password storage methods. The password is in the centre of it; they then SHA-512 it, then bcrypt it, then encrypt the whole thing. Going through layer by layer, the SHA-512 is to make the string a consistent length and use every bit of the string. Bcrypt only uses the first 72 characters, so with the SHA-512 the whole string gets used; the password can be longer than 72 characters and the input is still consistent. With bcrypt, longer passwords take longer to hash and shorter ones take a shorter time. By doing this, you kind of get around all that and it's much more consistent. And then on top of all of that is the AES-256 encryption, and the point of that is, if your database leaks, people haven't got the hashes anymore, they've got data that they can't use. So it's just adding layers to security, and all good security is like an ogre: it has layers. Ogres have layers, security has layers.
So what can we do to add layers to our database for securing it? Firewall the database off. Surprisingly, people don't do this as often as they should. It should be the first thing you do: your database shouldn't be accessible by anyone but your applications.
Clearly that's what happened with the Chinese database: it all leaked because it was accessible from the outside. Have only the trusted clients on your network connect to it. Not even your entire network should be able to connect to it, just the ones that you trust to connect to it. And then, with those applications, minimum access. If you only need to select from a few tables, only give the permissions for those few tables. Full disk encryption, it's okay. I don't really recommend doing it for servers. There's a slide on that later on.
But what more can you do? Looking at this code, this is used pretty much everywhere, really. You set up a connection to the database, you query it, you get all your data from the users table, and then you iterate over it. If this was an API, everyone would be saying that there's no encryption there, HTTPS isn't used, but that's never really raised for databases and nobody considers it. It's actually quite easy to use encryption with MySQL. You just tell MySQL to verify the certificate and give it the path to the certificate. This exists for PDO as well. It should be the first thing that you do, because if someone does breach a part of your network, they can sniff the network traffic to the database, which is all plain, and all of your customer data is going over this plain traffic. If it's encrypted, they can't sniff it anymore.
So encryption at rest. This was mentioned before: disk level encryption. The machine just can't start without a password. It's clunky for starting servers, because you need to be there to type your password in. It's much better for devices which don't need five nines of availability. My laptop has full disk encryption, because if I take it somewhere and it's out of battery, people can't use it. But my servers are on all of the time, so it's just a slight slowdown, really. I wouldn't really recommend using full disk encryption unless you're required to for audit reasons.
Your backups: when you've done your backups, you don't just keep them on the same server, you send them somewhere else. Your backup's now plain text somewhere else. And your binlogs: until MySQL 8.0.14, I think, your binlogs weren't able to be encrypted either, so for any other MySQL servers that you have, when they were trying to keep up to date with your application, they weren't encrypted. So again, that traffic can be sniffed.
So what can we do about it? The answer really, I think, is application level encryption. Now your application knows the keys and your database is just storing data. Your database has got no idea what any of this data is; it's just storing chunks of data. It encourages the one application, one key approach, so only one application can talk to that data. It still scales out, so you can still auto-scale, you can still do all of that. It's a little bit more difficult to implement, because you've now got to use your application a lot more for stuff that it wouldn't have been used for previously, and we've got some examples of that in a minute.
Any security system has the CIA triad. That means confidentiality: is the data you hold secret, or can everyone see it? Integrity: can you trust the data? How do you know it's not been tampered with? And availability: is the data available?
So confidentiality. What access controls are there? Who can see it? Does that data have any protection?
Integrity: if someone changed it, how would you know? I find integrity quite scary. If someone's inserting records or changing your records, how would you ever know that? If you restore a backup, are you certain it's a match? Are you sure your replicas are a complete match? Could someone have changed one of them and introduced something odd in your application that gives them a benefit somehow? How would you ever know that? That's the scary question with integrity.
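One way to start answering the "are you sure your replicas are a complete match?" question is to compare a keyed digest of the rows on each side. The following is a minimal sketch in Python, not something shown in the talk: the row layout, the key and the `table_digest` helper are all hypothetical, and a real check would stream rows from the database rather than hold them in lists.

```python
import hashlib
import hmac


def table_digest(rows, key: bytes) -> str:
    """Return a keyed digest over all rows, independent of row order."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    # Sort on the first column (assumed to be the primary key) so that
    # two copies holding the same rows in a different order still match.
    for row in sorted(rows, key=lambda r: r[0]):
        mac.update(repr(row).encode("utf-8"))
        mac.update(b"\x00")  # separator between rows
    return mac.hexdigest()


# Hypothetical data: (id, name, department)
primary = [(1, "scott", "sales"), (2, "sky", "support")]
replica = [(2, "sky", "support"), (1, "scott", "sales")]
tampered = [(1, "scott", "sales"), (2, "sky", "admin")]

key = b"hypothetical-integrity-key"  # assumption: kept outside the database
replica_matches = hmac.compare_digest(
    table_digest(primary, key), table_digest(replica, key)
)
tampered_matches = hmac.compare_digest(
    table_digest(primary, key), table_digest(tampered, key)
)
```

Because the digest is keyed, someone who can edit the rows but does not hold the key cannot recompute a matching value. This only detects a difference; it does not say which copy is the correct one.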
Availability: this is probably the most talked about one. Is the data available to those who need it? DDoS attacks are a common attack on availability; they're trying to remove the availability of the application. Does whatever it is come back in a timely fashion? If you're using a database, can you get your answers back in a timely fashion? And is it up all of the time?
So, looking at how we do the encryption. The easy way, and I've done this a few times: MySQL is quite useful. It comes with this AES_ENCRYPT function, where you pass in the text to encrypt with the key, and it will just encrypt it and you're away. MySQL defaults to AES-128, which NIST says is secure until 2030. It's still indexable, because the same input always gives exactly the same output, so you just index the results of this. So it's still nice and fast to search, but it's not particularly secure.
So here's some data that was encrypted with it. You can see it's a surname, based on the column, and down here is a list of data that got encrypted with it, which, looking at it on its own, you would never be able to get back. But then you start adding dimensions to the data. So first name: you can see that the first name here matches the second line here, and it happens again there and there. My first name is also a common surname, so that could be my first name, and that could be my first name as well. That's a different name that has the same characteristic, where it can be a first name or it can be a surname. Again, if you add another dimension to it, like department, you can start using public information like LinkedIn to try and work out who these people are, and it actually just becomes a big puzzle. People who like programming get that buzz when they get something right, and you get a huge buzz when you start connecting these things together. As always, there's an XKCD for everything.
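The "same in, same out" problem can be illustrated without a database. The sketch below is an assumption-laden stand-in, not the talk's data: it uses a truncated HMAC in place of a deterministic cipher such as `AES_ENCRYPT` in its default mode, and the names and key are made up. The point is only that equal plaintexts produce equal tokens, so repeats and cross-column matches stay visible.

```python
import hashlib
import hmac
from collections import Counter

KEY = b"hypothetical-column-key"


def deterministic_token(value: str) -> str:
    # Stand-in for a deterministic cipher: the same input always gives
    # the same output. This is NOT encryption, just an illustration.
    return hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]


# Hypothetical (first name, surname) rows
people = [
    ("Scott", "Taylor"),
    ("Taylor", "Jones"),
    ("Sam", "Taylor"),
    ("Scott", "Smith"),
]
rows = [(deterministic_token(f), deterministic_token(s)) for f, s in people]

# An attacker who sees only the tokens can still count repeats...
surname_counts = Counter(surname for _, surname in rows)
most_common_token, most_common_count = surname_counts.most_common(1)[0]

# ...and spot values that appear in both columns.
first_tokens = {first for first, _ in rows}
surname_tokens = {surname for _, surname in rows}
cross_column_matches = first_tokens & surname_tokens
```

Nothing here recovers a name directly, but frequency counts plus public information (which surnames are common, which names double as first names) are exactly the puzzle pieces described above.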
When the Adobe passwords leaked, they left the hints as plain text, and it's basically a crossword puzzle. You're trying to avoid that with your data. People shouldn't be able to piece together bits of data to try to work out what the encrypted part of it is.
This is also a famous example. Unencrypted, there's the Tux penguin. Encrypted with ECB, there's also the Tux penguin. People, when they've fed the data through it, would think it's all different now. It is definitely encrypted, but true encryption just looks like noise. It's hard to tell the difference between those two without knowing that you're using ECB. ECB is electronic codebook, and it's quite an old way of doing things. It works on blocks, and that's why the blocks end up like this.
Other issues are your network traffic, your binary logs and your backups. If you're using AES_ENCRYPT, you're still sending all of that to MySQL as plain text data over your network traffic. Sorry, along with the encryption key: both servers need to know it now, the database server and the application server. Your binlogs have all those updates as well, and your backups will have that same sort of information in them. You've not really solved a lot of problems.
You can salt it to make it a bit better, but it's not a proper solution, really. You need to store the salt on each row. It doesn't change over time, so every time you do something, it'll go back to the same result. There's nothing random about it, and weak salts just slow down the guessing attempts, so it makes the crossword puzzle slightly harder for somebody. It doesn't actually stop it. In terms of CIA, we've got the availability, really. The C and the I don't really exist, so it's a no.
What else can we do? If you search for PHP encryption, you get openssl_encrypt. For openssl_encrypt, you get the signature for it. It really does need somebody who's quite interested in cryptography to understand what all of this is.
IV, initialisation vector, who knows what that does? AAD, I have no idea. Tag length. It's all random stuff, and you shouldn't need to know all of this. It should be much easier. The other one that comes up quite a lot is mcrypt, but it's deprecated. It hasn't been maintained for ages. So even though openssl_encrypt and mcrypt are the top results when you search for a lot of these things, please don't use them.
As of PHP 7.2, PHP was the first language to have modern encryption built in, so libsodium is now baked into the core of it. With most things, simplicity is best, but not with encryption; the complexity is hidden by libsodium. Libsodium does a lot behind the scenes. It does a lot of stuff that you won't know it's doing.
So how can we use it? Unfortunately, because it's still quite new, this is the PHP documentation as of now. It doesn't really get you anywhere, and when you look at one of the functions, it just says it's not currently documented. Fortunately, Paragon Initiative have got a book on it, and it is very, very good. I'm not sure why these things aren't in the official documentation, but these are the people that really pushed libsodium into PHP. So I definitely recommend looking there for the user guides.
And this is how you would use it. You'd first create your nonce. A nonce is a number used only once, and it was clear that nobody in the UK was ever consulted about the name. Then, working from the inside out, you set up your secretbox with the message you're encrypting, you pass in the nonce that you've just generated up there, and the key, so the secret key for your application. You then append your nonce to it and base64 the entire thing. And then, for security, you use sodium_memzero, which just zeros out the memory, so people can't even look at the memory of your server and work out what's there, and then you return the ciphertext. So it really is quite easy to use. crypto_secretbox.
It does authentication, and I think it's on the next slide. So it uses ChaCha20 and Poly1305. ChaCha20 provides your encryption, and Poly1305 provides your authentication, so that's like a MAC around the whole thing. If somebody changes one part of it, then the MAC doesn't authenticate, and the ChaCha20 then doesn't even get used. So it really is encryption with integrity. You can do that with OpenSSL, but it's clunky. This is baked in, and it's dead easy to use. It's all handled internally, and you don't have to worry about any of it.
So now your salt's been replaced with a nonce. A nonce is used one time, so if I encrypt the same word twice, it's different each time. It's verified by a MAC. It's still not quite CIA. It's closer, but you haven't got your availability, because everything's now different, so you can't index it. All of your data is now 100% secure, but you've got to do a full table scan and a full decrypt every time you want to use it, which is less than ideal. If you've got very large databases, it takes quite a while. The size of databases we have at Cyc took about six hours to encrypt, and you don't want that on every query you do; that will upset a lot of people.
So, thinking about how to make the indexing faster, you have to think about what computers are good at, and they're good at comparing numbers and basic operations with numbers. And that leads to a bloom filter. A bloom filter has two outcomes. Outcome one is that it's definitely not the data you're looking for, and outcome two is that it could be the data you're looking for. It's exactly like an old style phone book. If you look under S, then Scott could be on the page, or might not be on the page, but if you look under A, Scott is definitely not on the page. It's as simple as that.
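The phone book analogy can be sketched in a few lines. This is an illustrative Python sketch rather than code from the talk; the names and the `first_letter_bucket` helper are made up. It shows the two outcomes: an empty page means "definitely not here", and a non-empty page means "possibly here, check each entry".

```python
from collections import defaultdict


def first_letter_bucket(name: str) -> str:
    """The 'page' of the phone book a name would be filed under."""
    return name[0].upper()


names = ["Scott", "Sky", "Sammy", "Adam", "Alice", "Bob"]

# Build the phone book: one page per starting letter.
pages = defaultdict(list)
for name in names:
    pages[first_letter_bucket(name)].append(name)


def lookup(target: str) -> list:
    # Only the entries on the matching page are candidates.
    candidates = pages.get(first_letter_bucket(target), [])
    # Each candidate still has to be checked against the real value.
    return [name for name in candidates if name == target]


scott_candidates = len(pages["S"])   # entries that have to be checked
scott_result = lookup("Scott")       # "could be there" page, and it is
zed_result = lookup("Zed")           # no Z page: definitely not there
```

The page narrows the search, and the final comparison removes the false positives.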
So, a plain text example. Your bloom filter would just be S, and it would bring back a list of everyone under S. You'd loop through that result set. The one you're looking for is Scott, so for the first one you'd say, yeah, that's it, and you'd carry on looping through it. Sky, no. Sammy, no. So you've got the end result set, and you've had to filter that in PHP, but it's a much smaller result set and you've got the data you're looking for.
But to scale that: with 26 characters, you still need to read one 26th of your database. For 100,000 records, that's nearly 4,000 records. And you do have an information leak: the start letter is an S, and that really does help with trying to crack these things, because you've got information that you already know about the value.
So what could we do to change that? We can use different functions for it. Sodium has something called crypto_shorthash. There's CRC32, which is used for every TCP connection to verify that everything is correct, and xxHash, which is lightning fast; it's used for a lot of video streaming. Truncate these if you need to. Ideally you want collisions in your bloom filter, but you don't want lots of collisions in it. This one here returns a 64-bit number, which is absolutely massive. Most people don't need that. You probably need about 16 to 20 bits of it, depending on your data size. You basically aim at a number that's about half of your data size for the size of your bloom filter.
So, the encrypted example. These are now values that have been through these functions, and it's exactly the same thing as the bloom filter before. So this value here: there are nine of them. These could be the same value, they could be different. Who knows? You've got to loop through every one and try it. So you'd look up your key. It's deterministic each time, so if you do CRC32 of Scott, that could be that number. And then you've got to loop through all of those nine and decrypt each one to verify it's the right value.
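The hashed version of the lookup might be sketched like this in Python. Several things here are assumptions for illustration: keyed BLAKE2b stands in for sodium's `crypto_shorthash` (which Python's standard library does not provide), the key and the 16-bit truncation are arbitrary choices, and `fake_encrypt` / `fake_decrypt` are placeholders that provide no security at all; real code would call the authenticated encryption routine there.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterator, List

INDEX_KEY = b"hypothetical-index-key"  # assumption: separate from the data key
INDEX_BITS = 16                        # within the 16 to 20 bit range mentioned


def bucket_for(value: str, bits: int = INDEX_BITS) -> int:
    """Keyed 64-bit hash truncated to `bits` bits, so collisions are expected."""
    digest = hashlib.blake2b(
        value.lower().encode("utf-8"), key=INDEX_KEY, digest_size=8
    ).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


# Placeholders ONLY: reversing bytes is not encryption.
def fake_encrypt(plaintext: str) -> bytes:
    return plaintext.encode("utf-8")[::-1]


def fake_decrypt(ciphertext: bytes) -> str:
    return ciphertext[::-1].decode("utf-8")


# bucket number -> encrypted rows; stands in for an indexed integer column
table: Dict[int, List[bytes]] = defaultdict(list)


def insert(email: str) -> None:
    table[bucket_for(email)].append(fake_encrypt(email))


def find(email: str) -> Iterator[str]:
    """Lazily yield matches, decrypting only the rows in the candidate bucket."""
    for ciphertext in table.get(bucket_for(email), []):
        candidate = fake_decrypt(ciphertext)
        if candidate.lower() == email.lower():
            yield candidate


for address in ["scott@example.com", "sky@example.com", "sammy@example.com"]:
    insert(address)

found = list(find("scott@example.com"))
missing = list(find("nobody@example.com"))
```

`find` is written as a generator, so rows are decrypted one at a time and the caller can stop as soon as it has what it needs. Fewer bits give more collisions (more rows to decrypt, less leaked about each value); more bits give the opposite trade-off.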
So here are the nine rows in total. You'd have to loop through each one, decrypt it, make sure that the email is what you're looking for, and then return the correct value. If people have used generators in PHP before, this is a fantastic use case for generators. If you haven't used generators, I would definitely suggest looking them up. You basically use the yield keyword and your calling code carries on, and then when it needs more data it goes back and gets more data out of the function. It's a slightly different concept, but they're very, very good in this instance.
So the final result is that you can finally hit the index. Your bloom filters allow you to hit the index, because that gives a much smaller result set. You've got collisions, so it's much safer. Your data is still secure, and it can be taken further if you need to. So you've finally got your CIA: all three are fine, your index is nice and fast again, and all your data is secure.
So how do you move to encryption? The easiest solution is to get the change ready in your application, put your application in maintenance mode, and run a job to loop over your database, encrypt all this data and make your bloom filters. Change the structure and add the indexes. Release that change and then end your maintenance mode. Clearly there's downtime there, so although it's easy, it's not really feasible. People don't really want to have downtime to do these sorts of things.
So how do you deal with that downtime? As a first step, you would add your new columns. However you want to do it: you can add one single column for all of your encrypted data, or multiple columns. You change your application to write both the encrypted and the plain text versions. So at this point in time, your plain text and your encrypted ones are both being written, but only the plain text ones are being read. Then run a backpopulation script to move over all of your old data from your plain text version to your encrypted version.
That takes however long it takes, and then you run a second job to verify it, just to make sure that everything has actually moved as you expected it to. Remember that at this point you're still reading from the plain text, so all of this new stuff isn't being used. When you're happy that everything is there and it's all fine, you change your applications to read from the encrypted columns. Now your plain text ones are still there, but they're not being read from, and you can monitor your performance and make sure that everything is moving over properly.
The next bit is to remove the read permissions from the old columns. The reason you remove the read permissions is that if some part of your application that you've missed suddenly stops working, it's very easy to add the read permissions back in, whereas if you've dropped the column, it's not as easy to add all your data back in. So remove the read permissions from that one column, and then change your applications to write only the encrypted version. Once you're happy that everything is there, you can remove the plain text columns, just drop them, and all of your data should now be fully encrypted. After all that, you can sleep better knowing that your data is fully encrypted, and even if your database does leak, nobody can use the data from there, so all of your customers' data is safe.
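The zero-downtime steps above can be laid out as a small model. This Python sketch is hypothetical throughout: the `UserStore` class, the `email` / `email_enc` column names and the phase names are invented for illustration, rows live in a dictionary instead of a database, and `fake_encrypt` / `fake_decrypt` are placeholders that do not encrypt anything.

```python
from enum import Enum


class Phase(Enum):
    DUAL_WRITE_READ_PLAIN = 1      # write both columns, still read plain text
    DUAL_WRITE_READ_ENCRYPTED = 2  # write both columns, read encrypted
    ENCRYPTED_ONLY = 3             # write and read the encrypted column only


# Placeholders ONLY: reversing a string is not encryption.
def fake_encrypt(value: str) -> str:
    return value[::-1]


def fake_decrypt(value: str) -> str:
    return value[::-1]


class UserStore:
    def __init__(self, phase: Phase) -> None:
        self.phase = phase
        self.rows = {}  # user_id -> {"email": ..., "email_enc": ...}

    def write(self, user_id: int, email: str) -> None:
        row = self.rows.setdefault(user_id, {})
        row["email_enc"] = fake_encrypt(email)
        if self.phase is not Phase.ENCRYPTED_ONLY:
            row["email"] = email  # keep the old column up to date for now

    def read(self, user_id: int) -> str:
        row = self.rows[user_id]
        if self.phase is Phase.DUAL_WRITE_READ_PLAIN:
            return row["email"]
        return fake_decrypt(row["email_enc"])

    def backfill(self) -> int:
        """Encrypt rows written before dual-writing began; return how many."""
        updated = 0
        for row in self.rows.values():
            if "email_enc" not in row:
                row["email_enc"] = fake_encrypt(row["email"])
                updated += 1
        return updated

    def verify(self) -> bool:
        """Second job: check every encrypted value matches its plain text."""
        return all(
            fake_decrypt(row["email_enc"]) == row["email"]
            for row in self.rows.values()
        )

    def drop_plaintext(self) -> None:
        for row in self.rows.values():
            row.pop("email", None)


store = UserStore(Phase.DUAL_WRITE_READ_PLAIN)
store.rows[1] = {"email": "old@example.com"}  # legacy row, plain text only
store.write(2, "new@example.com")             # dual-written row

backfilled = store.backfill()
verified = store.verify()

store.phase = Phase.DUAL_WRITE_READ_ENCRYPTED
read_after_switch = store.read(1)

store.phase = Phase.ENCRYPTED_ONLY
store.drop_plaintext()
read_after_drop = store.read(1)
```

The step of revoking read permission on the old column is not modelled here; in a real database it sits between switching reads and dropping the column, so a missed code path fails loudly while the data can still be restored by granting the permission again.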
Remember to use interfaces in all of this. I see quite a lot of people, when they implement these sorts of things, doing it with if/else statements, and it gets really, really messy. Just use your interfaces: put an interface in there that does the writing, put an interface in there that does the reading, and use your dependency injection framework to switch between which one gets used. It makes it really nice. Dan Ackroyd's got a fantastic presentation on interface segregation. If you've got time, give that a watch; it's really, really useful.
So, something pre-made. For the encryption we've just talked about, Paragon Initiative have got something called CipherSweet. If you're just using basic encryption like we just talked about, then definitely just use that. That'll solve most people's use cases. It's all in libsodium, it's all really well written, and it's got hooks for Doctrine and stuff like that, so if you're using Doctrine it all just links into it. It's really, really good, and it makes it very easy to move to it.
But if you've got something that's slightly more complex, it's a bit harder to move to things like that. For example, if you've got application A and application B both reading and writing to the database, or one that just writes to the database, or if you've got a reporting server reading a database that your application reads and writes to, it's a bit more difficult to solve, because you can't just have that one key system. Ideally it's a chance for a refactor. If you can make application B depend on application A, which then reads and writes to the database, that's definitely the way forward, though sometimes it's not feasible. And your reporting server probably shouldn't really be reading any of this personal data; you maybe need to refactor it to read from the application instead.
But if you can't, then you can use asymmetric encryption. This is what's known as a hybrid cryptosystem, and it's used by things like GPG and WhatsApp. So this is when you have a group message in WhatsApp: everyone's
message gets individually encrypted, but everyone can decrypt it. And this is how you do it in libsodium. Again you'd get your one-time key (not a nonce, sorry: a key, like a password but randomly generated). You'd encrypt your data with that randomly generated key, and then you'd encrypt that key with the public key of each application. Each application needs to have its own public key, so the secret encrypted with each public key will be different. Then you would save all of the data something like this: application one's got its secret encrypted with its key, application two's got its secret encrypted with the other one, and the payload is the shared payload.

To read it, you would look up the bloom filter value as before, find the data row, then look up the application-specific key and decrypt the secret. Then you've got everything you need to decrypt it, filter it if it needs to be filtered, and move on.

As an example, here is your JSON array containing all your encrypted data. Say you're application two: you'd look up application two, decrypt the secret so that it's now known to application two, then use that secret to decrypt the payload and get just the payload out of it.

And this is how it looks in code. When you're decrypting, you decode that array that we had there, and make sure you throw exceptions if there's any error at all. You look up your application in the for loop, and if you've found it then you get the key; if there's no key found, then you say you can't decrypt it. Then you use the same decrypt function that we saw earlier, just with a single key this time, passing in the application's private key to get it to decrypt the whole thing. So it's quite easy to do; it's not many lines of code.

Your bloom filters become a little bit more difficult, because you need to have
the secret key. Before, we had the one secret key for the application. Now you can either use no secret key, so just literally passing the data to CRC32 for example; or you can use a global one, so every application uses the same key; or you can go per application, so each application's got its own key and each application's got its own index. That's a lot harder to get to, but if you're using things like event sourcing it's a bit easier, because you say that this data's changed and each application can then generate its own bloom filters. So if you have got an event-driven architecture, then per application is quite easy; if you haven't, global is much easier.

Now some things to be aware of when you do these sorts of things. Truncated data: MySQL is quite helpful in the fact that if you just insert stuff into it, it'll say "yeah, fine, I'll do the best I can". Before 5.7 the default was to silently truncate a string, so if you've got a column that's defined as CHAR(1) and you insert two characters into it, it just stores the first one and gives you a warning. It's quite useful in some cases because you still keep a lot of the data, but now you've got that MAC on all the encryption. If you've lost a single character of it, your MAC doesn't pass any more, which means you've lost all of the data. So you should really configure your application to throw errors when this sort of thing happens, because if it has happened, you've lost that data. This is one of the things you should be checking for when you're verifying that your application is doing things correctly.

Case sensitivity: again, MySQL is helpful in a way that you probably don't realise a lot of the time. Before version 8 it defaults to latin1_swedish_ci, which is case insensitive, and from 8 onwards it defaults to utf8mb4 with an accent-insensitive and case-insensitive collation. Your application probably depends on it, and you probably don't know your application depends on it. If you search for an uppercase version of an email address or a
lowercase version of an email address, it doesn't matter. When you encrypt it and use your bloom filters, it does matter, so you need to make your bloom filters case insensitive. The easy way to do that is just to strtolower anything that goes in your bloom filters and store the lowercase version of it. You just need to make sure that you can still get your original data back out when you do these things.

White space as well: the SQL standard, SQL-92 it is, requires you to pad the strings so that the strings compared are exactly the same length. So if you compare one word with a space at the end of it and one word without, it pads the shorter one with a space and then compares those two strings. They're now identical, so it says that they match. When it does things like that, it's basically trimming it for you, and your application again might depend on it, so you need to check that. There's a DB Fiddle if you want to play with that. It happens in most major database engines, any one that follows SQL-92, which is a good chunk of them; it's just not an obvious thing that it does.

Performance tweaks: you can change your column collation. Latin1 is fine for it, because it's now just Base64-encoded data. It's probably the only time I'd ever suggest using Latin1 over UTF-8, but you can save a lot on storage if you do need to save on storage. Your row format should be DYNAMIC if you want to have multiple big strings in your tables. If it isn't DYNAMIC, which is when you've got older tables, then you have issues with it, because of how it stores part of the data off somewhere else, so you get performance issues. With DYNAMIC there are no performance issues, so you can store everything individually. If you haven't got DYNAMIC, just store it all as one big JSON blob. After 5.7 it's the default, so if you're using that then you're almost certainly using DYNAMIC, but always check with your DBA if you aren't sure about any of the performance on that.

Compression
is something that gets talked about quite a lot. Encrypted data is slightly bigger, so people say "come on, compress it". But with HTTPS, BREACH and CRIME both attacked the compression layer, where data gets compressed before it gets encrypted. So don't compress it unless you're absolutely certain it's safe against those types of attack. It's not that much bigger; just pay for the storage. Ciphertext isn't compressible, so don't even try to compress that; just store it as is.

Unique keys: sometimes personal data is used with unique keys. You need to move that to the application now. It's quite commonly used with the email address: you put a unique constraint on the email address, and now your application has to do that logic, so the application has to look up whether that email address has been used before. So anywhere you have got a unique key in your database, check whether it's used with personal data; if it is, you'll have to change your application to do that check once you've started your encryption.

Fuzzy matches: something else people do is they want to know where email addresses are LIKE hotmail.com, so anyone using Hotmail, or a salary between X and Y, so they get a list of people who earn a certain amount. It's possible, but these things need to be known in advance. You'd build your bloom filter on the domain part: you'd split out the domain part, build your bloom filter on just the hotmail.com part or just the gmail.com part, and that would be a separate bloom filter that you can then look up each time. But that has to be known in advance. Someone can't just come along and say "I need to know this now", because it'd involve decrypting all of your rows to work it out. The salary between X and Y is harder because it's a boolean value. It's still possible, but it's difficult.

Key management: if you lose a key now, you lose all of your data, which is quite important, because that key is only like 16 bytes or something crazy, but that now represents your
entire data. If for whatever reason you lose that, you've lost everything. Keeping it secret can be hard, and if you have a lot of applications at scale, certain people in the business will need to know it: if they need to be able to restore a backup, for example, they'll need to be able to get the key. Auto scaling complicates it a little bit, because whatever's spinning up those servers now needs to be able to put that key on the box. But there are solutions for it. An HSM is probably the best solution, but they're very expensive. AWS offer KMS as well, where they run the HSM for you and just store a portion of it on their HSM, so if you are using AWS, KMS is fine. If you're not using AWS, Vault is very, very good; I'd definitely recommend using Vault. There are other secret key management systems, but if you can, probably stick to those two; they are really good.

Key rotation: you need to think about how you're going to rotate your keys. When you rotate, it's going to be basically the same process as moving to encryption, because you need to put the new encrypted rows in there and go along that same process again. If your key is compromised, you need to think about going through all your data and rotating it all, because if someone then breaches your data, they've got the key for it as well, so you've basically lost your data. If someone breaches your data first, then your key is still safe, but rotate that key anyway. If either one gets breached, make sure you can rotate and that it's possible to. You can even think about rotating on a certain time period, every six months for example. If you ever do get breached, you can then date when the attacker got your data, so you'll be able to narrow down when they were in your network, and hopefully that should help flush them out.

Moving data to cache: this one's actually quite common as well. Using Redis or Memcached, you pull out all this data from the database, you decrypt
it all, and you think "I'll just throw it all in Redis, because now it's faster to look up again". But you've basically just moved your unencrypted data to Redis, and Redis has got a lot less security around it than something like MySQL. You've just moved it from one data store to another. So if you are using Redis or Memcached, make sure you're storing the encrypted versions, and just decrypt it every time your application needs it.

Low entropy data: this is the boolean value we talked about before. It's hard, and you should avoid it if you can. Your indexes probably didn't work as you thought they did before, either. Any database doesn't work particularly well with boolean data that's indexed, because unless the values are roughly equal, one of them is always going to be more common than the other: there are going to be more falses than trues, or more trues than falses. At that point most database engines will say "I may as well just do a full table scan", because it's going to be quicker than looking up each row individually from the index. You can do booleans, and there are two ways you can do it really. You can store it together with another bit of the data, so that it gets a bit more entropy, or you can put some fake values in to seed the bloom filter a bit more, so there are more values in the bloom filter that don't match. With your bloom filter, as long as not everything matches, that's what you're trying to get. You don't want someone to be able to look at just the bloom filter and get your data back; you need some values that don't match, and only your applications know which, so you just need to put some more data in.

And timing attacks. Now, because of the filtering, my request for example will have other customers' data in there while it's trying to filter it out. This is string equals: if you do one string === another string, this is the process
it goes through. It checks that the length of the first string equals the length of the second string. If that's true, it carries on; if it's false, it exits very early. It's quite surprising how much you can learn from this. You can try this at home by writing a simple script just to compare two strings: when the lengths match, it's actually quite a bit slower. If you iterate over it 15 or 20 times, you'd be able to tell how long that data is without knowing what that data is. And then there's zend_string_equal_val. That depends on your architecture, and it's all been rewritten now, but basically it just loops through character by character, so the more of your string matches, the slower it gets, and you can then slowly work out over time what that data is being compared against. Instead of triple equals, use hash_equals, which is timing-attack safe. Any time there's other customers' data included in your request, make sure you use hash_equals.

And thank you for listening. Are there any questions? There's time for questions.

Yeah, part of CIA is integrity, and a lot of your talk was about encryption. Is there a particular... you mentioned the example.

Yep, so libsodium does all of that. The entire row you mean, or just the data you're looking at, like there? So libsodium does all of it built in. Let me go back to wherever it was... it's not that one, the other libsodium one, sorry. So with libsodium, when you use crypto_secretbox, that does all of the integrity for you. If it's the entire row that you're looking at, then you can just get the integrity out of this: you just use Poly1305 again for the whole row. All you're really defending against there, though, is somebody having access to your database and moving one bit of data to another place, like one portion of the data to another row. So it's kind of a known attack there. For most cases, using libsodium's crypto_secretbox will do all that for you, so you don't have to worry about any of it, but
if you do need the whole row, then just use the Poly1305 methods of it. If you look at the libsodium guide, there are ways to use just the Poly1305 portion.

Thank you.

Hello, thanks for the talk. I wanted to ask how you maintain your development environment in this respect. On your local machine, in a development cluster, do you also maintain all this cipher, security, key rotation?

Yeah, you should be trying to keep it in there, but don't use the same key as live. Each environment should have its own secret key. Your dev environment and your stage environment should be using things like Faker to generate the data, so there's no actual customer data in there, and each one of them should mirror live in the same way. It's still encrypted; the difference is that in dev everyone knows the key, and in live nobody knows the key.

Okay, so you're, for example, maintaining a Vault solution in your development cluster, just with the security rules, the key rotation, all this relaxed?

We haven't actually rotated any keys or anything yet, but I would try to make your dev environment and your stage environment match your live one as closely as possible. When you start introducing differences, that's when you start getting weird errors that you can't easily reproduce. So you should always try it on dev, then stage, and then put it on live when you're happy everything's working, but everything in live should be replicated in dev and staging.

Okay, thank you very much.

When I've done a bit of encryption, our stuff's come out as binary. What would you recommend storing it as in the database, hex or binary?

I've stored it as hex, just so it's a bit more readable, but binary works just as well. It's just a bit easier for people to work with; it doesn't really make a difference either way. If you're comfortable looking at it in binary, then go for it. People kind of like to be able to still read it in their database tools, but you can't really do that with binary data; you end up with weird characters,
so hex is fine.

Something you said during the migration bit made me think. Obviously in a classic database you have columns of data. Once you go to encryption, does it make more sense to just have a blob with unstructured data and then encrypt the whole lot in one?

It really is up to you. The way we've done it is to have an encrypted version of each one, so there's an encrypted surname and all of that; there are loads of columns that you can then use. The benefit of that is that, because you're getting more rows back with the requests anyway, if you had everything in a big blob you'd be getting loads and loads of data back from your bloom filters. If you pull back just the data you want, like just the surname or just the email address, it really does bring the network time down. So it depends on your application. It's easier to use one big blob, but I think it's better to use multiple columns. Check with your specific application.

So you'd still have individual bloom filters for each column that you'd be indexing?

No. The bloom filters I store in a separate table, usually with two columns in there: the first column is what type of data it is, so it's a surname bloom filter for example, and then there's the bloom filter value. That's how I store my bloom filters, but you can store a bloom filter inline. The difficulty with storing it inline is that if you need to start doing things like your LIKEs and your fuzzy matches, you need another column for each one of them, and it quickly gets out of hand because you've got a lot of columns to maintain. So the data itself I store in different columns, but the bloom filters I store in a table, one in each row.

Different databases, like DynamoDB or MySQL, offer encryption at rest, so the data is all encrypted at rest. Why should we go for the bloom filter?

Encryption at rest really only defends you when your application is offline. When your application is
working, it may as well not be there, because you're still reading the plaintext values out. For example, if you get an SQL injection attack on your website and you've got full disk encryption, people will still get your entire database out; it won't make a difference. If you've got application-level encryption, they won't be able to get the whole database out, and even if they do get the whole database out, it is fully encrypted. So full disk encryption defends against some things, but I don't think it defends a running server very well.

There's just one at the front as well.

Just from a project and team perspective, how long did it take you to go from all plain text in your database to fully encrypted, and how many of your team fully understand the way this works?

We're still trying to get it in. I've used it on other projects. At Sykes, we're trying to get it in; the peak period for travel is in January, so it was a bit risky to go for it this January, but as soon as the peak period's over we're going to look at getting it in properly. We've got proofs of concept working and things like that, but in terms of lots of developers using it, we've not got there yet. I've used it on smaller projects of my own, but we're still moving it into Sykes at the minute.

There's a question at the front. Just at the front, sorry.

Yeah, thank you for your talk, it was a lot of useful information. I was wondering, will you make your slides available afterwards?

They'll go up straight after, so yeah, have a look on joind.in and rate the talk there.

So we can download your slides and look up the information?

Yeah.

Thank you.

Any other questions?

This is about caching. You mentioned that we need to store the encrypted data and then decrypt it every time. Is that going to affect the performance at all?

Very, very slightly. You're using the bloom filters now, and they're based on numbers, so on integer columns. So for the
database lookup (I know it's not specifically caching), you get a slight speed increase by using bloom filters rather than text searching, and then you get a slight decrease by having to decrypt the encrypted value. So for the caches you'll just get the slight decrease. For the applications I was writing, it was about a 3% decrease in speed. One of the things to do as you're moving into it is to make sure that your performance is still there; if it isn't, there's probably some reason why it isn't. ChaCha20 is designed to be really fast on most architectures, so it's probably not going to be the bottleneck in your application. But have your monitoring turned on. There are loads of things like Datadog and New Relic that monitor your application's performance. When you do start doing these things, make sure that you definitely do monitor it, though it shouldn't make much of a difference to your performance.

Okay, thank you.

Any other questions? Right at the back, right at the back row.

I'm curious about the key management. You mentioned the fact that if you're not careful with the keys, losing 16 bytes loses you all of your data. That wasn't really addressed in the talk. Is there any advice you can give for dealing with that?

So Vault is very good. I would try to use Vault or an HSM. With an HSM you get the guaranteed safety. Vault is trying to be the HSM, really, so it's managing your keys for you. The difficulty is that a lot of people might have access to Vault, so it's about reducing the number of people with access while making sure it is still secure in that environment, so only the applications that need to know the key are in there. And make sure you back up your Vault, obviously.

Forgive my ignorance, I don't know what an HSM is.

It's a hardware security module.

Thank you.

So it's basically a hardware device that stores all of your keys for you. They're expensive, but they're very good. I think Amazon's is about 16 grand a year, so if you don't
currently have an HSM, I wouldn't recommend getting one just for this; use something like KMS or Vault. But if you do have an HSM, throw it in the HSM.
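
As a rough illustration of three points from the talk (lowercasing and trimming values before they go into a bloom filter, building a separate index on just the email domain, and using a timing-safe comparison when filtering decrypted rows), here is a minimal sketch. The talk is about PHP, where strtolower and hash_equals do these jobs; this sketch uses Python's standard library as a stand-in. The HMAC-SHA-256 construction, the 16-bit index size, the key values and every function name are assumptions made for illustration, not the speaker's implementation.

```python
import hashlib
import hmac


def normalise(value: str) -> str:
    # Reproduce in the application what a case-insensitive, space-padded
    # SQL comparison did implicitly: ignore trailing spaces and letter case.
    return value.rstrip(" ").lower()


def index_value(value: str, key: bytes, bits: int = 16) -> int:
    # Keyed hash truncated to a small integer, so that several different
    # inputs can share one index value (assumed construction).
    digest = hmac.new(key, normalise(value).encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)


def domain_index(email: str, key: bytes, bits: int = 16) -> int:
    # Separate index over just the domain part, for "LIKE hotmail.com" style
    # lookups. This has to be built in advance, as the talk points out.
    domain = normalise(email).rpartition("@")[2]
    return index_value(domain, key, bits)


def matches(candidate: str, wanted: str) -> bool:
    # Timing-safe comparison, Python's counterpart to PHP's hash_equals.
    return hmac.compare_digest(
        normalise(candidate).encode("utf-8"),
        normalise(wanted).encode("utf-8"),
    )


def filter_rows(decrypted_values: list[str], wanted: str) -> list[str]:
    # Rows that share an index value may belong to other customers, so the
    # final filter runs on decrypted values with the timing-safe comparison.
    return [value for value in decrypted_values if matches(value, wanted)]


# Hypothetical keys; in practice these would come from KMS, Vault or an HSM.
EMAIL_KEY = b"e" * 32
DOMAIN_KEY = b"d" * 32

lookup = index_value("Alice@Example.com ", EMAIL_KEY)
candidates = ["alice@example.com", "mallory@example.com"]
print(lookup, filter_rows(candidates, "Alice@Example.com "))
```

The index only narrows the search to a handful of candidate rows; the exact match still happens in the application after decryption, which is why the comparison there needs to be timing safe.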