 Hi again everyone, now we have Karni Krishnan Musami from Lazada. Thank you. Hi, good morning everyone. Today my talk is about data masking on MySQL. But we basically do on MariaDB. So I am the one who is the one-way one talking other than MySQL today. In this room. Yeah, in this room. So I am also a big fan of MySQL, so don't take that person. He won't be saying. So I work for Lazada. We are one of the e-commerce group in Southeast Asia. And we are a part of Alibaba also. And my talk is about data masking. So what is data masking? So it's basically in some cases we need to change the original data with the random characters or with the random values to share the data outside the companies for third party vendors or something. So for that reason we are using it. Data masking is good then because before masking it's five-digit salary after masking it's six-digit salary. Yeah. So to protect the confidential data in the test environment. So the reason for the data masking is to do the analytics, market research, development and then for the PCI compliance and it varies according to their company needs. So what is sensitive data? So the sensitive data is basically like credit card information and bank details and customer's personal information mostly. And then it also depends on the business needs. So these are the data masking methods which were available for MySQL and which we tried. So first one is Pyruple. So Pyruple is basically it will capable of doing the data masking but it is not a very stable one. It is very hard to maintain a Pyruple replication thing. So next is the column level privilege. Column level privilege is it's very simple to use but the thing is when you have too many users it is very critical to manage and then in the test databases if your application is pointing to the test DBs then column level privilege won't help in that case. So next one is the proxies. So MariaDB has its own proxy like max scale and then next is proxy SQL. Using the proxy you can do still able to do data masking but still it won't descent one days which I am going to discuss in the next slides. So there are other commercially third party tools were available for doing data masking on MySQL and then the final method not the final method which we figured out and which helped us is triggers. So reason for not using proxy SQL is in proxy SQL you cannot able to do a select star on a table when you do a masking on the table and you need to write a really tough regular expressions things for doing a data masking. So we skipped this proxy SQL and then the reason for not using max scale is, max scale has its own descent one days like this kind of query still reveal the personal information and this kind of queries you eventually if you try the random numbers in the where class still you can able to get it done because the data inside the database is not fully encrypted or masked it is only masked in the client side only. So after that we tried to do after doing lots of spending lots of time on trying all other tools we tried to use a simple method it is triggers so everybody knows how the trigger will work in statement based application. So in our case we just need to do masking only on the slaves we don't need to mask or we don't need to encrypt anything on the masters. So if you are using statement based replication the statement the binary log contains the SQL statements and then that SQL statements will get fired in the slaves and then that will invoke the triggers. Master it's a production. No, read queries will go to the master but the environment which is in the master it's fully a production environment there we cannot mask anything internally in the database so that we I will show you the setup of ours in the next few slides. So if you are using statement based replication this will work very easily you will just need to create triggers only on the slaves and then when you insert or when you update something on the masters the trigger get invoked and then in the slaves the data will be masked but if you are using row based replication that thing won't happen because in the row based replication the events in the binary logs contains only the row changes so it won't have the statements which were executed on the master. So for example if you are using PT online schema change you will get to know the difference because during the time of when you run a PT OSC on a master it will create a few three triggers in the master but that three triggers won't have any effect on the slaves but still the online schema change will be successful due to the binary log events. So in this case this is the example for the row based replication triggers if you have triggers in the slave but it won't have any effect on the slaves because of row based replication but in our environment we are totally using row based replication. In this case MySQL Perkina servers does not have this future which we are using so that we used MariaDB so MariaDB has this variable slave run trigger for RBR. So what this will do is in the in this kind of situation when you enable the variable on the slave so this will trigger the this will trigger the triggers fire the triggers and then it will mask the data in the slaves so it looks like pretty simple you just need to have your triggers in the slaves and then get it done. So you just need to create a triggers for this two events in your slaves create update so if you are in this case in the row based replication in the MariaDB this is the run so your data will get masked in the slaves without any issues but it is very simple when you are using just two tables or very few columns you are masking but in our environment we are masking more than 400 columns across 110 tables so it is very hard to maintain. So what we what we tried to do is we just automated all the processes using our we just write our in-home python scripts and then we automated it and then the adding and the removing of the sorry adding of the column for the new data masking is a configuration based on so this is the example for the configuration so we maintain the config like this masking method database name table name column name this is the example hash column is the masking method database name table name and then the column name so we use three kinds of masking methods hash column hash phone hash email so that means hash column means it will do a md5 on the column hash phone means it will create a random number for that column hash email means it will just create some dummy value on the email column so our setup is like this so you ask the question that why we are not running with this because in the production environment we need a real time data we do not want we do not want to take any data in our production environment so this blue one is our production environment this is our testing or data science environment so what we will do is we will and then the blue ones are our Percona servers green ones are the MariaDB so you can see the legends over there. It is a sensitive data that you want to hide it should happen on all the servers because there could be somebody who is directly talking to master and then getting that. No the thing is in the master and in the production environment only the application will have the access this environment will be accessed by the third parties we will give the data to the third parties for a various reasons. So yeah in the middle versions for in the MySQL case you can even use the proxies that we discussed in the previous slides right the proxy SQL and max scale. And now this is a real time environment which this service were used by our development environment also so we do not want to reveal our customer information to the developers so what we will do is we will deploy all the triggers in this server so this is act like a master for the testing servers so from here it will get replicated to the testing environment and the development environment so this green ones are MariaDB instances. So for doing this the challenges are building a new data masking server which means creating this server is a really a time consuming task if you are having your database sizes very huge so in our environment we our database sizes are really huge so building initial setup will take quite a long time and then if you are using PTOAC then this is another pain because whatever you do on the masters the DDL needs to be replicated over here over all this one the PTOAC will have some issues on this one that I am going to cover on the next slide how we handled it. So building a how to build a new data masking server so you need to take a backup from the master and then you need to run a update on all the tables which you know which we intend to mask for example in our case we need to run update update on over 110 tables so and then enable enable the slave trigger parameter and then set up a replication between the masters and slave in this case in our case we use GTIDs in our Percona servers the MySQL GTID and MariaDB GTIDs are totally different but still you can set up a replication using the native method but you cannot do vice versa MariaDB GTID to Percona GTID or MySQL GTID won't work so this things are mostly what we will do is we have automated scripts for it when you run the scripts it will automatically create the update statements and update statements and triggers for insert and update events so we will just try to deploy it on our servers so what are the challenges of PTOAC the PTOAC most of them will know how PTOAC works PTOAC is basically works based on the triggers in this case when you use a PTOAC the replication between the master and slave will get broken whether you are using a whether the table which you are trying to alter is having a masking column or not regardless it will fail so this is how PTOAC work in row based replication so PT in the master it will that there are three triggers will be created on the master the same three triggers will be replicated on the slave but the triggers won't have any effect on the slaves the inserts updates and deletes which are happening in the main table will get will get forwarded using using the triggers to the new the temporary tables so once the once the PTOAC is successful this table get dropped under the stem table get renamed as the old table so as we have the triggers in the slaves in the RBR environment what it will happen is that the triggers which were created by the PTOAC does not have any effect on the slaves but as we enable the slave and trigger RBR on the slaves this triggers get invoked in slaves as well so it will throw this error and it will break the replication between master and slave in this case it is very simple to fix this one you just need to drop the triggers which were created by the PTOAC only on slaves once you drop the triggers which were created by the PTOAC on the slaves and then if you start the replication it will it works fine because the triggers won't have any effect on the slaves for the PTOAC the online schema change will be happened using the binlog events it is always recommended to have master and slave in sync because you never know what is the updates that are happening on the master and at any time on the slave general recommendation is master and slave should be in sync whether it is a video no no it is in sync only no no we are just dropping the triggers which is existing on the master yeah which is existing on the master the reason for dropping the triggers is the trigger won't have any effect on the slaves so this is the way PTOAC works in the row based environment this triggers will still present in the slaves but the triggers won't get invoked on the slaves the events the updates and inserts happening on this one will be forwarded to this stem table using triggers but in this case of slave the new table get the data from the binlog events not from the triggers so if you drop the triggers it won't be any issue though the next scenario is when you are when you are running a PTOAC on that on the table which is having a data masking column in that case what will happen is in that case what will happen is we need to so as I mentioned previously so if you go back to this slide you can see this table has a in before PTOAC was successful this table has a mask data but this table does not have a mask data due to it is directly directly replicated from master to slave using binlog events so when the PTOAC is successful this table got dropped under this stem table will be replaced as a old table so in this case the columns which you were masked on this table won't won't exist because this has a this has a real this has a unmasked data from the masters so in that case what you need to do is you need to stop the slave and then update the table with the you need to run an update command for a masking the columns after that you need to deploy the triggers for the triggers for that table and then start the replication so these things also will be handled by our automated scripts so this is the example for that one so this has a mask data and then this one has a unmasked unmasked data because it is directly coming from the master so in that one the mask data this table got dropped and unmasked unmasked data which is existing in the stem table will become a original table so in that case what you need to do is you need to run the update command and then create a trigger for the tables and then start the slave so all this process were fully automated and we were deployed and we were successfully running this whole project for more than two years so still now we don't have any major issues for this one so all the scripts were available on the github so you are welcome to download it and use it if you guys want thank you