Hello, everyone. My name is Kaliswillingam, and I'm a Principal Engineer at Intuit. I'm going to talk about the data governance and compliance initiatives we have undertaken at Intuit and the learnings from them. The agenda for this discussion: I'll start by quickly introducing what we do at Intuit, then talk about the data privacy problem we are trying to solve, go through the high-level design approach we used, cover the challenges we ran into with that design, and finally take Q&A.

I'll start with an introduction to Intuit. Intuit is primarily in the financial domain and offers multiple product lines; these are some of the key ones. The first is TurboTax, which provides tax automation for our customers. The second is QuickBooks, which provides accounting automation. The third is Mint, which provides personal finance management. There are other products as well, but I'll keep this short.

In all of these products, Intuit handles key financial information about our customers. Recently, a number of countries have introduced regulations that increase customers' rights over their data, and that is the topic of data governance and privacy. Two regulations in particular started this journey. The first is GDPR, the General Data Protection Regulation, introduced in Europe, which provides data privacy rights to European customers. Similarly, a law called CCPA was introduced in California that gives increased rights to Californian customers. Intuit, aiming to be a good steward of customer data, decided to support these regulations across all the regions where we operate. So what does this mean for us?
There are two kinds of requests that Intuit has to support in order to comply with these regulations. The first is the right to access information. What does this mean? It means Intuit needs to enable customers to know what Intuit knows about them: the data we hold about them, documents, or any other information we have collected, including through third parties. How is this delivered? The information is collected from different places and assembled into a single archive so we can provide it to the customer in an easily consumable form, and we have to do it in a time-bound manner to comply with the regulations.

The second right provided by these regulations is the right to delete. Customers can request deletion of all or part of their information from a company, in this case Intuit, and they can selectively choose which offering's information they want deleted. Again, these requests have to be fulfilled within a specific timeframe.

So how did we go about solving this problem? This diagram shows a very high-level architecture of the approach we used. There are four components in the flow. The first is the request manager, which manages the requests that come from customers to perform either of these operations. When the request manager receives a request, it publishes it to a queue or topic, which is the second component. The third component is what we call a data manager. There are many of these data managers deployed, because there are many domains within our products that handle customer data, and each domain keeps its own data manager so it can focus on its own data. The fourth component is the document management platform.
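Those four components can be sketched end to end, with an in-memory queue standing in for the real message bus. The class and field names here are illustrative, not Intuit's actual APIs:

```python
from queue import Queue

# In-memory stand-in for the real message bus (a topic/queue in production).
work_orders = Queue()

class RequestManager:
    """Receives customer privacy requests and publishes work orders."""
    def submit(self, customer_id, kind):
        assert kind in ("access", "delete")
        work_orders.put({"customer_id": customer_id, "kind": kind})

class DataManager:
    """One per domain; consumes work orders and acts on its own services."""
    def __init__(self, domain, doc_platform):
        self.domain = domain
        self.doc_platform = doc_platform

    def handle(self, order):
        if order["kind"] == "access":
            # Collect this domain's data and publish it for archival.
            data = {"domain": self.domain, "customer": order["customer_id"]}
            self.doc_platform.append(data)
        else:
            # Delete flow: call the domain's own services to erase data.
            pass

doc_platform = []  # stand-in for the central document management platform
managers = [DataManager("payroll", doc_platform),
            DataManager("payments", doc_platform)]

RequestManager().submit("cust-42", "access")
order = work_orders.get()
for m in managers:          # in production each manager consumes independently
    m.handle(order)

print(len(doc_platform))    # each domain contributed one data set -> 2
```

The key design point is the decoupling: the request manager never calls domains directly; it only publishes a work order, and each domain's data manager independently picks it up.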
As the individual data managers collect data, they publish it to the central document management platform so the information can be archived and provided to the customer in the case of an access request. In the case of a delete request, the same flow runs from the request manager to the queue to the data managers, and the data managers in turn connect to the individual services they are responsible for and perform the delete operations. That is the high-level flow involved in solving this data privacy problem.

Now, what were the top challenges we went through? This slide shows the top five. The first is competing compliance regulations. Since we are in the financial industry, data privacy is not the only regulation we have to comply with; there are several kinds of regulations our company is subject to, and sometimes these regulations compete with each other. We will go into more detail shortly. The second is that, as we have seen, a number of systems are involved in solving this problem, so we have to track the status of a request in a distributed manner; how we did that is the second challenge. The third challenge is that we employed message buses and queues, and these message queue architectures posed specific challenges in scaling to the volume of requests coming in; again, we'll go into detail in the coming slides. The fourth is how we organize the content from the different data managers so that we provide usable information to the customer. The last one is resumable offline processing.
You can imagine that a number of data managers provide this information, and it all has to be processed offline to produce a single archive. While producing that archive, we can run into problems, such as an intermittent network failure. In such cases, we want the processing to resume from where it left off. That's the fifth challenge we faced.

Let's dive deep into each of these problems. The first is competing compliance regulations. Under CCPA, a customer has the right to request deletion of their data. Now, suppose the customer has a QuickBooks Capital account. QuickBooks Capital processes loans for our small business customers, and when it processes a loan it uses certain documents and data. QuickBooks Capital has to adhere to CCPA, but it also has to adhere to other financial regulations, in this case Consumer Financial Protection Bureau record retention regulations: for any loan that was issued using data and documents provided by the customer, that supporting information has to be kept for a certain duration. So here one regulation asks us to delete the information while another asks us to keep it, and we needed special handling. The way we did it is to keep, on each data element and document, an attribute that tells us whether the data is required to be retained for another compliance obligation. When a CCPA request comes in and the QuickBooks Capital offering is asked to delete that information, before we delete the data we check whether any other compliance requirement still applies to it.
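A minimal sketch of that pre-delete check, assuming a per-record flag; the attribute name `retention_hold` and the record shape are illustrative, not Intuit's actual schema:

```python
# Each data element/document carries an attribute saying whether another
# regulation (e.g. CFPB record retention) still requires it to be kept.
records = [
    {"id": "loan-doc-1", "retention_hold": True},   # needed for loan records
    {"id": "profile-2",  "retention_hold": False},
]

def process_delete_request(records):
    """Delete only records not held by a competing regulation."""
    deleted, retained = [], []
    for rec in records:
        if rec["retention_hold"]:
            retained.append(rec["id"])   # keep until the retention period ends
        else:
            deleted.append(rec["id"])
    return deleted, retained

deleted, retained = process_delete_request(records)
print(deleted, retained)   # ['profile-2'] ['loan-doc-1']
```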
If so, we keep that data until the competing compliance requirement no longer needs it.

The second challenge is distributed status tracking. As we saw in the architecture, a delete or access request spans multiple domains, and each domain at Intuit contains multiple services. A service, in order to delete the data, might need longer than the timeout of an HTTP call, so it typically performs the delete asynchronously to fulfill the request. When it processes asynchronously, it has to report status back to the caller, in this case the domain data manager. The domain data manager has to orchestrate across the multiple services it is responsible for and consolidate their status back to the overall orchestrator that called it, and the overall orchestrator has to collaborate with a hundred different data managers to track the overall request status. So there are multiple places where we need status tracking, and we built infrastructure to keep track of status at each level to fulfill the complete request for the customer.

The third challenge is message distribution and scalability. We leverage ActiveMQ to process these work order requests. When a work order comes in, it lands on a topic or queue, and that topic is backed by a distributed set of brokers; in this case, three brokers handle the processing. In a distributed broker architecture, each broker holds its own portion of the queue, and the consumers connected to the topic or queue are really connected to an individual broker.
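Before continuing with the broker setup, the multi-level status consolidation just described, from services up to domain data managers up to the overall orchestrator, can be sketched as follows; the specific rollup rule is an assumption, not Intuit's actual logic:

```python
def rollup(statuses):
    """Consolidate child statuses into one parent status.
    Rule (an assumption): any FAILED -> FAILED, any PENDING -> PENDING,
    otherwise COMPLETE."""
    if "FAILED" in statuses:
        return "FAILED"
    if "PENDING" in statuses:
        return "PENDING"
    return "COMPLETE"

# Each domain data manager tracks its own services' asynchronous statuses...
domain_statuses = {
    "payroll":  ["COMPLETE", "COMPLETE"],
    "payments": ["COMPLETE", "PENDING"],   # one service still processing
}

# ...and the overall orchestrator rolls up across all data managers.
overall = rollup([rollup(s) for s in domain_statuses.values()])
print(overall)   # PENDING
```

The same consolidation runs at every level, so a single slow service keeps the customer-facing request in a pending state rather than reporting a premature completion.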
Now, if a consumer connected to broker one performs its work very quickly, messages get delivered to that broker's portion of the queue much faster, and messages accumulate on that broker, while the other brokers, even though they have consumers ready to work, may sit idle because they don't have enough messages to hand out. What we had to do is tune the whole system based on the consumers' processing speed and the number of messages each broker pulls in, so that messages are distributed evenly to all the consumers and the requests are processed effectively.

The next problem is how we organize the content we provide to our customers. As you can imagine, Intuit offers multiple product lines, each product line has multiple data managers, those in turn have multiple services, and they all deliver data in their own formats and structures. If we simply dumped all of this data into an archive, it would not be very useful to the customer. So we organize the contents into folders and subfolders, and in every folder that contains data we also provide a README file. The README describes the structure of the individual files and explains how the data should be understood. This makes it easy for the customer to open the archive, read through it, and find where their content is stored.

The final learning is resumable offline processing. As you have seen, a hundred different data managers operate on a particular work order, and the work order might require several offline processing steps within individual services. Assume one of the data managers is performing an archival process on the files it has received.
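Before going into the archival details, the folder-plus-README organization described a moment ago might look like the following sketch; the folder names and file formats here are illustrative, not the actual archive layout:

```python
import json
import tempfile
from pathlib import Path

# Build a small archive layout: one folder per product, each with a README
# explaining its files' structure so the customer can navigate the archive.
root = Path(tempfile.mkdtemp()) / "customer-archive"
for product, files in {"TurboTax": ["returns.json"],
                       "Mint": ["budgets.json"]}.items():
    folder = root / product
    folder.mkdir(parents=True)
    for name in files:
        (folder / name).write_text(json.dumps({"example": True}))
    # The README tells the customer what each file contains and how to read it.
    (folder / "README.txt").write_text(
        f"This folder holds your {product} data.\n"
        f"Files: {', '.join(files)} (JSON, one object per file).\n"
    )

print(sorted(p.name for p in root.iterdir()))   # ['Mint', 'TurboTax']
```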
In this archival case, the data manager could have collected thousands, even millions, of files, and all of them need to be archived to produce the single archive for the customer. This typically takes a long time, and during execution a network fault or infrastructure issue can cause the process to fail. Suppose I'm processing 100,000 files and the failure happens after 50,000; if I have no way to restart from where the failure occurred, I have to start over from the first file, which means processing all 100,000 files again. To avoid this, we extended the status tracking mechanisms beyond the data managers and services, down to the individual output records these services produce. We can track each individual file's status, whether it has been successfully archived or not, and if we run into problems and have to restart, we resume from where we left off.

So those are some of the key learnings we went through while implementing data governance at Intuit. With these, our customers are able to request access to their data and to delete their information successfully, and this whole process is well liked by our customers. We typically process a number of these requests every day through these systems. Thank you. Thank you very much.

That was a very detailed and insightful talk. Are there any updates regarding compliance and the architecture that you've used at Intuit in the past year or so?

Anubhash, thank you for asking that question. The architecture we designed a couple of years ago, when CCPA came into force in California, has stabilized and is used extensively beyond CCPA as well.
We've been able to extend it to multiple other compliance needs, such as the Australian CDR and GDPR. So there are no major changes on the architecture side; essentially the same pattern is leveraged for multiple use cases.
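As a closing illustration of the resumable offline processing discussed in the talk, here is a checkpoint sketch: per-file status is persisted so a rerun after a failure skips already-archived files. The status file format and names are assumptions, not Intuit's actual mechanism:

```python
import json
import tempfile
from pathlib import Path

# Status file recording which files have been archived; on restart we skip
# everything marked "done" and resume from the first unfinished file.
status_path = Path(tempfile.mkdtemp()) / "archive-status.json"

def load_status():
    return json.loads(status_path.read_text()) if status_path.exists() else {}

def archive_files(files, fail_after=None):
    """Archive files, persisting per-file status so a rerun can resume."""
    status = load_status()
    processed = []
    for i, name in enumerate(files):
        if status.get(name) == "done":
            continue                     # already archived in a previous run
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("simulated infrastructure failure")
        processed.append(name)           # real work: add the file to the archive
        status[name] = "done"
        status_path.write_text(json.dumps(status))  # checkpoint after each file
    return processed

files = [f"file-{i}" for i in range(6)]
try:
    archive_files(files, fail_after=3)   # first run dies partway through
except RuntimeError:
    pass
resumed = archive_files(files)           # rerun resumes where it left off
print(resumed)   # ['file-3', 'file-4', 'file-5']
```

In a real system the checkpoint would be batched rather than written per file, but the resume-from-failure property is the same.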