ask you something, you can comment before. And Rishwin from Hasgeek, who will play the talk, will take the cue from Rishu as well. So when he says "over to the talk", Rishwin will play the talk. Is there a delay somewhere? Is there a delay? Yeah, there's some weird delayed echo. There you go. Hey, Jan, is it at your end? I had the YouTube window open at the same time, so I just closed that; that was maybe causing it. YouTube is delayed, so yes, that's the problem. No worries. All right, let's wait for folks to come in. Sorry, Jan, go ahead. Yeah. So we are live now, so we can start in two minutes, I think. Which means we are live on YouTube, that's what it means. All right, let's get started.

So welcome, everyone. Great to have folks on this live chat. Today is the first talk in the Privacy as Risk Assessment and Risk Mitigation project, hosted by Rootconf in collaboration with Privacy Mode. Let me quickly touch upon what Privacy Mode is. Privacy Mode is a program at Hasgeek where we talk about technology and the policies related to data protection and privacy, from both the consumer and the manufacturer standpoint. I'm Rishu, your host for today. I am a technology leadership consultant at Boston Consulting Group. And along with me we have Jan. Jan comes with distinguished experience of almost two decades across multiple companies such as Yahoo and Aerospike, and we're going to hear from him in a while.

Before we start, a quick word about the project. As user data becomes more and more commonplace and more distributed across various companies and firms for various technical use cases, and as we move further towards digitization, new compliances keep coming up. A lot of organizations have to ensure that they are compliant when it comes to sharing user data, given how much of a concern privacy is in the current ecosystem and landscape. To take a simple example, GDPR was something a lot of organizations had to work on; in my own experience with companies such as LinkedIn and Visa, we had to ensure that almost a decade's worth of data was made compliant through and through, because there is no cut-off window for existing customer data when these compliances come in, and they carry legal ramifications as well. There is an upcoming PDP bill as well, which makes it imperative for organizations to assess their privacy preparedness through the prism of data risk, given how many major leaks and data breaches have become commonplace in the past half a decade or so. We all carry a certain element of risk when our data gets exposed, especially as sensitive industries such as fintech move towards account aggregators and lending, with a lot of non-financial players who are new to the field coming in, and more and more of people's personal and financial data being distributed. That makes it important for organizations to figure out what they can do, as fast as they can and in as shift-left a way as possible, to meet these compliances so that they don't get left behind in the product spectrum. Now, through this project, let's call it PRAM for short, since the full name is quite long without an abbreviation.
What we want to do is gradually make participants aware of how to design organizations, top to bottom, to manage privacy risk, to manage data at scale, and to shift security validations further left, giving consumers the maximum amount of trust when they entrust their data into an organization's hands. With that pretext, I'll quickly give a more detailed introduction for Yan. Yan Hecking is the chief architect at Borneo, and most of this talk is going to be around the work he has been doing at Borneo. Again, he has had a distinguished career of almost two decades across the industry as an architect with some big names such as Aerospike, Yahoo and more, and he's primarily going to talk about shift-left and how to fast-track some of these compliance requirements across organizations. Before we start the talk, as we usually do at Hasgeek and Rootconf events, please leave your questions in the comments, in the YouTube chat window and, if you're joining us via Zoom, in the Zoom window as well. After the talk, Yan will be taking questions; feel free to raise your hand and ask questions on air, and we'll be super happy to take both. Before we get started, a quick plug for Rootconf. I believe most of us here are already regular participants. Started in 2012 as a DevOps and SRE practice, Rootconf has now spent more than a decade bringing practitioners together to talk about not just existing but also emerging trends. From DevOps we've come to DevSecOps, and we've started branching out into MLOps, CloudOps, DataOps and privacy. Our editions are usually organized in Hyderabad, Pune, Delhi, pretty much all over the country, and our goal is to bring industry practitioners together so that we can all learn from each other. And with that, over to the talk; looking forward to interacting with everybody after the talk as well. Thank you so much.

Hello, my name is Yan. I'm currently the chief architect at Borneo. Here to the right you can see a picture of myself with my daughter on Borneo a couple of years ago, on one of our family vacations. Borneo is a start-up based in Singapore, but a large part of our engineering team is actually based in India. We have been building what we call the guardrails of the data economy since 2019. What do we mean by guardrails of the new data economy? Both myself and our CEO and founder, Britt B, worked at large-scale companies like Yahoo, Facebook and Uber prior to starting Borneo. At these places we saw firsthand how these companies were amassing large amounts of user data and the inherent value in that data, but also, conversely, the risk associated with that data. For example, if sensitive user data got leaked, it could potentially harm the reputation of the company and erode user trust. These companies were comparatively big and had large security teams with the resources and the skills to build their own custom solutions to protect the data of their users and make sure it was protected adequately. But the mission we have set ourselves at Borneo is to build tools that empower companies of any size to protect their customer data and improve user trust.
So while, with the move to the cloud, every company nowadays is a data company collecting large, rapidly growing sets of data, not all start-ups have the same resources, skills or tools to build their own custom security solutions. Oftentimes this leads to what we call privacy debt, where potentially sensitive data is collected but not secured adequately, and the right security measures are not put in place. This can lead to data breaches or other mismanagement of the data, which in turn causes erosion of user trust, especially if some of that data becomes public. So the solution we have built at Borneo basically allows our customers to gain real-time visibility, first of all, into their data sets. At the heart of Borneo is what we call our inspection engine. The inspection engine is capable of ingesting large amounts of data from various sources through a set of connectors, and it is able to inspect that data and detect any kind of sensitive information. We call these sensitive info types. Examples would be passport numbers, credit card or bank account numbers, but also just personal names, postal addresses, or really any other kind of personally identifiable information (PII), financial information, or healthcare- and medical-related data. Gaining visibility is oftentimes the first step to understanding where sensitive data is stored and what the right way to protect it is. So with Borneo, security teams can more easily meet their customer data privacy obligations even with limited resources.

Today I wanted to talk about one example of how Borneo was able to help a customer of ours with a particular data privacy challenge. This customer is a large Indian fintech startup. They have been around for several years, and they offer their own prepaid credit cards as well as, nowadays, digital wallets. Because they are handling such sensitive information, they have to comply with what's called the Payment Card Industry Data Security Standard, or PCI DSS for short. That is a large set of security best practices and security measures that any company that wants to process credit card data needs to comply with. Depending on the volume of your transactions, you may have to undergo annual audits to prove that you are complying with these regulations. But really, similar challenges apply to any other kind of regulation, be it the General Data Protection Regulation in Europe (the GDPR), similar laws in California (CCPA), the PDPA here in Singapore, or the upcoming regulation around sensitive data in India as well. When it comes to managing compliance with such regulations, we can break the process down into three high-level steps, which I like to call descope, de-risk and document. The first step in this process is to identify which of the systems that make up your data infrastructure actually handle the data that falls under the regulation. In the case of PCI DSS this would be cardholder data: the actual credit card number, the expiry date, the 3-digit or 4-digit verification code, and the cardholder name. Any system that stores or processes cardholder data would need to be considered in scope for PCI DSS and would have to comply with its numerous security requirements; there are about 200-plus security controls specified as part of the PCI Data Security Standard. Implementing those controls is then the second step.
So in order to mitigate the risk of handling such sensitive data, you have to implement all of these security measures to ensure that this data is stored and processed securely and is not abused. And last but not least, it's not enough to just implement the required security measures; you also have to document if and how all of these security measures, or other compensating controls, have been implemented. Especially these last two steps can be quite resource-intensive tasks, depending on the size of the cardholder data environment. So this first step of descoping is imperative, because any system that you can prove does not contain any cardholder data does not fall under the scope of PCI DSS and therefore does not require the same level of security controls to be put in place as specified in PCI DSS. Descoping is really crucial in this process. But once again, you have to be able to document to a PCI auditor, for example, that the systems you want to take out of scope really do not contain any cardholder data. Now, this Indian fintech company we were working with was already processing credit card data, so they were already compliant with PCI DSS, but they were processing that data in an on-prem system which they had built over the years. In order to solve a growing set of new use cases and to make use of all the benefits the public cloud brings, they were rapidly expanding their operations into the cloud. The challenge they were facing is that they had to be able to prove to their PCI auditors, as part of their annual audit requirements, that all of that new data infrastructure in the AWS cloud did not contain any credit card data and was therefore out of scope for PCI compliance, because otherwise they would have had to implement all of the same security measures across all of the cloud infrastructure, which would have been a very involved process. So they started by looking at their primary data stores in the cloud, which was a number of RDS MySQL instances, and used Borneo to inspect the data in these MySQL instances. Basically, Borneo will ingest data from every single table in each of these RDS instances and will inspect the data to determine what sensitive information, if any, is stored there.
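To make the inspection step a bit more concrete, here is a minimal, hypothetical sketch of column-level info-type detection. It is not Borneo's actual engine; the info-type names, the regex patterns and the Luhn check are simple stand-ins for a much richer detection pipeline.

```python
# Hypothetical sketch of column-level info-type detection (not Borneo's engine):
# sample values from a column, match them against simple patterns, and report
# the dominant info type for that column.
import re
from collections import Counter

def luhn_ok(number: str) -> bool:
    """Validate a candidate card number with the Luhn checksum."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    if not 13 <= len(digits) <= 19:
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:           # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "IP_ADDRESS":    re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
    "CREDIT_CARD":   re.compile(r"^\d[\d \-]{11,21}\d$"),
}

def classify_value(value: str) -> str:
    for info_type, pattern in PATTERNS.items():
        if pattern.match(value):
            # Card-number candidates must also pass the Luhn check.
            if info_type == "CREDIT_CARD" and not luhn_ok(value):
                continue
            return info_type
    return "UNKNOWN"

def classify_column(sampled_values: list[str]) -> str:
    """Return the most common info type across a sample of column values."""
    counts = Counter(classify_value(v) for v in sampled_values if v)
    return counts.most_common(1)[0][0] if counts else "EMPTY"

if __name__ == "__main__":
    sample = ["4111 1111 1111 1111", "5500-0000-0000-0004", "4012888888881881"]
    print(classify_column(sample))  # -> CREDIT_CARD
```

A production scanner would combine value patterns with column names, nearby keywords and statistical signals, but even this toy version shows how sampled values alone can flag a cardholder-data column.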
So basically, it will determine for each column what the info type of that column is: it could be an email address, an IP address, or a timestamp, or of course it could be a credit card number, an expiry date or even a CVV validation code. That's exactly what this team wanted to prove: that their RDS MySQL instances did not contain such data. After performing the scan, which took just a couple of days, they had all the metadata about all of the sensitive info types and were able to slice and dice it, see exactly which of these tables contained sensitive information, and in the end also produce a detailed report about the kinds of sensitive information contained in these RDS instances. These reports did show that there was no credit card data in the RDS MySQL instances, so with that they were able to go to the PCI auditors and convince them that these systems were out of scope for PCI DSS compliance.

But unfortunately not all was good, because they had also used Borneo to scan their S3 buckets. They had dozens of S3 buckets storing hundreds of thousands of terabytes of data, and unfortunately, within a few hours of starting these sample scans, Borneo had detected some credit card numbers in one of these buckets. That was a problem, because now they knew they had credit card data in at least one of their S3 buckets, and they needed to find out why and clean up that data, because none of these buckets were supposed to contain any cardholder data. When Borneo first detected credit card data, it immediately raised an alert and also filed a ticket in the customer's Jira. This ticket already contained quite a bit of context, specifically which file or set of files contained the data. It looked like these were some application logs that had ended up in S3, but the files were quite large; as you can maybe see in the screenshot, just as one example, there was a wallet log file which contained more than 4 gigabytes of data. So initially the engineering team was struggling a bit to pinpoint exactly where in this file the credit card numbers were found, since Borneo had detected some credit card numbers, enough to be a problem, but not really a large amount.

So in order to pinpoint the source of the problem, what the team did next is they again used Borneo, this time to do a much more detailed full bucket scan. Initially they had only run sample scans, which just sample a small set of data from each bucket to get a general idea of what kind of sensitive information might be included in the bucket. But now they were running a full bucket scan, and these scans generate a very detailed list of findings, with every single token that was detected, including the line number as well as some other information about the context, like keywords that were found close to the matched token that indicate the type of match, as well as column names in the case of CSV files, for example. These detailed findings helped the engineering team locate the credit card numbers in the files and then, based on the specific log entries, also determine the root cause of what was causing the credit card numbers to get logged. As you can see here in this screenshot, what had happened is that one of the systems was expecting credit card numbers to be passed as an integer, but instead it was receiving the numbers as a formatted string, and that was causing a number format exception.
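To give a feel for what those detailed, line-level findings might look like, here is a small illustrative sketch of a log-file scan. The output fields and the keyword list are assumptions rather than Borneo's actual report format, and a real scanner would also Luhn-validate candidate numbers rather than rely on a pattern alone.

```python
# Illustrative sketch of a "detailed findings" pass over a log file: report each
# card-number-looking token with its line number and nearby supporting keywords.
import re

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")
CONTEXT_KEYWORDS = ("card", "pan", "cvv", "expiry", "wallet", "payment")

def scan_log(path: str) -> list[dict]:
    findings = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line_no, line in enumerate(fh, start=1):
            for match in CARD_RE.finditer(line):
                token = match.group()
                keywords = [k for k in CONTEXT_KEYWORDS if k in line.lower()]
                findings.append({
                    "line": line_no,
                    # Mask all but the last four digits in the report itself.
                    "token": "*" * 12 + re.sub(r"\D", "", token)[-4:],
                    "context_keywords": keywords,
                })
    return findings

if __name__ == "__main__":
    # Tiny self-contained demo: write one sample log line, then scan it.
    sample = "ERROR NumberFormatException: card 4111 1111 1111 1111 in wallet request\n"
    with open("sample_wallet.log", "w", encoding="utf-8") as out:
        out.write(sample)
    for finding in scan_log("sample_wallet.log"):
        print(finding)
```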
Looking at the screenshot, it seems this was some error message from an ORM system or something like that. The engineering team was able to pinpoint that and suppress these kinds of log messages going forward, so that from then on these logs would no longer contain credit card data. All of this work had actually been done as part of a trial that this company was running with Borneo. In total it took about three weeks, and the main outcome for our customer was that they were able to fast-track their PCI compliance: Borneo was able to generate these reports and detailed findings within days, whereas otherwise it might have taken the team weeks to produce the documentation required to convince their PCI auditors to take their AWS cloud infrastructure out of scope for PCI compliance. Since then, the team has started using Borneo's commercial version and is regularly using Borneo as a general privacy observability tool to monitor their data environment in the cloud. They have by now established a baseline, so they have a good idea of what information is expected in every single data store, and whenever Borneo finds any sensitive information that does not match that established baseline, Borneo will generate an alert, either sending it to them in Slack or raising a ticket in Jira. With very minimal effort the security team can stay on top of their data security even as the application engineering teams are adding new resources; Borneo will automatically detect new resources and new data stores as they are being added and automatically start monitoring them.

Going forward, what this team is now thinking about is how they can take this privacy observability and apply it to their entire application development lifecycle. Instead of detecting issues where sensitive data is ending up in logs only once it hits production, why not scan the logs of pre-production systems, like a staging or QA environment or even development environments, and look for such sensitive data there, so that issues can be detected early and remediated before they become a production issue? We like to call this application data privacy management, and it's just one of a suite of solutions that Borneo offers: from privacy observability to solutions for data investigations (for example, once you have detected a breach, understanding its impact), as well as PCI, GDPR and CCPA compliance solutions and next-gen DLP. In addition to monitoring your cloud infrastructure, Borneo also has a set of connectors for enterprise applications like Slack, Jira or Google Drive, so Borneo can also monitor data being exchanged through these enterprise applications and look for sensitive data as well as application secrets or passwords, for example, and alert either the security team or the compliance officers, depending on the specific use case. I hope this gave you a good idea of what it takes to achieve PCI compliance at a very high level and how Borneo can help with this as well as with other data security challenges. So thanks all for listening, and I'm happy to answer any questions you might have about Borneo or about the tech stack. Also, we are hiring for our team in India, so if any of you are looking for opportunities to work with large data sets and an exciting set of technical challenges, feel free to reach out. Thanks, bye.
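As a side note, the shift-left idea of scanning pre-production logs before issues ever reach production could be wired into a build pipeline with something as simple as the sketch below. The directory layout, file pattern and exit behaviour are hypothetical; this is not a Borneo feature or API.

```python
# Hypothetical CI gate: fail the build if staging/QA logs contain
# card-number-looking tokens. Paths and patterns are illustrative only.
import re
import sys
from pathlib import Path

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")

def scan_directory(log_dir: str) -> list[tuple[str, int]]:
    """Return (file, line number) pairs where a candidate card number appears."""
    hits = []
    for path in Path(log_dir).rglob("*.log"):
        with path.open(encoding="utf-8", errors="replace") as fh:
            for line_no, line in enumerate(fh, start=1):
                if CARD_RE.search(line):
                    hits.append((str(path), line_no))
    return hits

if __name__ == "__main__":
    findings = scan_directory(sys.argv[1] if len(sys.argv) > 1 else "./staging-logs")
    for file_name, line_no in findings:
        print(f"possible cardholder data: {file_name}:{line_no}")
    # A non-zero exit makes the CI job fail, so the issue is fixed before production.
    sys.exit(1 if findings else 0)
```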
Alright, thanks a lot for that talk, Jan. For everyone who is listening in, we have the man himself, Jan, here to field any questions we may have from the audience. Before we quickly get started, first up, Jan: this looks really amazing. Having worked with multiple clients and having been down in the trenches myself, I can attest to how big a concern data privacy scanning and continuous security scanning is for most of the industry, so this looks amazingly well done, and something that I think is very plug and play. While we are waiting for questions from the audience, I have a few curious questions of my own as a practitioner. One of them is this: as you said, we can do sample scanning of some of the data, and we can run full-length scans as well. The question is that usually, when teams try to move fast in a typical DevSecOps model, there are new fields that keep getting added; teams sometimes maintain a separate data catalog and may keep adding newer fields. So how does Borneo get updated or get wind of these new fields? Does it get connected to a data catalog, does that data catalog have to be pushed to Borneo, does it have a catalog of its own, is it a rule-based engine? How does that work? If you could tell us in a little more detail.

Sure, yeah, thanks for the question, Rishu, and great to be here talking in this forum. As you rightly mentioned, with modern companies having lots of data in the cloud, new data sources are being added and tables are being extended, and security teams have to stay on top of all of that. Typically the application development team and the product team are much bigger than the security organization, product requirements keep changing, and due to that the data environment keeps changing as well. Security teams are used to being very reactive, just reacting to problems as they come up. So what we are really trying to do is give them tools to stay on top of that, to get good visibility into all the data that the company collects. Coming to the technical details: since data sizes in the cloud are just enormous, especially the rate of change, if you just look at some transactional tables where a company's millions of users are transacting all the time, it's kind of impossible to do a full scan continuously on all of the data, even more so keeping the data sizes in S3 in mind. So we've focused on providing a good trade-off between gaining enough visibility and understanding of the data sets while also being cost-efficient and resource-efficient. We do a lot of intelligent sampling, so we're not scanning every single record that gets updated. Talking about structured data especially, your typical MySQL database, or it could be a Snowflake table as well, or any kind of structured data, we do regular sample scanning on those tables, and we do get the table schema; when that changes, we can detect it during the regular sample scanning, so we can detect, for example, that a new column was added to an existing table. Then we do the next step, the deep data inspection. We're not just looking at the metadata; the column name in some cases might already be indicative of what kind of data might be stored in the column, but increasingly you also have a mix of structured and unstructured data. You might have a MySQL database but with a JSONB column in there, so you have a rich data set inside a single value in that column. Just looking at the column metadata, like the column name, might not be sufficient; it might just say
"event data" or something like that, which doesn't really tell you what kind of data is actually contained in there. So we actually look at the data itself: we run it through our inspection engine and we discover what the actual data is. Is it just a timestamp, is it maybe user information like an email address, or something possibly even more sensitive? And for each table we keep a schema, not just the actual table schema but the schema that we have detected in depth based on our inspection engine. So when new columns are added and we detect that a new column, for example, contains sensitive information that was not previously existing in the table, that is not part of the table's baseline, then we will go ahead and raise an incident and notify the security team, either sending it over via Slack, raising a ticket in Jira, or just sending it into their Splunk or other SIEM system, so that the security team can then react to it. The second part of your question was about integration with a data catalog. We actually recently launched an integration with DataHub, the open-source data catalog originally from LinkedIn, which now has a couple of commercial vendors behind it. What we do there is send that same kind of information about the detected info types into the data catalog. For example, if in your data catalog you have an entry for a particular table and we detect that one of the columns in the table contains bank account numbers, Borneo will then make a suggestion in the data catalog saying this table, or this column, seems to contain bank account numbers. Then it's up to the data custodian, the team that owns that data, to confirm whether that's an accurate assessment or an incorrect one, maybe Borneo has mis-detected something, and that information will again flow back to Borneo, so we get feedback on whether a detection Borneo has made is correct or not, based on the judgment of the data owner.

Got it, thanks for that. Another question that just came to mind: as you said, structured data is a slightly different thing compared to unstructured data in S3 buckets, and you also talked about how expensive full table scans can get. Is there a way that Borneo can actually do incremental snapshot scans? For example, let's say I have a huge database, it could be MySQL, where I'm taking incremental snapshots; can Borneo go through the incremental snapshots and scan them on, let's say, a nightly basis when the snapshots are taken, and quantify them, saying "I'm going to gate this because the snapshot is not clean enough to go into the data store"? Is there any such thing?

We don't operate on snapshots at the moment. In the case of MySQL or other relational databases, we sample data directly out of the table itself. It really depends a bit on the connector; for example, we do have support for S3 to detect new files or new objects being added into an S3 bucket and to focus on scanning the new objects that are being added. But even there, for some of our big customers the rate of change, the number of objects being added to S3 buckets, can still be so huge that it's infeasible or not cost-effective to scan every single file, so even then we still have to apply sampling on top of that and maybe only look at 10% of all new objects being added into a bucket.
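The per-table baseline described above could, in spirit, work roughly like the following sketch. The info-type names, the baseline structure and the alerting hook are assumptions for illustration, not Borneo's API.

```python
# Rough sketch of the baseline idea: compare newly detected info types per column
# against a stored baseline and flag anything sensitive that is new or changed.
SENSITIVE = {"CREDIT_CARD", "BANK_ACCOUNT", "EMAIL_ADDRESS", "PASSPORT_NUMBER"}

def diff_against_baseline(
    baseline: dict[str, str],   # column name -> previously confirmed info type
    detected: dict[str, str],   # column name -> info type from the latest scan
) -> list[str]:
    """Return human-readable alerts for new or changed sensitive columns."""
    alerts = []
    for column, info_type in detected.items():
        if info_type in SENSITIVE and baseline.get(column) != info_type:
            alerts.append(f"column '{column}' now appears to contain {info_type}")
    return alerts

if __name__ == "__main__":
    baseline = {"user_email": "EMAIL_ADDRESS", "created_at": "TIMESTAMP"}
    detected = {"user_email": "EMAIL_ADDRESS",
                "created_at": "TIMESTAMP",
                "payload": "CREDIT_CARD"}   # new column carrying sensitive data
    for alert in diff_against_baseline(baseline, detected):
        # In practice this is where a Slack message or Jira ticket would be raised.
        print("ALERT:", alert)
```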
That makes sense. And coming to the rate of change: over the years, what used to be security testing, penetration testing, a lot of the security validations, have been moving more and more left into the DevSecOps pipeline. For example, there are now CI/CD pipelines that will run SonarQube or Checkmarx or some other code scanner and simply flag the code at the pull-request level itself, saying "wait, we are looking at something that could be a potential threat" based on certain signatures. Now, that is code. So how much of this shift-left can Borneo introduce as of today? That's one question. And what is the Borneo team targeting in terms of data moving forward? I understand code and data are two very different things, because code is still a deployable artifact while data is more of a dynamic entity, it keeps flowing in. I just want to understand how much of that shift-left is possible today versus what the Borneo team is targeting.

Sure, yeah, great question. So yes, we are focused mainly on data, so we don't do static code analysis or anything like that at the moment, but we really want to help the security team understand where their data privacy risk lies. In terms of shift-left, what we currently do, or could do, is basically look at data flows before those applications reach the production stage. One proof of concept we've done in the past is looking at API data, basically data that is being exchanged by an API, either data sent to the API, for example a user's browser submitting some information to a web service API. We can intercept that data and inspect it to see what kind of data is being submitted, and then detect over time if there are any changes, if the API suddenly starts collecting new user data that was not previously being collected. And conversely, also in the reverse direction: what data is being transmitted by that API back out to the user, or it might be between two different systems. There are several ways you could go about doing that inspection. You could have an SDK running inside the application; think of something like New Relic, which has an agent running inside your application and monitors its performance. Similarly, we can intercept API calls inside the application and just monitor the data. Another option would be to have something at the network level to intercept the packets between the client and the application and do the inspection at that level. It kind of depends on how far left you want to go, so this could be something running in an automated CI environment where you do automatic capture of the network traffic and analyze that. Yet another option would be to look at the log files that are getting generated, to see if there's any sensitive data in those application logs. So there are various approaches; we're kind of in the evaluation stage right now, talking to a lot of customers, trying to find out where their most pressing needs are and trying to find a solution that would work across a large part of our customer base.
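As a rough illustration of the in-application interception idea, here is a minimal WSGI middleware sketch. The framework choice and the reporting hook are assumptions; this is not a Borneo SDK.

```python
# Hypothetical in-application payload inspection, in the spirit of the
# API-interception proof of concept described above.
import re
from io import BytesIO

CARD_RE = re.compile(rb"\b(?:\d[ -]?){13,19}\b")

class PayloadInspector:
    """WSGI middleware that flags card-number-looking tokens in request bodies."""

    def __init__(self, app, report=print):
        self.app = app          # the wrapped WSGI application
        self.report = report    # where findings go (Slack, Jira, stdout, ...)

    def __call__(self, environ, start_response):
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length)
        if CARD_RE.search(body):
            self.report({
                "path": environ.get("PATH_INFO"),
                "finding": "possible card number in request body",
            })
        # Replay the body so the wrapped application can still read it.
        environ["wsgi.input"] = BytesIO(body)
        return self.app(environ, start_response)

def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

# In a staging or CI environment, any WSGI app could be wrapped like this:
# app = PayloadInspector(demo_app)
```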
Sounds great, thanks for that, and to be honest, I'll be keeping an eye out for future releases from the Borneo team. I think we don't have a lot of questions from the audience, so we'll quickly apply the finishing touches here. Yan, once again, thanks; it's been hugely educational and hugely informative knowing what the team at Borneo is building, and as we talked about before, it definitely solves a very, very real problem, something that at least I have seen almost on a daily basis. And for folks who are tuning in, or who are going to be watching this once it goes live: as we said, this is one of the first talks in PRAM, but we have a lot more incoming. In the next one month we're going to have three talks from RazorPay, and the details for those are going to be shared on the Hasgeek social media platforms as well as our website, hasgeek.com, really soon. So if you haven't already, just keep an eye on those; we'll be looking forward to catching some of those talks and hearing from you as well. And with that, I think we're going to call it a wrap. Once again, thanks Yan, thanks everyone, hope everybody has a great weekend. Thanks a lot, bye.