Hello everyone. Today I'll be talking about masking sensitive data in logs with Logstash. I'm Ayush, a DevSecOps engineer at Aposite Coazer. So let's begin.

What will we be covering today? We'll talk about what sensitive information is, the problem statement we had in hand, and the solution we came up with, along with a demonstration. We'll conclude with the impact of the solution and any questions you might have during the session.

So what is sensitive information? The obvious things that come to mind are passwords and credentials, and that's true. But sensitivity is also context dependent: other information, such as PII or other fields describing end users, could be considered sensitive depending on the context. For example, a company could consider someone's phone number, which is PII, to be sensitive, depending on a law it has to follow, an internal regulation, and so on. So a whole range of things can count as sensitive information based on context.

So what was the problem statement? Our client, a B2C internet unicorn, had an old application they had been working on for a long time, and it was already logging some PII. The developers had access to those logs because they obviously need them for debugging and development. The issue they came to us with was that they wanted this PII removed from the logs. And why did they want the PII removed?
Obviously, the developers don't really need the exact information about the end users of this client's company; it's privileged information. The developers need access to the logs, but not to the exact details about those people. And this was in production, not just staging data, so we really didn't want the developers to have access to client data from the production database.

What we came up with was performing data masking in the logs, and we did it with Logstash. Logstash is a component of the popular ELK data ingestion stack: Elasticsearch, Logstash, and Kibana. A Logstash pipeline is configured in three segments: input, filter, and output. The input defines the sources that send logs to Logstash, and the output defines where Logstash emits events to; that could be Elasticsearch, the terminal console, or a number of other destinations. The interesting bit is the filter section. That's where we wrote the configuration that selectively figured out which fields to mask, if they existed in the log entry at all, and then masked that particular data.

This is the Logstash configuration I've written for the purpose of this talk; it's a proof of concept. We have a source log entry, which arrives at Logstash as stringified JSON. We parse it into two additional fields in the event, parsed_json_2 and parsed_json_3. I've parsed it twice deliberately, masking only one copy, so you can see the difference between the masked and unmasked versions. So this is the configuration; I'll quickly walk you through a demonstration of the whole setup.
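The exact pipeline file isn't reproduced in this transcript, so here is a rough sketch of what a configuration like the one described might look like. The field names, the sensitive key `phone`, the Beats port, and the placeholder text are all my own assumptions for illustration, not the client's actual schema:

```conf
# Illustrative Logstash pipeline, not the exact config from the talk.
# Assumes events arrive from Beats and the sensitive key is "phone".
input {
  beats { port => 5044 }
}

filter {
  # "message" holds the stringified JSON payload; parse it twice so we
  # can keep one unmasked copy (for the demo) and one masked copy.
  json { source => "message" target => "parsed_json_2" }
  json { source => "message" target => "parsed_json_3" }

  # Mask the field only if it exists in this particular event.
  if [parsed_json_3][phone] {
    mutate {
      replace => { "[parsed_json_3][phone]" => "****MASKED DATA****" }
    }
  }
}

output {
  stdout { codec => rubydebug }   # or an elasticsearch {} output
}
```

The conditional around `mutate` is what makes the masking selective: events without the sensitive key pass through untouched.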
So this is the same configuration I just showed you in the slides; we'll quickly move over to Logstash itself. Logstash is currently running and waiting for logs to be sent to it. And this is the Filebeat config; Filebeat is how we'll be shipping the logs to Logstash, and this is the particular file, if you look at the path. We'll start Filebeat now, and lastly I'll go to the file and start adding logs.

Here are the two log entries we'll talk about. Both are JSON, and each has a key called message whose value is another stringified JSON. The first log is perfectly fine: it has a normal data key and value that we don't consider sensitive, so ideally we would not mask it. The second is the log we consider to contain sensitive information, a piece of PII. That particular key, if it exists, is what we want to mask in our setup with the configuration we have.

So I'll save this, and it should show up in Logstash fairly quickly. Yeah, there it is. Let me scroll to the first one. As you can see, the message field contains the entire log entry as another stringified JSON; that's how it arrives in Logstash. First I convert it into an actual JSON object under parsed_json, which you can see here: it holds the actual value that was stringified inside the message in the log file. Then I parse it again into parsed_json_2 and parsed_json_3. Both of these are identical here, with the same keys and the same unmasked plain-text values, because per our condition we didn't need to mask this entry; the condition specifies exactly which data to mask. So let me scroll down to the next one. Okay, here we have the other entry. Again, same structure: we have the JSONified entry.
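For reference, the Filebeat side of a setup like this is small. A minimal filebeat.yml sketch, assuming the demo log file sits at /tmp/demo.log and Logstash listens on the default Beats port (both assumptions; the real path was only shown on screen):

```conf
# Hypothetical filebeat.yml for the demo.
filebeat.inputs:
  - type: log
    paths:
      - /tmp/demo.log        # assumed location of the demo log file

output.logstash:
  hosts: ["localhost:5044"]  # the Logstash Beats input
```

Filebeat tails the file, so appending a line to it is enough to push a new event through the pipeline.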
We parsed it into parsed_json, and then again into parsed_json_2 and parsed_json_3. As you can see, parsed_json_2, for which we perform no masking, has the plain-text value along with the key, whereas in parsed_json_3 the data is masked. I've simply replaced the value with a placeholder I came up with: a bunch of stars and the words "masked data". The log structure remains the same. I will obviously also need to remove the message field, because it still contains the original value, but the point is that the value is masked now, and I can remove any field I don't want in this particular log. What you see is just the default rendering of the output. So the data is masked at this point, and this is the log entry the developers would see when we send it on to them, not the source entry.

So I'll go to the next slide. This is the end result we talked about and saw in the demo: on one side is the unmasked data, which would have been in the developers' hands if we hadn't masked it, and on the other is what they actually have access to now that the data is masked.

Coming to the impact of the solution: the first and foremost thing we were happy about was that the developers didn't need to refactor anything. We were able to solve for privacy without them having to do any additional work. We just configured the ops side of things: the logs come to Logstash, we do the data masking there, and then we send them to Elasticsearch plus Kibana, where the developers have a UI to look at the logs. So they were effectively cut off from the source logs. The next point: can we extend it to other applications in the ecosystem?
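The filter's behavior can also be mirrored in a few lines of plain Python, which is handy for reasoning about it outside Logstash. This is a sketch under my own assumptions: the sensitive key name, the placeholder text, and the field names are illustrative, not the client's actual schema.

```python
import json

SENSITIVE_KEYS = {"phone"}          # assumption: the keys we deem sensitive
MASK = "****MASKED DATA****"        # placeholder, as in the demo

def mask_entry(raw_line: str) -> dict:
    """Parse a log line whose 'message' field is stringified JSON,
    mask sensitive values, and drop the raw message so the original
    value cannot leak through."""
    entry = json.loads(raw_line)
    payload = json.loads(entry.pop("message"))
    entry["parsed_json_masked"] = {
        k: (MASK if k in SENSITIVE_KEYS else v) for k, v in payload.items()
    }
    return entry

# A log entry like the second one in the demo: JSON whose "message"
# value is itself stringified JSON containing a sensitive "phone" field.
line = json.dumps({"message": json.dumps({"phone": "555-0100", "data": "ok"})})
print(mask_entry(line))
```

The same shape of check, mask only when a specific key exists, is what makes the approach easy to extend to other applications: supporting a new one is mostly a matter of growing the set of sensitive keys.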
So depending on the application, different entities could be considered sensitive, as we talked about in the first few slides. Now that we have a POC configuration and the ability to check for the specific entities we want to mask, rather than blindly masking everything, and since it's fairly simple (it's not some elaborate regex matching), this was easily extendable to other applications in their ecosystem as well.

The biggest plus was that the developers were able to keep doing their jobs while the problem was solved, without us ever being a blocker for them. They continued their work, we solved for security and privacy, and nobody was unhappy or blocked from achieving their goals.

Coming to some probable questions. The obvious one: why would we log sensitive information in the first place? In an ideal case, we definitely shouldn't log information we consider sensitive; we wouldn't do it for passwords, for instance. But in this particular case, it was already being logged, and the ideal scenario of not logging it would have taken much longer to achieve. That would have been the perfect goal, but a more tedious one. So, on the principle that good enough is good enough, we went with the data masking and some access control, so that developers get access to what they need and we protect what we want to protect. The logging of the sensitive information itself can be tackled eventually; now that we've solved this problem, it's no longer an immediate security or privacy issue, and the application can be refactored over time to not log that information in the first place.
So that can be an over-time goal, which would obviously take longer than just masking the data in the logs as we did.

The other question: the original logs still contain the PII, so how does this solution address the original logs? As we saw, one source will definitely still contain the PII. The way we solved this particular issue was through access control. The original logs are now cut off from the developers: the computes the applications run on send logs to Logstash, and the developers don't have access to those computes either. The logs come to Logstash, get manipulated there, and are sent on to Elasticsearch and Kibana for the UI, and that's the only place the developers have access to. That's where they get the logs they use for debugging and anything else they need. Again, the ideal solution would be to not log sensitive information in the first place, but for now the combination of access control and data masking solves the problem: even though the original logs still contain the PII, the developers don't have access to those original source logs.

So, in conclusion: information can become sensitive based on context. Depending on the context you derive from, say, company regulations or laws, different kinds of information can be considered sensitive. Then we talked about Logstash, a tool we can use to mutate and filter logs before we send them on to wherever they need to go.
Most popularly that destination is Elasticsearch, but it could be any place at all; Logstash is our go-to tool for mutating and changing log entries. The solution we came up with enabled the developers to continue their work while we, the security folks, ensured privacy. The aim, ideally, is not to be a blocker; we want every team to be able to do its job without any team blocking another. So thank you, folks, for listening to my talk. I'm Ayush, and if you have any further questions or want to connect with me, you can reach me on these social accounts. Thanks.

Hey guys, I guess that's the end of all the talks, so I'll just wrap things up. A few questions were answered, and I think the other speakers extended on them as well. A couple of questions asked the last time I gave this talk concerned scalability and ease of implementation, which I think I've addressed now. As for scale: Logstash is one of those applications that requires a lot of compute and resources when you deploy it, so as you scale, it does get a little resource intensive. And you can extend it further, making it more complex by implementing regex patterns. But for the most part, and in what I demonstrated in the talk, the payload was a JSON entity: if you know the exact keys, or the placement of those dictionary keys in the log entry, you can simplify the process a lot. Another benefit of using Logstash is that it's a central ingestion point, so you can have multiple applications sending logs to Logstash, which then end up in Elasticsearch or wherever you want to send them.
That gives you configurability in a central place that extends to almost all of your applications, files, systems, or anything else that can send logs to Logstash, which is a great thing. But to reiterate: I don't believe the Logstash approach is the final or perfect solution. It's a solution that bridges the gap towards eventual code refactoring, where you stop logging that sensitive information, because at the end of the day, even if you're masking the data, it still exists somewhere, and some people will always have access to it. If you refactor, fewer people have access to that data. For example, a database admin will always have access to the production database, but with Logstash added on top, some ops people might as well. And if you can refactor the code, you also reduce your internal attack surface by keeping the data in just one central place rather than several. Again, Logstash is a great tool for performing these masking operations and mutations on the logs you want, and it might be fine for a while, but the eventual goal should always be the proper, optimal solution: making sure your applications don't log the sensitive information or PII in the first place.

There were some questions asked, so I'll address them quickly. What are some major integrations with Logstash? Elasticsearch is one of the most popular, but there are others that can take or ingest logs sent from Logstash as well; I don't remember all of them right now, but you can definitely find them with a quick search. I was also asked whether I'm integrating this with some SaaS or IT software, and primarily which ones.
So we used the managed Elasticsearch cluster that AWS provides, which is now OpenSearch. The Logstash instance was deployed and maintained by us, along with all the configurations, while the Elasticsearch part was a managed cluster maintained by AWS.

Okay, there's one more question: do you use cryptography in the masking process, or just hide the data in the developer interface? Cool. So again, there's no cryptography here. We just replace the value we consider or deem sensitive with a placeholder, which, as my demo showed, could be a bunch of stars and the words "masked data", or anything at all; it could be a generic phone number, for instance. There was no cryptography involved because the place these masked logs end up is Elasticsearch, and the developers had access to those Elasticsearch clusters and log groups, so they would only ever get the masked data. There was no way for them to recover the original data, because that gap was built intentionally into the process: they get the logs only through Elasticsearch. So there was no need to encrypt those values or add processing steps that would basically make the masking process more complex. We could certainly mask the values and store, say, a hash or the encrypted value of the sensitive information instead, but that's unnecessary here; it just uses more computing power when we can achieve the same goal by replacing the value with a generic one. So in this particular case, no cryptography was used. That said, people can argue there are cases where you would use cryptography, and I would definitely agree; that's why certain plugins and hashing filters exist for Logstash.
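For the cases where you do want more than a static placeholder, here is a hedged sketch of what a keyed-hash variant could look like. This is my own illustration, not what we deployed; the secret key is hypothetical. A keyed hash hides the plaintext but stays deterministic, so the same user still correlates across log lines:

```python
import hashlib
import hmac

SECRET = b"rotate-this-secret"  # hypothetical key, held outside the log pipeline

def mask_with_hmac(value: str) -> str:
    """Replace a sensitive value with a keyed hash: unreadable without the
    secret, but deterministic, so equal inputs produce equal masks."""
    digest = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return "masked:" + digest[:16]

print(mask_with_hmac("555-0100"))
```

Compared with a plain placeholder this costs a little extra compute, which is exactly the trade-off mentioned above; it only pays off when your use case needs masked values to remain correlatable.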
So you can definitely encrypt it if your use case allows for it or creates a demand for it, but it's a contextual thing, depending on your use case.

Hi, thanks, Ayush, it was a good session. I'm Rohan from Zyudha Tech. In our setup, we collect logs and ingest them into a pretty much standard, out-of-the-box ELK setup with some minor configuration changes for scaling up. Due to our regulatory obligations, we need specific retention policies for storing logs. These include access logs as well as application logs, usually ingested from multiple instances of our internal apps. We apply masking to critically sensitive data such as application access tokens, which can be used to gain temporary access to the system. For reasons such as regular audits by regulator-appointed system auditors, we are required to store logs without masking certain personally identifiable information or user identifiers. This is needed to provide audit trails for the auditor to review, as well as to handle and respond to user complaints, for example to the exchanges. We therefore store data as per the regulations, but we mask it on the UI as well as at the application level, for example in our API responses, which are then displayed to the outside world. I think it is indeed a sensible choice to follow your suggestions; as you mentioned, it is not that difficult to implement, and it should be adopted as a best practice by other organizations as well. Thanks again for this informative session. Thank you.