So today we're here to talk about 411, a framework for managing security alerts, which we will be open sourcing after DEF CON. So before we get started, let's do introductions. My name is Kai Zhong and I am a product security engineer at Etsy. I'm responsible for helping developers with writing secure code and maintaining some of the internal applications that we use on the security team, like 411. And on occasion I've been known to wear many hats, like you see in that photo. And after this presentation, I'll be tweeting out links to the slides on my Twitter, so follow me, please. Got to get those followers. Sorry, I'm supposed to make a really, really bad pun here. Hopefully you won't find our presentation to be unbearable. Yes, you groaned. Thanks, Kai. My name is Ken Lee. I'm a senior product security engineer at Etsy. I'm glad to be back at DEF CON; I was here three years ago for a presentation on content security policy. And two important facts about me: one, my Twitter handle is KennySan, and two, I really love funny cat gifs, so I've managed to sneak one into the slide deck. For those that don't know, this adorable cat is Maru. So let me go ahead and start by explaining what Etsy is. Etsy is a marketplace for handmade and vintage goods. The security team at Etsy is responsible for keeping members' personal information, such as credit card details, their addresses, et cetera, private. In addition, the Etsy security team has been successfully running our own bug bounty program for the past four years as well. I'm going to go into some more detail about what we're covering in today's presentation. First, we're going to start by talking a little bit about the history of our transition to using ELK. We're going to delve into some of the problems that we encountered during this transition process, and we're going to talk about our solution, which we call 411. Then we're going to dive into how we at Etsy do alert management using 411.
We're going to show you some additional, more involved examples, and we're going to finish things off with a non-live demo. I know, I really wanted the live demo, but I never trust the demo gods to get it right. First, we're going to go over some terminology. For some of you, this may be old news, but we're going to try and get through this as quickly as possible. So for those that don't know, this is a log file. Logs are typically interesting messages generated by a web server that are stored in a log file. This is the ELK stack. The ELK stack consists of three different technologies: Elasticsearch, Logstash, and Kibana. And I'm going to quickly go over what each of these different applications does. The first, as represented by our friendly mustachioed log over here, is called Logstash. Logstash is our data processing and log shipping tool. We primarily use it as a way to identify interesting fields that we would want to perform searches on in the future. In addition, we also use Logstash to ship logs into Elasticsearch proper. What is Elasticsearch? Great question, me. Elasticsearch is the distributed real-time search engine created by Elastic. It allows for storing complex nested documents, but in this case, we primarily use Elasticsearch for storing log files parsed by Logstash. In addition, Elasticsearch allows for generating statistics on your data, so you can run interesting aggregations over the information that you have stored in Elasticsearch, which lends itself very well to analysis of that data. Finally, the K in ELK stands for Kibana, and that's the data visualization web application front-end for Elasticsearch. Kibana allows for log discovery and, more importantly, debugging of problems in your application. And in addition, Kibana provides some interesting visualization options. Unfortunately, this was the best stock image that I could find of Kibana to show you what it does.
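To make the Logstash step concrete, here is a minimal sketch of the kind of pipeline described above. This is illustrative only, not Etsy's actual configuration: the file path, field names, and grok pattern are all hypothetical.

```
# Hypothetical Logstash pipeline: parse a web access log into named
# fields and ship the result into Elasticsearch.
input {
  file { path => "/var/log/httpd/access.log" }
}
filter {
  grok {
    # Pull out the interesting fields we may want to search on later.
    match => { "message" => "%{IP:client_ip} %{WORD:verb} %{URIPATH:path} %{NUMBER:status}" }
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```

Once logs flow through a filter like this, the extracted fields (`client_ip`, `status`, and so on) become queryable in Elasticsearch and visible in Kibana.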
You can do interesting pie charts, graphs, etc., using Kibana as a front-end. So now, let's talk a little bit more about the history of how we transitioned to using ELK. Etsy switched to using the ELK stack back in mid-2014 from Splunk, and the work took about a year. Throughout this process, we learned a lot of good lessons from the migration and we got a bunch of great tools out of it, including 411. But it wasn't a super easy road to go down. We were aware of the fact that we were going to run into issues when we started transitioning to ELK, and we had to deal with a fair share of really annoying performance-impacting bugs with our ELK cluster. In addition, the security team was concerned about the usability of ELK as a solution for doing some of our alerting and monitoring. To give an example of one of these bugs, here we have two articles, one from September of 2014 and the other from April of 2015. That's a span of about six or so months. Basically, these articles illustrate the discovery of a bug with Samsung's line of solid-state drives, with the fix acknowledged and coming out about six-plus months later. Unfortunately for us, our ELK cluster used these SSDs, and so we were affected by this read performance bug for more than six months. In addition, this is just a small snippet from an email: we had an issue with a kernel-level bug affecting how it was handling NFS mounts. This caused a lot of instability with our ELK cluster and, unfortunately, some additional outage downtime as well. So to say the least, these are just two example bugs that we had to encounter. At times, it felt like we were riding the struggle bus with regards to all of the bugs and issues that we had to deal with in ELK. But that aside, Kai's now going to talk to you about some of the actual problems, not just bugs, that we encountered when migrating to ELK. Thank you, Ken.
So like most security organizations, alerting is a major part of how the security team at Etsy knows what is going on on the site. Some mechanisms that we use, or used to use, for alerting are Splunk, StatsD, and Graphite. Unfortunately, when we first started this migration, we were making use of Splunk's saved searches to automatically schedule queries to run on some sort of periodic interval, and Elasticsearch didn't offer equivalent functionality at that time. Additionally, Elasticsearch also didn't offer any sort of web UI for managing those queries that we were writing, which is pretty useful when, say, it's the middle of the weekend, you're getting spammed with alerts, and you need to make a change to one of the queries. Doing so would require a code push, and you don't want to break something. With some sort of web UI where everything is handled for you, you could just go in there, change the query, update it, and you're good to go. Now, the second problem was that we were just not familiar with the new query language that we were faced with. Our old queries were built using SPL, which is the language that Splunk uses, and some of the functionality that we needed in order to write our queries simply wasn't available in Elasticsearch's Lucene shorthand. Additionally, there were some things that weren't obvious coming from Splunk: in particular, how Elasticsearch indexes documents has an effect on whether and how you can query the actual fields that you're searching on. So this came as a surprise to us at certain points. And because of these issues, the road to ELK integration was a long one. In order to successfully complete the migration, we needed three things. First, we needed a query language that would allow us to build complex queries without having to write any code. We also needed a mechanism to run these queries and email us with the results.
And finally, we wanted to have all of this ready before we turned off Splunk, because we'd be flying dark otherwise, and that would be really bad. All right. So as it turns out, the first half of the solution was provided to us by the data engineering team at Etsy, and that solution is called ESQuery. What it is, is a superset of the standard Lucene shorthand. It's syntactically pretty similar to SPL, so it's got pipes everywhere that let you take data from one stage and feed it into the next. I'll provide an example in a bit. But more importantly, it supports all the functionality that we need. So here's a quick summary of the syntax. When you define an Elasticsearch query, you do it via this large JSON DSL, and ESQuery provides the ability to inline all of these options directly into the query. So you can see over here, you can specify, say, the size, or how you're sorting the results that come back, or just what fields come back. Additionally, you can do an emulated join, so you can take results from one query and then insert them into a subsequent query. All the aggregation functionality that is available in Elasticsearch is also available in ESQuery, but inline. And finally, you can also define variables within ESQuery, configure them in 411, and have those variables get substituted into your queries at runtime. So you can have a list of values that you can update independently of the queries. Here's an example SPL query. What this is doing is finding all failed login attempts and then giving you the top 10 IP addresses that made attempts. This is the same query, but written using Elasticsearch's DSL. And finally, this is the same query written using ESQuery. You can see it's pretty similar to how you would write it using SPL, and way shorter as well. The two are actually similar enough that someone at Etsy was able to write a simple query translator, which we made use of during our migration.
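Since the slides aren't reproduced in this transcript, here is a rough reconstruction of the middle example only: the failed-logins query expressed in Elasticsearch's JSON DSL. The field names (`event_type`, `src_ip`) are assumptions for illustration, not the actual schema shown in the talk.

```json
{
  "size": 0,
  "query": { "term": { "event_type": "login_failed" } },
  "aggs": {
    "top_ips": {
      "terms": { "field": "src_ip", "size": 10 }
    }
  }
}
```

The ESQuery version collapses a document like this into a single piped one-liner, which is why it ends up so much shorter than the raw DSL.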
So what we did was we would just plug a query in, test it out, make changes if necessary, and then stick it into 411. Speaking of which, next up, let's talk about what 411 is. 411 is an alert management application, and what it does is allow you to write queries that get automatically executed on some sort of schedule. You can then configure it to email you alerts whenever the data sources that you're querying return any results. And additionally, you can manage the alerts that are generated through the web interface. Before we dive into 411, let's talk briefly about how scheduling works within the system. Whenever a search job is run, it executes a query against a data source and then generates an alert for every single result that comes back. You can then configure a series of filters on those alerts to reduce or modify the stream somehow, and then finally specify a list of targets that the remaining alerts get sent to. An example of one target that is pretty neat is the JIRA target, which allows you to generate a ticket for every single alert that goes through the pipeline. Additionally, if we take a step back, there's a scheduler that runs periodically and generates those search jobs, which then get fed off to a bunch of workers that actually execute them. And now we're ready to get into 411. So the first thing you'll see when you log on is the dashboard, which is this thing over here. It's pretty simple, but you can see there's some useful information about the current status of 411. There's a breakdown of alerts that are currently active, as well as a histogram of alerts that have come in over the last few days. All right, moving on. One of the most important things you'll want to do in 411 is manage the queries that you are scheduling to execute, and you do that via the search management page, which you can see here.
In the center, you've got all the searches listed out with some categorization information. And on the right, you can see the health of each particular search: whether or not it's been running correctly and whether or not it's enabled to execute. Now, if you want to modify an individual search, you'll get taken to this page over here, which has a whole slew of options that you can configure. There's a title, which is not too exciting. But more importantly, there are all of these fields, so let's go through them briefly. At the top here is the query, which is quite simply the query that you're sending off to whatever data source. In this case, this is a Logstash search, so we're sending this to an Elasticsearch cluster with a Logstash index. You can also configure result types: whether you want the actual contents of the log lines that match the query, whether you just want a simple count, or even just an indication that there are no results. And finally, you can apply thresholds on how many results you want to get back. Next up, you can also provide a description that gets included whenever an alert gets sent to you. You should preferably put in some information that allows whoever's assigned to the alert to resolve it. And there are a few categorization options at the bottom as well, for the alerts that are generated. All right, next up is the frequency, which is how often you want to run this search, and the time range, which is how far back of a time window you want to search. Most of the time, you're going to want both of these to be the same value, but if you want, say, better granularity, you might specify a frequency of one minute and a time range of ten minutes. And finally, we've got the status button, which lets you toggle this search on and off. Cool, that's all for the basic tab. Next up, let's talk about notifications.
So in 411, you can configure email notifications whenever it generates any alerts, and those notifications can be sent out as soon as the alerts are generated or included in an hourly or daily rollup. You also have to assign these alerts to an assignee, which is the person or group of people responsible for actually taking a look at those alerts and resolving them. And finally, the owner field is just for bookkeeping, so you can keep track of who's responsible for maintaining that particular search. And here's the AppSec group that we're currently using here. You see, it's got a list of all the users that are currently on the application security team, and whenever 411 generates an alert for this particular search, it'll email all of these people. All right, moving on to the final tab. Here, we've got some more advanced functionality that's less commonly used, like auto-close, which allows you to automatically close alerts that haven't seen any activity after a while, since they're probably stale. And we've also got the actual configuration for filters and targets here as well. So again, recall that filters allow you to reduce the list of alerts that get passed through 411 and eventually generated. Here is the list of filters that are currently available, and I'll just highlight a few of them. Dedupe allows you to dedupe alerts that are the same, and Throttle can throttle the alerts that are generated down to some threshold. For the purposes of this presentation, let's talk about the regular expression one, because it's relatively complicated. You can configure this particular filter with some sort of key, meaning which keys you want to match on within the alert, as well as a regular expression to match against. And then you can specify whether you want matching alerts to be included in or excluded from the final list of alerts.
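A minimal sketch of how a filter like that regular expression one behaves (this is illustrative Python, not 411's actual PHP implementation; the alert structure as a plain dict is an assumption):

```python
import re

def regex_filter(alerts, key, pattern, include=True):
    """Keep (include=True) or drop (include=False) alerts whose `key`
    field matches `pattern`, mirroring the filter described above."""
    matcher = re.compile(pattern)
    kept = []
    for alert in alerts:
        matched = bool(matcher.search(str(alert.get(key, ""))))
        # Include mode keeps matches; exclude mode keeps non-matches.
        if matched == include:
            kept.append(alert)
    return kept
```

For example, `regex_filter(alerts, "ip", r"^10\.")` would pass through only alerts from internal 10.x addresses, while `include=False` would drop them instead.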
Similarly, on the other side, we've got the list of targets that you can configure. We're going to cover the JIRA target, which allows you to specify a JIRA instance, a project, a type, and an assignee. Any alerts that make it to this target get turned into JIRA tickets, which is useful if you want to use JIRA as your alert management workflow. Cool. So that's about it as far as managing searches goes. Next up, we're going to get into actually managing the alerts that are generated by 411. So here it is, the main alert management interface. You'll notice at the top there's a search bar for filtering the list of alerts that are visible. 411 actually indexes all of its alerts into Elasticsearch, so all of your standard Lucene shorthand queries are valid here. In the center, you'll see all of the actual alerts that match the current filter, and you can select individual alerts and then apply actions to them using the action bar at the bottom. Now, if you want to drill down into an individual alert, you can. This is the view for looking at just a single alert, and you can see at the center there's all the information that was available before, but also a changelog for viewing all the actions that have been taken on this one alert. Additionally, you'll see the same action bar is available at the bottom. And let's say... Thank you. Let's say we were to investigate this alert: we take a look at that IP address, and we've determined that it's just a scanner, so nothing to worry about. We can then hit Resolve on that action bar, which will pop up this little dialog where we can select a resolution status, in this case "not an issue", and note exactly what actions we took to resolve this alert. And once you hit Resolve there, you'll see the changelog has been updated with this additional action.
411 also offers an alert feed, so what you can do is just keep this open, and whenever new alerts come in, they'll pop up on this list. You can also leave it running in the background, because it's got desktop notifications, so you'll see that nice little Chrome notification pop up whenever there are new alerts. Cool. All right. Next up. Thanks, Kai. I'm going to talk to you more about how we do alert management at Etsy using 411. So here we have a sample email generated by 411, and I'm going to go into some more depth and explain what's going on. The subject line of this email says "Login Service 500s". The description says "Login 500s. Investigate." For people that aren't very familiar with it, login is just basically the process that logs you into a website, and a 500 is basically a message that says something bad is happening, usually bad enough that you would want to create an alert for it and be notified about it. We can see from the time range that this alert covers the past five minutes, and we have buttons on the bottom both to view the alert in 411 and to view this search in Kibana as well. We also get a short snippet including the PHP error that was thrown. And as you can see from this short email snippet, people take action based on this alert. But let's take a step back a little bit and think more about what we do to actually create high quality alerts. At Etsy, the secret is we create alerts that have a high degree of sensitivity. What do I mean when I say high sensitivity? Well, let's say an alert fires 100 times over the course of a day, and out of those 100 times, that alert correctly predicts an event actually happening 90 times. What that means is that out of 100 times, the alert only improperly fires 10 times, so there's a 1 in 10 chance that the alert is misfiring, and 90% of the time the alert is responding correctly to an event.
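The arithmetic in that example is simple enough to sketch directly (illustrative Python, using the numbers from the talk):

```python
def sensitivity(correct_fires, total_fires):
    """Fraction of an alert's firings that corresponded to a real event."""
    return correct_fires / total_fires

# The example from the talk: the alert fired 100 times in a day and was
# right 90 times, so it misfires 1 time in 10.
print(sensitivity(90, 100))  # 0.9
```

An alert scoring around 0.9 on this measure is the kind the speakers consider worth paging someone about; noisier searches get demoted to rollups or unemailed searches, as described next.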
So we say that that particular alert has a sensitivity of 90%. That's a pretty high sensitivity that we would find to be useful. For alerts that aren't as important, we still create them as searches and alerts in 411, but we end up not generating email notifications from them, and I'll go into more detail as to why in just a moment. For moderately important alerts, we still generate alerts off of them, but we set them up as roll-ups: every hour or every day, the alert goes off and emails us the results. One reason why we really like doing this is because it gives us the option of monitoring a particular search over a period of time for anomalies. One of the reasons why we take this sort of tiered approach to alerting is because attackers hitting your website will often generate a lot of noise, and in the process of doing so, they'll set off a bunch of the different alerts that you have set up. So one thing that we often have to answer when we see an alert on our phone at 3 in the morning is: is this something that I really need to respond to at 3 in the morning? Can I just continue sleeping? Can I answer this tomorrow, or even after the weekend? Well, one way we make that determination is by looking at the other alerts that have gone off in the same period of time. We look at the high, medium, and low alerts that have gone off over this period. A good example of this would be: let's say there's a very high number of failed login attempts, a high-severity alert that has gone off recently. Well, if we also have a lower-severity alert indicating that a low quality series of bots is scanning us at the same time, maybe that's indicative that this isn't actually a real concentrated attack that we need to worry about, so we can go back to sleep.
So in addition to creating alerts, one thing that we also have to be vigilant about is maintaining our alerts. Sometimes we create alerts that overfit on a particular attacker, and as a result, those alerts become less useful over time. One way in which this happens is that the alert simply generates too much noise: we've created this search, and it turns out that the IP address, for example, might be shared by some legitimate users as well, and that can create a bunch of false positives. In those cases, what we sometimes do is look at other fields. Another example: say an attacker might accidentally be using a static but very easily identifiable user agent when attacking our website. We can create a search off of that to easily identify that attacker, but perhaps they become a little savvier, realize they're making this terrible mistake, and make an effort to randomize the user agent. By doing this, they're essentially forcing us to use other fields to identify the attacker, maybe looking at what data centers they're coming from or what IP addresses they're coming from, for example. So let's take a step back. We've sort of sold 411 as a tool for security teams, but it's also a very useful tool for the average developer as well. One way in which 411 can be useful for a developer is creating alerts based off of potential error conditions in your code. A good example of this would be when you want to note potential exception conditions, say, for code wrapped in a try-catch statement. You generally don't want your application to be running into too many exceptions, so by adding a log line and creating an alert based off of that log line, you'll get a notification when something bad happens in your application.
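As a sketch of that try-catch pattern (illustrative Python, not Etsy's actual PHP codebase; the logger name, function, and message are all hypothetical), the idea is simply to emit a distinctive log line that a 411 search can later match on:

```python
import logging

logger = logging.getLogger("app.giftcards")

def redeem_gift_card(redeem_fn, code):
    """Run the (hypothetical) redemption function; on failure, emit the
    distinctive log line that an alert search would be built on."""
    try:
        return redeem_fn(code)
    except Exception as exc:
        # A search for "gift card redemption failed" would pick this up
        # and generate an alert when it starts appearing in the logs.
        logger.error("gift card redemption failed: %s", exc)
        return None
```

Once the log line ships into Elasticsearch, a 411 search on that phrase turns every occurrence (or a count threshold over them) into an alert.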
Another condition under which you'd want to create an alert is when you're getting a large amount of unwanted traffic to an endpoint that you consider sensitive. A good example of this would be an attacker trying to hit a gift card redemption endpoint or a credit card entry endpoint. Those endpoints are probably already rate-limited in the first place, so it's only natural to add an additional alert on top of that, just to know that someone's intentionally trying to brute-force that particular endpoint. And finally, the last instance under which you might want to consider creating an alert is when you're deprecating old code. At Etsy, we have what's called a feature flag system that allows us to very easily flag on and off particular bits of code. But sometimes we need to evaluate how often a particular code branch is being exercised before we can remove it entirely from the code base. One way we do that is to add a log line and create an alert with a roll-up, to see how many times that particular code branch has been exercised throughout the course of a day or even a week. And by doing that, once we have confidence in knowing that, yes, this code is not really being used that often, we can go ahead and actually remove the code in question. So at Etsy, we actually have a couple of different instances of 411 set up, and I'll explain what they are. Our main instance, which the application security and risk engineering teams use, is called Sec411. This instance is primarily used for monitoring issues that happen on Etsy.com itself. The network security team has its own instance of 411, appropriately called NetSec411, and this instance is set up primarily to aid in monitoring laptops and our servers. And finally, for those compliance-loving folks, we have an instance of 411 called SOX411, which is primarily used for SOX-related compliance issues.
Now I'm going to go into some more examples of functionality that we have in 411 that we're going to be making available to you when we open source the tool. A lot of this additional functionality was made at the request of developers at Etsy, and we found it useful enough to include in the open source version of 411 as well. So Kai mentioned earlier that 411 has the ability to incorporate lists into queries. Here we have a search that looks for suspicious Duo login activity coming from known Tor exit nodes. This query looks fairly straightforward, but let's take a deeper look. We're looking at logs of the type Duo login, and we're looking for any IP address that matches this TorExits variable. If we take a look at the list functionality, we can see that TorExits is defined by a URL that just enumerates a list of IP addresses. So what 411 is actually doing behind the scenes is taking this list and inlining all of those IP addresses from the TorExits list. Essentially, whenever you get a hit on a log line that contains a Tor exit node IP address, it matches the search and generates an alert. Now I'm going to talk about some of the additional functionality that we offer beyond just the ELK stack with 411. We offer a searcher for Graphite, which covers that kind of time series data. This is what Graphite's front-end interface looks like; as you can see, it's a very nice way of easily generating graphs. This particular graph shows an overlay of potential cross-site scripting over potential scanners. It's just a really nice way of being able to determine when anomalies are happening. The Graphite searcher basically sends the query directly to Graphite itself, so all of Graphite's data transformation functions are available for you to use in the searcher.
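A rough sketch of what that list substitution amounts to (illustrative Python; 411 itself is written in PHP, and the query syntax and variable marker here are simplified assumptions):

```python
def expand_list(query_template, name, values):
    """Replace a @name variable in a query with an OR'd group of values,
    roughly how a list like TorExits gets inlined at search time."""
    group = "(" + " OR ".join(values) + ")"
    return query_template.replace("@" + name, group)

# The IP values would normally be fetched from the list's configured URL.
q = expand_list("type:duo_login AND ip:@TorExits", "TorExits",
                ["1.2.3.4", "5.6.7.8"])
print(q)  # type:duo_login AND ip:(1.2.3.4 OR 5.6.7.8)
```

Because the list is resolved at search time, updating the URL's contents changes what the query matches without anyone editing the search itself.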
So as an example of some of the things you can do, you can write a query that says: please fire off an alert when you see a high rate of change for failed logins. Now I'm going to talk a little bit about the HTTP searcher. This is a fairly straightforward searcher: you provide an HTTP endpoint, and if you receive an unexpected response code, it creates an alert based off of that. It's very useful for web services when you want to know whether a particular service is, for example, down, or even up. And for those in the DevOps community, this is very similar in functionality to the tool called Nagios. Now we're going to go to the next slide; let's hope this works. Okay, I'll be narrating this. For this demo, we set up a very simple WordPress blog instance called "demo all the things", and we have a plug-in installed called WP Audit Log, which logs everything that happens in this WordPress instance. In addition, we are forwarding the logs to our own ELK stack so that we can index the log files. Here I'm just showing off this blog post that we have: red is apparently the best color. And now we're going into Kibana proper to actually look at some of the log files from this WordPress instance. We can see here there's an interesting log line: "User deactivated a WordPress plug-in." Okay, that's kind of interesting. Maybe we can make an alert off of that particular phrase that we can use in the future. So what we're going to do now is go into 411 proper. We're going to go into the searches tab, hit the create button, and create a new search of the Logstash type. We're basically just going to create a new search to look for this particular message. We're going to call this search "disabled WordPress plug-in", and the query is going to look for anything in the message field that contains the phrase "user deactivated a WordPress plug-in".
And we're going to provide a little description in the search to let others know what this search is about and what alerts generated by it mean, for the future. We're going to look back over the past 15 minutes, and we're going to test this alert. And we can see here that 411 has successfully grabbed data from Logstash. So we're going to go ahead and create this search, and to actually generate a real alert, we're going to hit the execute button, which will not just test the search but actually create a real alert for us on the alerts page. We're going to navigate back from the test results that we just got from hitting the test button. So now we're going to go into alerts, and we're going to click on view to take a look at the particular alert that was just generated. And we can see here, in the plug-in file information, that the Duo WordPress plug-in was disabled. That's not good. So now that we've gotten the relevant information from this particular alert, we're going to go into the plug-ins page, and what do you know? The Duo two-factor auth plug-in is disabled. So we're going to go ahead and re-enable it. And now that we've taken care of that issue, we're going to hit resolve, and we're going to just note that we've taken action to re-enable this plug-in and have taken care of the alert by doing that. That concludes the live demo. Not live demo. That also happens to conclude the presentation. Once again, 411 is going to be open sourced after DEF CON, and we will take questions now. There's a mic over there and over there, so if you've got a question, please line up. Do you have a question? If you're leaving, you have to leave out these doors in the back. When deciding to move away from Splunk, how did you guys scale ELK versus going with Splunk? Splunk has a problem where, when it gets really big, it gets really expensive. So was it a cost decision, moving away from Splunk? The question was why we switched from Splunk.
It was basically a decision made by our operations team. One last question: what are you guys using as your send mail function? Are you using Mailchimp? We've just got everything set up correctly already, so it's whatever you provide to PHP. The question was what we use to send mail. You have a question? So, you're open sourcing 411 after this talk, that's the first part; and the second part is, is this built on an AWS architecture, such as using Simple Email Service? Is it using Elasticsearch? What is it using, as far as your infrastructure, that you can talk about? We're going to be open sourcing this after DEF CON, and as far as email... sorry, was the second question email, right? No, is it AWS architecture? Do you have an AWS architecture to go with it? Yes, whatever email... No, I meant in general, the entire thing; because, like, with Elasticsearch, are you using Lambda functions, or is it all pretty much internal to itself, instances as far as... Okay, got it, thanks. I have a question about the configuration. You showed us the beautiful UI, but how is the configuration actually stored? And yes, there is a change log on individual pages, but would it be easy to version control the configuration somehow? So the question was about the change log and version controlling of alerts. There is no version controlling of alerts, but there is a change log of all the actions that have been taken on the alert. Could you also speak louder? Because I think the mic is not great. Oh, okay, so the initial question was: how is the configuration stored? Is it stored in some text format that we can review, XML? Can we version control it?
We are using MySQL. So we are using MySQL as our database. Hello. So at this point you guys are probably definitely aware of Watcher, Elasticsearch's own alerting service. What was the motivation, versus using their own plugin built straight into the cluster? At the time when we started working on this, I don't think Watcher existed yet, so that is why we ended up writing this. So is there any point to using it now, as opposed to just running the plugin? I don't want to be like that guy... I don't know if you can put me on the spot. So, it's not just Elasticsearch: you can also plug other data sources into 411 and query those data sources. Thank you. I have two questions. One of them is: what was your motivation to move away from Splunk and build your own? So that was a decision made by our SysOps team; we didn't really have much input on that. Was that driven by any security concerns they had? Did they have any security concerns around it? Yeah, I don't think so. I think at one point the scripting functionality was enabled by default and there were some serious security issues with that, as far as I can remember. Just one last question: does ELK also help with doing log analysis across multiple servers and instances, or is it dedicated to just one group of...? You can run multiple instances and have them connect to the same database, and that would just work. Okay, thanks. Are you all open sourcing that ESQuery as well? Oh, it's already out. Oh, it's built in. My question is on the JIRA integration. In your demo, you showed that you resolved the issue with the user turning off the feature in WordPress. Does that end up closing a JIRA ticket?
No, it doesn't. The JIRA target is pretty much separate: you just send that data off to JIRA, and then 411 forgets about it. Okay, thank you. Okay, so my question is a little bit twofold. We saw a lot of web UI about this, but there wasn't any real focus on an API around it. So consider the use case where there might be something where the same type of alert happens frequently but self-resolves. Would it have the possibility to either escalate the same type of alert due to its frequency, or, in contrast, if it somehow self-resolves, have all the history of those alerts get resolved as well? That's not currently built in, but that's because it hasn't been asked for yet. So once this is open sourced, you could create an issue, and then we can consider it. Okay, thank you. Cool, thanks everyone.