Hi, my name is Joshua Saxe, and today, with Younghoo Lee, I'll be presenting "Detecting Handcrafted Social Engineering Emails with a Bleeding-Edge Neural Language Model."

A bit about us before getting into the talk. I'm chief scientist at Sophos. Sophos is a security products and services company; we have a whole range of security products, ranging from firewalls to mobile security to email security to endpoint security. The team that I manage, Sophos AI, is the AI shop inside of Sophos. We research and develop the company's machine learning technology, and we're also responsible for operationalizing that technology in service of defending about 100 million endpoints. Younghoo Lee is one of the star researchers on my team. He's almost solely responsible for the machine learning subsystem that protects our Android customers, at least for the R&D part of that work. He's also the principal author of the work presented today. I'm presenting in a supporting role here; I'll mostly just be setting the stage for the original work that Younghoo is going to describe.

So, the problem we're solving. Most generally, we're working on the problem of detecting phishing emails. More specifically, we're working on detecting business email compromise (BEC) phishing emails and targeted phishing emails. To understand what this is, it's useful to say what it's not. We're not focused on detecting mass campaigns, where detection reduces to a near-duplicate detection problem because attackers are just sending out millions of copies of basically the same phishing email. Detecting those types of emails turns out to be a side effect of our focus. Our real focus is on detecting new, custom-authored, bespoke phishing emails that are based on research on a target.

This diagram does a good job of getting across the kind of workflow we're looking at stopping. In this workflow, cybercriminals and attackers identify targets through open-source research, usually on the web. Then they establish contact with those targets in step two; we call this grooming. Here they put out a lure, an initial email or sometimes an initial text message, and then build trust and authenticity with the mark around some identity they're impersonating. In step three, they cash out the trust they've built with their targets and make an ask of those targets; oftentimes that ask will be around wiring money or sending credentials. And then in step four, they actually receive the money or the credentials.

Within targeted phishing, business email compromise, which is focused on stealing money from businesses and other organizations, has been a growing trend. As you can see here, by July 2016 a few billion dollars had been stolen through business email compromise attacks, according to the FBI; these are targeted phishing attacks that extort money from organizations. By May 2017 that number had grown to almost 6 billion, and by 2018 it had grown to more than 12 billion. I don't have data for the last two years, but I see no reason to believe that this trend has attenuated; I think it's likely it's continued on a similar trajectory. This is a big problem.
We see this in Sophos' customer base, which is a pretty large sample, and we also hear about it from other folks in the cybersecurity space. So we're very focused on it, because it's affecting people. And it's not just affecting large organizations: you can see on the axis on the right that something like 80,000 organizations had been hit by these attacks by July 2018. So there are lots of small and mid-sized organizations getting hit, and oftentimes the financial damage can be in the hundreds of thousands of dollars and really impact people's lives when these attacks happen.

So again, just to reiterate: we're focused primarily on these business email compromise use cases, and also more generally on targeted phishing, in which a lot of manual labor goes into the phishing process on the criminal actor's side. And I think it almost goes without saying, but we're focused on steps two and three of these criminal actors' workflow, the steps that are mediated over email. We see lots of malicious emails exchanged in the grooming step, and we also obviously see malicious email transmitted in the exchange-of-information step, where the attacker makes the ask of the target, usually an employee. This is just because we're focused on email as our signal, and that's the scope of the work I'll be talking about today.

Just to flesh this out a little bit, here's an example of the phishing emails sent in the later phases of the grooming stage of an attacker's workflow. Here the attacker has established themselves by impersonating the chancellor of UC Berkeley. They are emailing an employee at UC Berkeley, asking if they're available, looking to exchange messages with them, probably about to make an ask around a money transfer.

Now, I think it's important to highlight why phishing detection is hard, and why in particular it's hard to detect new, previously unseen phishing emails written as part of a manual phishing campaign like what I've just described. What this boils down to, I think, is that classical natural language processing problems are hard: it's hard to get computers to understand and reason about language in any meaningful way, and detecting phishing emails really boils down to building models that have some level of understanding of language.

So, to understand why it's algorithmically hard to make sense of language, let's look at a few classical natural language processing problems. One of those is co-reference resolution. To get a sense of what this problem means, consider the following sentence: "I went to the store for some milk and based on the price decided to buy it." Now, from a grammatical perspective, "it" could refer to the store or to the milk. It's possible that I went to the store for some milk and, based on the price of the store, decided to buy the store; but it's more likely that I bought the milk. As humans, in solving this co-reference resolution problem, resolving what "it" refers to, we have to plumb the depths of a number of complex mental models. We need to deploy our syntactic model of the English language, our semantic model of the English language, and our model of the world, in which a person is more likely to have bought milk at the store than to have bought the store itself. That's how we solve this problem. Hopefully it's clear that it's hard to get algorithms to do that. But to detect phishing emails we really need to understand language, and this problem is a constituent of the problem of understanding language.

Word polysemy is also a classic problem in the algorithmic understanding of language.
Consider a sentence like "He drank a lot and was quite the rake." Grammatically, it's valid to interpret this sentence as meaning that he drank a lot and was quite the garden tool used to rake leaves off your lawn. But that's clearly not the sense in which the word "rake" is being used; it's being used here in the sense of a drunk, semi-criminal, dissolute individual. It takes a pretty deep exercise of a human being's mental models to arrive at the sense in which this word was used, and that's not easy to reproduce in the form of an automated agent based on machine learning or on regexes and rules.

Sentiment detection is another hard problem in natural language processing. Consider the sentence: "I'm not angry at all, no, of course not. Why would I be angry that you spent our life savings on your mistress?" Clearly the speaker is angry here, and they're being sarcastic. Detecting that they're angry and sarcastic is not trivial; it requires that we understand not only the syntactic structure of the sentence but also its semantics, and it requires the common-sense inference that this person is probably angry if their interlocutor spent their life savings on their mistress.

Okay, so to solve phishing means that we need algorithms that can make sense of language, and making sense of language is hard, as these three problems demonstrate. A good solution to the phishing problem would, as an intermediate step to detecting that an email is a phishing email, be able to solve these problems at some level, somewhere in the depths of its intermediate representation of the language it's looking at. The other challenge we have, obviously, in any cybersecurity problem, at least any detection context in cybersecurity, is that we have adversaries who'd like to bypass our detection, and that's also worth considering. So these are all reasons why the problem we're presenting here is a hard one; we haven't solved it completely, and it deserves attention from our community.

Now, the approach that we're using to attack the phishing problem is based in neural networks and deep learning, and in a specific advance that happened in the last few years, known as transformers. Transformers, or more specifically transformer blocks, are a new kind of construct in neural networks. Much like convolutions were a new idea in the late 80s and 90s, and backpropagation was a new idea starting in the 80s, transformers are a new idea that's come out in recent years. They help model language with a depth and fidelity that seems to be genuinely new, and they represent a step function in our ability to model language. So they're very exciting. The big idea behind the work we're presenting today is that we're taking transformers and applying them to a cybersecurity problem, which we haven't seen much of before.

So I'm going to talk a bit about what transformers are. A detailed discussion of how they work is beyond the scope of this talk, but I'm going to give some intuition, and then I'll pass the mic, metaphorically, over to Younghoo, who will present on how we're using transformers.
Okay, so here's an example which I think helps to illustrate some ways in which transformers represent a real breakthrough in machine comprehension of language. In this example, and I expect some of you have seen this because it sort of went viral last year, a researcher wrote a prompt, given at the top here, the "in a shocking finding" paragraph, and then a transformer model just took it from there and wrote a story based on that prompt. I think when you see the story that the machine wrote, you'll see that it betrays an understanding of the syntax and semantics of language that's quite striking.

I think it's worth reading this out loud. The human-written prompt reads: "In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English." Now again, the challenge to the model is to see if it can pick up where the human left off and write something coherent; and if it can, that reflects something about its ability to solve problems like the fundamental problems I described earlier.

So the neural network continues: "The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow. Pérez and the others then ventured further into the valley. 'By the time we reached the top of one peak, the water looked blue, with some crystals on top,' said Pérez." And the essay continues.

What interested Younghoo and me in seeing examples like this of what transformers can do is that examples like these suggest that transformers have understood the syntax and semantics of the English language to some degree. Which raises the question: could the representations that transformers learn in their parameter structure be useful in detecting phishing emails, since these transformer models seem to have distilled knowledge of how language works and what it means within their parameter settings? So this is one impressive example of what transformers can do, raising the question of whether they could be applied usefully to phishing; we'll show results from our experiments later, in Younghoo's section.

Here's another example of just how marvelous transformer models are, with respect to how much more sophisticated they seem to be. Here a researcher typed into this box a description of some HTML that he would like the transformer model to generate. He writes, "a button with the color of Donald Trump's hair," and the model actually writes, as a text completion, some HTML and CSS. It seems to understand that Donald Trump has yellow hair, and it makes a button, and it writes valid HTML and CSS. I think that's impressive in the absolute sense, but for folks who've been in the NLP space for a while, it's also impressive in a relative sense: if you had shown somebody 5 or 10 years ago that we could do this in 2020, I think they would have been a bit incredulous. Applications like these represent a big step forward in natural language processing, and they're due, again, to this new idea called the transformer block. That's the key building block out of which models like this one, which is called GPT-3 and comes from OpenAI, an AI research lab, are built.
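To make this concrete, here is a minimal sketch, assuming the Hugging Face transformers library is installed, of how you might reproduce this kind of prompt completion with a publicly released model such as GPT-2 (the model behind the unicorn story); the generation settings are illustrative, not the exact ones the researcher used.

```python
# A minimal sketch of prompt completion with a pretrained transformer language
# model, assuming the Hugging Face "transformers" library is available.
# GPT-2 is used because it is publicly released; outputs vary from run to run.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = ("In a shocking finding, scientists discovered a herd of unicorns "
          "living in a remote, previously unexplored valley in the Andes Mountains.")
result = generator(prompt, max_length=200, do_sample=True, top_k=50)
print(result[0]["generated_text"])  # the model's continuation of the prompt
```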
Okay, so moving on. I want to go back a little bit in the history of NLP as a way of describing how transformers are new and different. I assume a substantial chunk of our audience today has studied basic machine learning on text, and typically in a machine learning 101 course you learn about the bag-of-words model and about discrete-state Markov models of language. I want to talk about these representations as a way of talking about how limited they were, and then I'll talk about how transformers break us free of some of those limitations.

A bag-of-words model of a document is a way of representing a document numerically for the purposes of machine learning, in which we just count up how many times each word in that document appeared, and then we create a matrix out of all the documents we have and all the words in all those documents, where the entries in that matrix are just word counts. So in the case on the left over here, the column vectors in our matrix are documents: they get one dimension per word in our vocabulary, and the entries of each particular vector are counts of how many times each word appeared in that document. Hopefully it's intuitive that once you've represented your documents in a vocabulary space this way, you can compare documents by taking some distance measure between pairs of documents. You can also train machine learning models on your document corpus to, say, classify news articles as being about sports or politics. But what you've done in the first step of these models is drop out sequence information. You've forgotten about which words come in which order in the document, and you've just represented your document as a bag of words. That's a useful simplifying assumption, and it's one that we still use today in some of the modeling we do in my research group at Sophos, but it throws out a ton of information that transformers and more modern models don't throw out.

The model on the right is a discrete-state Markov model of language. Here each word is a state, and the sample of language given by the model is kind of a cheesy adventure story, in which the next word in an utterance depends only on the current word: you're just drawing from a probability distribution and moving through this graph to generate language. You could never have generated anything close to that unicorn story using a Markov model, and yet as recently as the last 5 to 10 years there were lots of papers coming out around using hidden Markov models to parse sentences and that kind of thing. These are still useful models, but we've gone far beyond the simplifying assumptions in these original, simplified models of the world in NLP from the past few decades.
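As a quick illustration, here is a minimal sketch of the bag-of-words representation just described, using scikit-learn's CountVectorizer (an assumption for illustration; any word-counting code would do). Note how word order is completely discarded.

```python
# A minimal bag-of-words sketch using scikit-learn (assumed available).
# Each document becomes a vector of raw word counts; all sequence
# information is thrown away, as described above.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "please wire the payment today",
    "the payment arrived today",
]
vectorizer = CountVectorizer()
term_matrix = vectorizer.fit_transform(docs)   # shape: (n_docs, vocab_size)

print(vectorizer.get_feature_names_out())      # the learned vocabulary
print(term_matrix.toarray())                   # rows are documents, entries are counts
```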
So let's contrast transformers now with these earlier natural language processing approaches; I'll get into more detail about how transformers work in a second. Pre-transformer, most machine learning approaches didn't consider words in context. Many approaches made the simplifying bag-of-words assumption as a first step in the modeling process and then ran term vectors through models like topic models, logistic regression, or support vector machines. Most models didn't model co-reference relationships between words, or which words pertain to which other words in a sentence; I'll talk about what that means later, but transformers do substantially solve that problem. Most approaches operated on either words or characters; with transformers we typically use well-chosen chunks of words, which allows us to model misspellings and that kind of thing, so there's been an improvement there in the current generation of natural language processing models. And older approaches tended not to use neural network technology. That's changed a lot in the last 10 years, and transformers leverage some of the best ideas that have come out of the neural network revolution that's been ongoing since around 2012.

So transformers kick apart a number of logjams in natural language modeling. They give words contextual representations; they model attention, the relationships between words in a document; they use these smart partial-word representations that allow for misspellings and the whole Tower of Babel of vernaculars that appear under the banner of, say, the English language on the internet; and they take advantage of ideas like residual connections, modern optimizers, and many of the really good ideas that have come out of the neural network revolution. These are all reasons why we wanted to test their applicability to phishing detection.

If you want to get into detail about how transformers work, I'd recommend this blog post by Jay Alammar; it's where I got this screenshot. I'm not going to get into the details of the series of matrix multiplications, summations, and various linear algebra operations that comprise a transformer block. I just want to give a little bit of intuition here before passing the mic over to Younghoo.

The basic idea behind a transformer block, which is kind of a Lego block out of which you build a transformer-based neural network, is that we take in a sequence of words. This diagram shows a very simple example where we take in a sequence of just two words; typically we take in a larger window, like 512 words. We pass them into the block, and the way they get passed in is not just as entries in a term-vector matrix but actually as vectors themselves: the words get a vector representation, known as embeddings, and we pass them all into the transformer network.

The first thing the transformer network does to our input word sequence is model the attention relationships between the words. In a two-word case this is a little bit harder to describe, but if you had a sequence of, say, 15 words, then for every word the attention mechanism would compute how much that word pertains to each of the other 14 words in the sentence. You'll see how that has a relationship with the co-reference resolution problem I talked about earlier; it's intuitive that there's a graph of word relationships in a sentence, and self-attention models that, in terms of which subjects pertain to which objects, which pronouns pertain to which people, and so on.

There's an addition and normalization step that happens once we've run this self-attention process a number of times. Typically we don't just do self-attention once; we have a number of what are called heads, and we run attention a number of times. We combine all that together, we do some non-linear transformation on it, and then we wind up with a new embedding of our original sequence, of the same dimensionality as the original embedding, except that now "thinking" is encoded in the context in which it appears, the context of "machines," and "machines" is encoded in the context of "thinking." And then typically we stack these transformer blocks: we have another transformer block that further refines the representation, and we keep going. Younghoo will show that we use a number of transformer blocks in our phishing detection work in a few minutes.
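Before moving on, here is a toy NumPy sketch of the core self-attention computation just described; the projection matrices would be learned in a real network, and real transformer blocks add multiple heads, add-and-normalize steps, and feed-forward layers on top of this.

```python
# A toy sketch of self-attention inside a transformer block, in NumPy.
# Weights are random here purely for illustration; real networks learn them.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X has shape (seq_len, d_model): one embedding vector per token."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # how much each token attends to each other token
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                          # each output is a context-weighted mix of values

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 8))                     # a two-token sequence, as in the diagram
W_q, W_k, W_v = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(X, W_q, W_k, W_v).shape)   # (2, 8): same shape, now contextual
```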
Here's some intuition as to what comes out of the attention mechanism in a typical transformer: here's what a transformer block has decided "it" relates to in an input sentence. We have the input sentence "The animal didn't cross the street because it was too tired," and the strongest attentional relationship goes to "the animal." That's interesting, because one could ask whether "it" refers to the street or the animal here, and one can interpret the weight of the connection to "the animal" as meaning that the transformer block has "decided," in quotes, that "it" pertains to the animal. So hopefully you get some intuition as to how powerful this attentional representation is, and how important it is in machine learning models getting some of what we might call understanding of the language they're analyzing.

Okay, so I want to put the intuition together around how transformers pertain to the work that Younghoo and I are presenting today. Basically, what we're going to do in our phishing model is embed an email as a sequence of embedded token vectors, and then we're going to run that through a series of transformer blocks like what I just showed, which create a very refined attentional representation of the word sequences and produce contextual embeddings that get at the meaning of the words in the context in which they appear. And then, finally, our network solves the classification task: it says whether the email is a phishing email or not. How all that magic works, and how the network gets trained, where there are some tricks that are really cool, I'll leave to Younghoo. I hope a lot of this makes sense, and I'm happy to take questions about my piece of this presentation at the end of the talk.

Thank you, Josh. Let me continue with the second part of our talk, which will cover our design decisions for CatBERT and its performance results. CatBERT is the name of our email model: Context-Aware Tiny BERT. The model size is tiny, but it is mighty.

Modern NLP models all have nice, friendly names. For example, ELMo was introduced in 2018; the model used bidirectional LSTMs to generate contextualized word embeddings. Later the same year, Google researchers introduced BERT and achieved state-of-the-art performance on many English language understanding problems. The next year, 2019, Baidu researchers introduced ERNIE, and that model achieved state-of-the-art performance on many Chinese language understanding problems. These are all popular characters from Sesame Street. This year, 2020, we introduce CatBERT to tackle email security problems.

Transformer-based NLP models are powerful, but they are complex and heavy, and it is challenging to deploy heavy models for real-time applications. So our first design goal is to convert the heavy model into a lightweight model: we reduce the number of parameters and thereby improve inference speed. We downsized a baseline model called DistilBERT, which has six transformer blocks. We take half of the transformer blocks from the pre-trained model and then replace the missing transformers with simple adapters. For example, here we take transformers 1, 3, and 5, and then we add two adapters. We could also keep a different number of transformer blocks. This approach significantly reduces the number of parameters.
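Here is a hedged sketch of what this kind of layer surgery could look like with the Hugging Face transformers library; the checkpoint name and the wiring are illustrative, not the production CatBERT code, and the adapter internals anticipate the description that follows.

```python
# A sketch (not the production CatBERT code) of the downsizing idea: keep
# alternating pretrained transformer blocks from DistilBERT and interleave
# small adapters. Assumes the Hugging Face "transformers" library and PyTorch.
import torch
import torch.nn as nn
from transformers import DistilBertModel

class Adapter(nn.Module):
    """Two dense layers with a nonlinearity in between and a skip connection.
    Near-zero initialization makes the block start out close to an identity."""
    def __init__(self, dim):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        for fc in (self.fc1, self.fc2):
            nn.init.normal_(fc.weight, std=1e-3)  # near-zero weights
            nn.init.zeros_(fc.bias)

    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(x)))  # skip connection

model = DistilBertModel.from_pretrained("distilbert-base-multilingual-cased")
kept = [model.transformer.layer[i] for i in (0, 2, 4)]  # zero-indexed blocks 1, 3, 5
dim = model.config.dim  # DistilBERT hidden size (768)
adapters = [Adapter(dim), Adapter(dim)]
# Conceptually, the forward pass becomes:
#   block 1 -> adapter 1 -> block 3 -> adapter 2 -> block 5 -> classification head
```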
The second goal is to improve model performance by combining additional input. Standard NLP models only accept text data as input; however, we can extract additional features from email headers and use them as additional input to our model. The text input goes into the embedding layer, and then we add the additional input at the classification head: we added additional dense layers in the classification head to combine the output of the transformer blocks with the header-based input. With this additional input, we improved our model's performance further.

Let me talk about the details of our adapters. We inserted two adapters here, and each adapter has quite a simple architecture: two dense layers, with one non-linear activation unit in between, and a skip connection. The dimensionality of the dense units is the same as the output of the transformer blocks, and the two dense layers are initialized with near-zero values, so initially the adapters act as identity blocks. However, they gradually learn to change the data flowing from the lower transformer block to the upper transformer block in order to minimize the classification loss.

We also modified the standard fine-tuning method, using a partial fine-tuning method instead. Standard fine-tuning involves updating all parameters jointly. Partial fine-tuning only updates the upper blocks while keeping the lower blocks fixed. For example, here the lower blocks, the embedding layer and transformers 1 and 3, are frozen, but we update adapters 1 and 2, transformer 5, and the classification head. This is done to minimize forgetting of the representations learned by the lower transformer blocks.

As mentioned earlier, we have two sets of features: one from the text data and another from the email headers. We can use multiple email header fields, for example the From, To, CC, and Reply-To fields, to extract additional context information, and we consider the subject and the body text as the content input.

The first set of features, the content features, comes from the email text. We extract text data from the subject and the plain-text body. If only HTML content is available, we extract plain text from the HTML data using an HTML parser. For example, this one is a simple HTML hyperlink, but we only extract the visible text, "visit bank site," as output. Then, from the extracted text, we remove less informative characters: we remove digits and punctuation characters. In the second example there are many hyphens, but the hyphens are removed in this step; otherwise, each hyphen would become an individual token. Finally, we select 120 tokens as input to the transformer blocks.

We use a subword tokenizer called WordPiece. The WordPiece tokenizer can overcome some of the limitations of character-level or word-level tokens: character-level tokens are too fine-grained, so it is hard to recognize word boundaries and the meaning of words, while word-level tokens often have out-of-vocabulary problems. The subword tokenizer can split complex or uncommon words into subword tokens. For example, "catword" can be divided into the simpler "cat" and "##word," where the double hash is an indicator of a subword. Similarly, "Sophos" can be divided into three tokens: "so," "##ph," and "##os." The subword tokenizer reduces the number of unknown tokens in our email data.
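As a quick illustration, here is a small sketch of WordPiece subword tokenization using a pretrained multilingual BERT tokenizer from the Hugging Face transformers library; the exact splits depend on the checkpoint's vocabulary, so they may differ from the examples above.

```python
# A sketch of WordPiece subword tokenization (Hugging Face "transformers"
# assumed). The "##" prefix marks a token that continues the previous one;
# exact splits depend on the pretrained vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
for word in ["catword", "Sophos", "urgent"]:
    print(word, "->", tokenizer.tokenize(word))
```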
With the selected tokens as input to the embedding layer, we then have three transformer blocks, and each transformer block has 12 attention heads. Each individual attention head learns contextual relationships between tokens, and the diagram on the right shows attention weights between tokens: the "transfer" token has attention weights for each of the other tokens.

In our email data, we have many non-English emails; non-English emails can include English words, and English emails can include non-English words. Non-English emails account for 25% of our total benign and phishing emails, so we need to support a multilingual model which can recognize different languages. How can we support multilingual emails? The solution is that BERT comes in two versions, English and multilingual. The English version was pre-trained on large English text datasets, including Wikipedia, and has a vocabulary of 30,000 English tokens. The multilingual version was pre-trained on large text datasets from more than 100 languages, and it has a four-times-larger vocabulary of 120,000 tokens, which covers many Unicode characters and words. So we fine-tuned the multilingual BERT for our multilingual emails.

The second set of features comes from the email headers. We can extract multiple indicators from the header fields. For the first one, we check whether an email is internal or external: we can compare the domains of the recipient and the sender. For example, here the domain names look similar, but one actually contains extra characters, so we consider this an external email. We can also detect an external reply-to by comparing the domains of the From and Reply-To fields; targeted phishing attackers often use a different domain for Reply-To. We also collect the number of recipients and the number of carbon-copy recipients as additional indicators. It is obvious that targeted phishing attacks will usually have only a single recipient, whereas many mass phishing emails will have multiple recipients or carbon-copy recipients.
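A minimal sketch of extracting these kinds of header indicators with Python's standard email library; the field choices mirror the description above, but the helper names and exact feature semantics are illustrative, not from the production system.

```python
# A sketch of header-based context features using only Python's standard
# library. Helper names are hypothetical; feature semantics follow the
# indicators described above (external sender, Reply-To mismatch, counts).
from email import message_from_string
from email.utils import getaddresses, parseaddr

def domain_of(address):
    return parseaddr(address)[1].rpartition("@")[2].lower()

def header_features(raw_email, internal_domain):
    msg = message_from_string(raw_email)
    sender_domain = domain_of(msg.get("From", ""))
    reply_domain = domain_of(msg.get("Reply-To", ""))
    recipients = getaddresses(msg.get_all("To", []))
    cc = getaddresses(msg.get_all("Cc", []))
    return {
        "is_external": sender_domain != internal_domain,
        # a mismatched Reply-To domain is a common targeted-phishing indicator
        "external_reply_to": bool(reply_domain) and reply_domain != sender_domain,
        "num_recipients": len(recipients),
        "num_cc": len(cc),
    }
```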
Next, let's have a look at the performance of CatBERT. We use a dataset of 10 million benign samples; the dataset also includes 350 phishing emails and 1,000 BEC emails. We use a time split to allocate 70% of the samples for training and the remaining 30% for testing. We included two baseline models to compare against CatBERT. The first is DistilBERT, which has six transformer blocks, and the other is an LSTM (long short-term memory) model, a recurrent neural network architecture, which also used BERT's same embedding layer. We trained the three models on a GPU instance from AWS. We assigned a high sample weight to the BEC samples and oversampled the minority-class phishing samples to balance the samples in each minibatch. To compare performance, we use ROC curves and the area under the curve; we also compare inference speed and model size as key performance metrics.

These ROC curves compare our CatBERT model with the two baseline models. The top, blue curve is CatBERT, the second is DistilBERT, and the bottom, green one is the LSTM. Our model outperformed the two baseline models, achieving a 0.82 true positive rate at a 0.1% false positive rate.

Next, we compare performance when we remove the adapters and the context-input-related layers. The top curve is CatBERT, the second, orange one is with the adapters removed, and the bottom one is with the context input removed from CatBERT. We can see a significant performance drop when we remove either the adapters or the context input, which demonstrates that we can improve performance by using the additional adapters and context-related layers.

Next, we compare the performance of the three models on the targeted BEC samples. Because we assigned a high sample weight to the BEC samples, we achieved high performance in detecting them, and CatBERT outperformed the two baseline models.

Next, we compare performance on phishing emails. We divided the phishing emails into two groups, English and non-English. CatBERT outperformed the baselines on both English and non-English emails; and because our model was based on the multilingual BERT, we see significantly better performance when using the model to detect non-English emails, compared with the simple LSTM model.

Next, we compare inference speed. DistilBERT has six transformer blocks and CatBERT has three transformer blocks, so we achieved a two-times speedup in inference time when we measured performance on a CPU machine. As the number of blocks decreases, the inference time is reduced.

Next, model size. For comparison, we divide the model size into two parts: the embedding and the transformer blocks. DistilBERT has six transformers, with 92 million parameters for the embedding and 42 million parameters for the transformer blocks. CatBERT reuses the same embedding parameters, but we reduced the number of parameters in the transformer blocks by 50%; in total, our model size is 85% of the baseline model's. When we apply the same mechanism to the English version, the English CatBERT has 71% of the parameters of its baseline model.

Next, we will inspect how CatBERT generates its outputs. We use the LIME method to interpret our predictions. LIME (Local Interpretable Model-agnostic Explanations) is a method that can be applied to any black-box model: we can understand a model by perturbing the input and observing how the predictions change.

The first LIME example is a benign email. The prediction score for this email is close to zero. We highlight legitimate tokens in blue and malicious ones in orange, and in this one there are no highly weighted malicious tokens.

Next, we have a BEC sample. The model's prediction score for maliciousness is close to one, and the model recognizes "transfer" and "urgent payment" as highly weighted tokens.

Next, we have another BEC sample, which is related to gift cards. The model's prediction score is close to one, and "card" and "urgently" are the highly weighted tokens for this email.

Next, we have two handcrafted social engineering emails. They look quite different, but if we read the text carefully, they are asking for the same wire transfer. The model predictions are close to one for both emails, and the highlighted tokens are "payment" and "as soon as possible." This example demonstrates our model's ability to understand complex texts: conceptually similar emails can be identified.

In conclusion, our CatBERT is a carefully re-architected transformer-based model. With this architecture, we achieve both high speed and high accuracy in detecting handcrafted social engineering email attacks. In the future, we want to apply the same design decisions to the new GPT-3 model. Thank you. Do you have any questions?