Hi everybody, and welcome to Practical Deep Learning for Coders, lesson four, which I think is the lesson that a lot of the regulars in the community have been most excited about, because it's where we're going to get some totally new material, a totally new topic we've never covered before. We're going to cover natural language processing, NLP. You'll find there is indeed a chapter about that in the book, but we're going to do it in a totally different way to how it's done in the book. In the book, we do NLP using the fastai library, using recurrent neural networks, RNNs. Today we're going to do something else: we're going to do transformers, and in fact we're not even going to use the fastai library at all.

So what we're going to be doing today is fine-tuning a pre-trained NLP model using a library called Hugging Face Transformers. Now, given this is the fast.ai course, you might be wondering why we'd be using a library other than fastai. The reason is that I think it's really useful for everybody to have experience and practice using more than one library, because you'll get to see the same concepts applied in different ways, and I think that's great for your understanding of what these concepts are. Also, I really like the Hugging Face Transformers library. It's absolutely the state of the art in NLP, and it's well worth knowing. If you're watching this on video, by the time you're watching it we will probably have completed our integration of the Transformers library into fastai. It's in the process of becoming the main NLP foundation for fastai, so you'll be able to combine Transformers and fastai together. So I think there's a lot of benefits to this, and in the end you're going to know how to do NLP in a really fantastic library.
Now, the other thing is that Hugging Face Transformers doesn't have the same layered architecture that fastai has, which means, particularly for beginners, the high-level, top-tier API that you'll be using most of the time is not as ready to go as you're used to from fastai. And that's actually, I think, a good thing. You're up to lesson four. You know the basic idea now of how gradient descent works, and you know how parameters are learned as part of a flexible function. I think you're ready to try using a somewhat lower-level library that does a little bit less for you. So it's going to be a little bit more work. It's a very well-designed library and it's still reasonably high-level, but you're going to learn to go a little bit deeper. And that's how the rest of the course in general is going to be: we're going to get a bit deeper, and a bit deeper, and a bit deeper.

So first of all, let's talk about what we're going to be doing with fine-tuning a pre-trained model. We've talked about that in passing before, but we haven't really been able to describe it in any detail because you haven't had the foundations. Now you do. You played with these sliders last week, and hopefully you've all actually gone into this notebook and dragged them around and tried to get an intuition for this idea that moving them up and down makes the loss go up and down, and so forth. I mentioned that your job was to move these sliders to get this as nice as possible. But suppose that when it was given to you, the person who gave it to you said: oh, actually, slider a, that should be on 2.0, we know for sure. And slider b, we think it's around two and a half. Slider c, we've got no idea. Now that'd be pretty helpful, wouldn't it? Because you could immediately start focusing on the one we have no idea about, and get that in roughly the right spot. And then the one you've got a vague idea about, you could
just tune it a little bit, and the one that they said they were totally confident of, you wouldn't move at all. You would probably tune these sliders really quickly.

That's what a pre-trained model is. A pre-trained model is a bunch of parameters that have already been fit, where some of them you're already pretty confident of what they should be, and some of them we really have no idea at all. And so fine-tuning is the process of taking those ones we have no idea about and trying to get them right, and then moving the other ones a little bit.

The idea of fine-tuning a pre-trained NLP model in this way was pioneered by an algorithm called ULMFiT, which was first presented actually in a fast.ai course, I think the very first fast.ai course. It was later turned into an academic paper by me in conjunction with a then PhD student named Sebastian Ruder, who's now one of the world's top NLP researchers, and it went on to help inspire a huge step improvement in NLP capabilities around the world, along with a number of other important innovations at that time.

This is the basic process that ULMFiT described. Step one was to build something called a language model using basically nearly all of Wikipedia. What the language model did was try to predict the next word of a Wikipedia article,
in fact, every next word of every Wikipedia article. Doing that is very difficult. There are Wikipedia articles which would say things like "the 17th prime number is...", or "the 40th president of the United States, blah, said at his residence, blah, that...". Filling in these kinds of things requires understanding a lot about how language is structured, and about the world, and about math, and so forth. So to get good at being a language model, a neural network has to get good at a lot of things. It has to understand how language works at a reasonably good level, and it needs to understand what it's actually talking about, and what is actually true or not true, and the different ways in which things are expressed, and so forth. So this was trained using a very similar approach to what we'll be looking at for fine-tuning, but it started with random weights, and at the end of it there was a model that could predict correctly, more than 30% of the time, what the next word of a Wikipedia article would be.

So in this particular case, for the ULMFiT paper, we then took that, and the first task I did, actually for the fast.ai course back when I invented this, was to try and figure out whether IMDb movie reviews were positive or negative sentiment: did the person like the movie or not? So what I did was I created a second language model. Again, the language model here is something that predicts the next word of a sentence, but rather than using Wikipedia, I took this pre-trained model that was trained on Wikipedia and I ran a few more epochs using IMDb movie reviews. So it got very good at predicting the next word of an IMDb movie review. And then finally, I took those weights and I fine-tuned them for the task of predicting whether or not a movie review was positive or negative sentiment.
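The next-word objective used by the language models in these steps can be sketched in a few lines. This is only an illustration under simplifying assumptions: it splits on whitespace rather than using a real tokenizer, and the function name next_word_pairs is made up for this example. It shows how plain text supplies its own targets, since every word is the target for the words that precede it.

```python
def next_word_pairs(text):
    """Return (context, next_word) training pairs from raw text.

    Illustrative only: real language models work on tokens from a
    trained tokenizer, not on whitespace-separated words.
    """
    words = text.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


pairs = next_word_pairs("the 17th prime number is 59")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

No human labeling is involved here: a six-word sentence yields five training examples for free.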
So those were the three steps. This is a particularly interesting approach because this very first model, and in fact the first two models, if you think about it, don't require any labels. I didn't have to collect any kind of document categories or do any kind of surveys or collect anything. All I needed was the actual text of Wikipedia and the movie reviews themselves, because the label was: what's the next word of a sentence?

Now, since we built ULMFiT (and we used RNNs, recurrent neural networks, for this), at about the same time that we released it, a new kind of architecture, particularly useful for NLP at the time, was developed, called transformers. Transformers were particularly built because they can take really good advantage of modern accelerators like Google's TPUs. They didn't really allow you to predict the next word of a sentence. It's just not how they're structured, for reasons we'll talk about probably in part two of the course. So they threw away the idea of predicting the next word of a sentence, and instead they did something just as good and pretty clever: they took chunks of Wikipedia, or whatever text they're looking at, deleted a few words at random, and asked the model to predict which words were deleted, essentially. So it's a pretty similar idea. Other than that, the basic concept was the same as ULMFiT. They replaced the RNN approach with a transformer model, and they replaced our language model approach with what's called a masked language model, but other than that the basic idea was the same. So today we're going to be looking at models using what's become the much more popular approach than ULMFiT, which is this transformers masked language model approach.

Okay, John, do we have any questions? And I should mention we do have a professor from the University of Queensland, John Williams, joining us, who will be asking the highest-voted questions from the community. What have you got, John?
Yeah, thanks Jeremy. Look, we might be jumping the gun here, I suspect this is where you're going tonight, but we've got a good question here on the forum, which is: how do you go from a model that's trained to predict the next word to a model that can be used for classification?

Sure. So yeah, we will be getting into that in more detail, and in fact maybe a good place to start would be the next slide, to give you a sense of this. You might remember in lesson one we looked at this fantastic Zeiler and Fergus paper, where we looked at visualizations of the first layer of an ImageNet classification model. Layer one had sets of weights that found diagonal edges, and here are some examples of bits of photos that successfully matched, and opposite diagonal edges, and color gradients, and here are some examples of bits of pictures that matched. And then layer two combined those, and now you know how those were combined, right? These were rectified linear units that were added together. And then the outputs of sets of those rectified linear units, they're called activations, were then themselves run through a matrix multiply and a rectified linear unit, and added together. So now you don't just have edge detectors: layer two had corner detectors, and here are some examples of some corners that that corner detector successfully found. And remember, these were not engineered in any way; they just evolved from the gradient descent training process. Layer two had examples of circle detectors, as it turns out. And skipping a bit, by the time we got to layer five we had bird and lizard eyeball detectors, and dog face detectors, and flower detectors, and so forth.

Nowadays you'd have something like a ResNet-50, which would be something you'd probably be training pretty regularly in this course, so you've got 50 layers, not just five layers. Now, the later layers do things that are much more specific to the training task, which is actually predicting
what it is that we're looking at. The early layers, it's pretty unlikely you're going to need to change them much, as long as you're looking at some kind of natural photos, right? You're going to need edge detectors and gradient detectors.

So what we do in the fine-tuning process is this: there's actually one extra layer after this, which is the layer that actually says what this is, it's a dog or a cat or whatever. We actually delete that; we throw it away. That last matrix multiply has one output, or one output per category you're predicting. We throw that away. So the model's last matrix is now spitting out, it depends, but generally a few hundred activations. And what we do, as we'll learn more about shortly in the coming lessons, is we just stick a new random matrix on the end of that, and that's what we initially train. So it learns to use these kinds of features to predict whatever it is you're trying to predict, and then we gradually train all of those layers. So that's basically how it's done. That's a bit hand-wavy, but particularly in part two we'll actually build that from scratch ourselves. And in fact, in this lesson, time permitting, we're actually going to start going down the path of building a real-world deep neural net in Python, so we'll be starting to make some progress towards that goal.

Okay, so let's jump into the notebook. We're going to look at a Kaggle competition.
That's actually on as I speak. I created this notebook called Getting started with NLP for absolute beginners, and the competition is called the U.S. Patent Phrase to Phrase Matching competition. So I'm going to take you through a complete submission to this competition. Kaggle competitions are interesting, particularly the ones that are not playground competitions but the real competitions with real money applied. They're interesting because this is an actual project that an actual organization is prepared to invest money in getting solved, using their actual data. A lot of people are a bit dismissive of Kaggle competitions as being not very real, and it's certainly true you're not worrying about stuff like productionizing the model. But in terms of getting real data about a real problem that real organizations really care about, and a very direct way to measure the accuracy of your solution, you can't really get better than this. So this is a good competition to experiment with for trying NLP.

Now, as I mention here, probably the most widely useful application for NLP is classification. As we've discussed in computer vision, classification refers to taking an object and trying to identify a category that object belongs to. Previously we've mainly been looking at images. Today we're going to be looking at documents. Now, in NLP, when we say document, we don't specifically mean a 20-page-long essay. A document could be three or four words, or a document could be the entire encyclopedia. So a document is just an input to an NLP model that contains text. Now, classifying a document, deciding what category a document belongs to, is a surprisingly rich thing to do. There's all kinds of stuff you could do with that. So for example, we've already mentioned sentiment analysis. That's a classification task.
We're trying to decide on the category: positive or negative sentiment. Author identification would be taking a document and trying to find the category of author. Legal discovery would be taking documents and putting them into categories according to in or out of scope for a court case. Triaging inbound emails would be putting them into categories of throw away, send to customer service, send to sales, et cetera. So classification is a very, very rich area, and for people interested in trying out NLP in real life, I would suggest classification would be the place I would start for looking for accessible, real-world, useful problems you can solve right away.

Now, the Kaggle competition does not immediately look like a classification competition. Let me show you some data. What it contains is data that looks like this. It has a thing that they call anchor, a thing they call target, a thing they call context, and a score. Now, I can't remember the exact details, but I think these are from patents, and I think on the patents there are various things they have to fill in. One of those things is called anchor, one of those things is called target, and in the competition the goal is to come up with a model that automatically determines which anchor and target pairs are talking about the same thing. So a score of one here: "wood article" and "wooden article", obviously talking about the same thing. A score of zero here: "abatement" and "forest region", not talking about the same thing.
So the basic idea is that we're trying to guess the score, and it's kind of a classification problem, kind of not. We're basically trying to classify things into either "these two things are the same" or "these two things aren't the same". It's kind of not, because we have not just one and zero but also 0.25, 0.5 and 0.75. There's also a column called context, which I believe is the category that this patent was filed in, and my understanding is that whether the anchor and the target count as similar or not depends on what the patent was filed under.

So how would we take this and turn it into something like a classification problem? The suggestion I make here is that we could basically say: okay, let's put some constant string like TEXT1 or FIELD1 before the first column, and then something else like TEXT2 before the second column. Oh, and also the context, I should have that as well: TEXT3 for the context. And then try to choose a category of meaning similarity: different, similar, or identical. So you can basically concatenate those three pieces together, call that a document, and then try to train a model that can predict these categories. That would be an example of how we can take this similarity problem and turn it into something that looks like a classification problem. We tend to do this a lot in deep learning: we take problems that look a bit novel and different and turn them into a problem that looks like something we recognize.

So, on Kaggle, this is a larger dataset that you're going to need a GPU to run. You can click on the Accelerator button and choose GPU to make sure that you're using a GPU. If you click Copy and Edit on my document, I think that'll happen for you automatically. Personally, I like using things like Paperspace generally better than Kaggle. Kaggle's pretty good, but you only get 30 hours a week of GPU
time, and their notebook editor for me is not as good as the real JupyterLab environment. So there's some information here I won't go through, but it basically describes how you can download stuff to Paperspace or your own computer as well if you want to.

I always create this little Boolean in my notebooks called iskaggle, which is going to be true if it's running on Kaggle and false otherwise, and for any little changes I need to make I'd say "if iskaggle" and put those changes there. So you can see here: if I'm not on Kaggle and I don't have the data yet, then download it. Kaggle has a little API which is quite handy for doing stuff like downloading data, uploading notebooks, and submitting to competitions. If we are on Kaggle, then the data is already going to be there for us, which is actually a good reason for beginners to use Kaggle: you don't have to worry about grabbing the data at all. It's sitting there for you as soon as you open the notebook.

Kaggle has a lot of Python packages installed, but not necessarily all the ones you want, and at the point I wrote this they didn't have Hugging Face's datasets package for some reason. You can always just install stuff. You might remember the exclamation mark means this is not a Python command but a shell command, a bash command. And it's quite neat: you can even put bash commands inside Python conditionals. So that's a pretty cool little trick in notebooks. Another cool little trick in notebooks is that if you use a bash command like ls but you then want to insert the contents of a Python variable, just chuck it in curly braces. So I've got a Python variable called path, and I can go ls with path in curly braces, and that will ls the contents of the Python variable path. So there's another little trick for you. All right, so when we ls that, we can see that there's some CSV files.
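The environment switch described above can be sketched as follows. This is a minimal sketch, not necessarily the notebook's exact code: the environment variable name KAGGLE_KERNEL_RUN_TYPE and both directory names are assumptions made for illustration, and running_on_kaggle is a hypothetical helper.

```python
import os
from pathlib import Path


def running_on_kaggle(env=os.environ):
    """Return True if the (assumed) Kaggle kernel variable is set."""
    return bool(env.get("KAGGLE_KERNEL_RUN_TYPE", ""))


iskaggle = running_on_kaggle()

# Hypothetical locations: Kaggle mounts competition data read-only,
# whereas elsewhere the data would be downloaded into a local folder.
if iskaggle:
    path = Path("../input/us-patent-phrase-to-phrase-matching")
else:
    path = Path("us-patent-phrase-to-phrase-matching")

print(iskaggle, path)
```

In a notebook cell, a shell command such as !ls {path} would then list whichever directory was chosen, since the notebook substitutes the Python variable inside the braces.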
So what I'm going to do is take you through roughly the kind of process I go through when I first look at a competition. The first thing is: all right, a dataset; what's in it? Okay, so it's got some CSV files. As well as looking at it here, the other thing I would do is go to the competition website and go to Data. A lot of people skip over this, which is a terrible idea, because it actually tells you what the dependent variable means, what the different files are, what the columns are, and so forth. So don't just rely on looking at the data itself, but look at the information that you're given about the data.

So, CSV files: CSV files are comma-separated values, so they're just text files with a comma between each field. We can read them using pandas, which for some reason is always called pd. Pandas is one of, I guess, probably four key libraries that you have to know to do data science in Python. Specifically, those four libraries are NumPy, Matplotlib, pandas, and PyTorch. NumPy is what we use for basic numerical programming, Matplotlib we use for plotting, pandas we use for tables of data, and PyTorch we use for deep learning. Those are all covered in a fantastic book by the author of pandas, the new version of which is actually available for free, I believe: Python for Data Analysis. So if you're not familiar with these libraries, just read the whole book.
It doesn't take too long to get through, it's got lots of cool tips, and it's very readable. I do find with a lot of people doing this course, often I see people trying to jump ahead and wanting to know how to create a new architecture or build a speech recognition system or whatever, but it then turns out that they don't know how to use these fundamental libraries. So it's always good to be bold and be trying to build things, but do also take the time to make sure you finish reading the fastai book and read at least Wes McKinney's book. That would be enough to really give you all the basic knowledge you need, I think.

So with pandas we can read a CSV file, and that creates something called a DataFrame, which is just a table of data, as you see. So now that we've got a DataFrame, we can see what we're working with. In Jupyter, when we just put the name of a variable containing a DataFrame, we get the first five rows, the last five rows, and the size. So we've got 36,473 rows.

Another thing I like to use for understanding a DataFrame is the describe method. If you pass include='object', that will describe basically all the string fields, the non-numeric fields. In this case there's four of those. And you can see here that the anchor field we looked at has actually only 733 unique values, so there's lots of repetition out of 36,000. This is the most common one; it appears 152 times. And then for context we also see lots of repetition: there's 106 of those contexts. So this is a nice little method; we can see a lot about the data at a glance. And when I first saw this in this competition, I thought, well, this is actually not that much language data when you think about it.
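The repetition figures that describe reports for a string column (the number of unique values, the most common value, and how often it appears) can also be computed directly, which may help make clear what those numbers mean. This is a sketch on a tiny made-up table: the ids and context codes are invented for illustration, and only the column names follow the competition data as described above.

```python
import io

import pandas as pd

# Toy stand-in for the training CSV; the rows are illustrative, not real data.
csv_text = """id,anchor,target,context,score
a1,abatement,abatement of pollution,A47,0.5
a2,abatement,act of abating,A47,0.75
a3,abatement,forest region,A47,0.0
a4,wood article,wooden article,B44,1.0
"""
df = pd.read_csv(io.StringIO(csv_text))

n_rows = len(df)                       # size of the table
n_unique = df["anchor"].nunique()      # "unique" in describe
counts = df["anchor"].value_counts()   # sorted, most common value first
top, freq = counts.index[0], int(counts.iloc[0])  # "top" and "freq"

print(n_rows, n_unique, top, freq)
```

On the real data the same calls would be expected to reproduce the figures quoted above for the anchor column.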
Each document is very short, three or four words really, and lots of it is repeated. So as I'm looking through it, I'm thinking: what are some key features of this dataset? And that would be something I'd be thinking: wow, we've got to do a lot with not very much unique data here.

So here's how we can just go ahead and create a single string like I described, which contains some kind of field separator plus the context, the target, and the anchor. We're going to pop that into a field called input. Something slightly weird in pandas is that there's two ways of referring to a column. You can use square brackets and a string to get the input column, or you can just treat it as an attribute. When you're setting it, you should always use the square-bracket form seen here. When reading it, you can use either; I tend to use the attribute form because it's less typing. So you can see now we've got these concatenated rows. head is the first few rows. So we've now got some documents to do NLP with.

Now the problem is, as you know from the last lesson, neural networks work with numbers, right? We're going to take some numbers, and we're going to multiply them by matrices, we're going to replace the negatives with zeros and add them up, and we're going to do that a few times. That's our neural network, with some little wrinkles, but that's the basic idea. So how on earth do we do that for these strings? There's basically two steps we're going to take. The first step is to split each of these into tokens. Tokens are basically words.
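The single-string construction and the two ways of referring to a column described above can be sketched like this. The marker strings (TEXT1, TEXT2, ANC1) and the two example rows are assumptions for illustration; as discussed later in the lesson, the exact markers are arbitrary.

```python
import pandas as pd

# Two illustrative rows with the competition's column names.
df = pd.DataFrame({
    "anchor": ["abatement", "wood article"],
    "target": ["abatement of pollution", "wooden article"],
    "context": ["A47", "B44"],
})

# Setting a new column: use square brackets and a string.
# Reading existing columns: attribute access is shorter to type.
df["input"] = ("TEXT1: " + df.context
               + "; TEXT2: " + df.target
               + "; ANC1: " + df.anchor)

print(df.input.head())
```

Each row of input is now one short document containing all three fields, with a constant marker in front of each so the fields can be told apart.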
We're going to split it into words. There's a few problems with splitting things into words, though. The first is that some languages, like Chinese, don't have words, or at least certainly not space-separated words. In fact, in Chinese it's sometimes a bit fuzzy to even say where a word begins and ends, and for some words the pieces are not even next to each other. Another reason is that, after we've split it into words or something like words, we're going to be getting a list of all of the unique words that appear, which is called the vocabulary, and every one of those unique words is going to get a number. As you'll see later on, the bigger the vocabulary, the more memory is going to get used and the more data we'll need to train. In general, we don't want a vocabulary to be too big. So instead, nowadays people tend to tokenize into something called subwords, which are pieces of words. I'll show you what it looks like. The process of turning text into smaller units like words is called tokenization, and we call them tokens instead of words. A token is just the more general concept of whatever we're splitting it into.

So we're going to get Hugging Face Transformers and Hugging Face datasets doing our work for us. What we're going to do is turn our pandas DataFrame into a Hugging Face datasets Dataset. It's a bit confusing: PyTorch has a class called Dataset, and Hugging Face has a class called Dataset, and they're different things. So this is a Hugging Face datasets Dataset. We can turn a DataFrame into a Dataset just using the from_pandas method, and so we've now got a Dataset. So if we take a look, it just tells us: all right.
It's got these features. And remember, input is the one we just created with the concatenated strings. And here's those 36,000 rows.

Okay, so now we're going to do these two things: tokenization, which is to split each text up into tokens, and then numericalization, which is to turn each token into its unique ID based on where it is in the vocabulary, the vocabulary, remember, being the list of unique tokens. Now, particularly in this stage, tokenization, there's a lot of little decisions that have to be made. The good news is you don't have to make them, because whatever pre-trained model you use, the people that pre-trained it made some decisions, and you're going to have to do exactly the same thing. Otherwise you'll end up with a different vocabulary to them, and that's going to mess everything up. So that means, before you start tokenizing, you have to decide on what model to use.

Hugging Face Transformers is a lot like timm. It has a library of, I believe, hundreds of models. I guess I shouldn't say Hugging Face Transformers; it's really the Hugging Face model hub: 44,000 models, so many more even than timm's image models. These models vary in a couple of ways. There's a variety of different architectures, just like in timm, but then something which is different to timm is that each of those architectures can be trained on different corpuses for solving different problems. So for example, I could type "patent" and see if there's any pre-trained patent model. Okay, so there's a patent.
There's a whole lot of pre-trained patent models. Isn't that amazing? So quite often, thanks to the Hugging Face model hub, you can start with a pre-trained model that's actually pretty similar to what you want to do, or at least was trained on the same kind of documents. Having said that, there are some generally pretty good models that work for a lot of things a lot of the time, and DeBERTa v3 is certainly one of those.

This is a very new area. NLP has been practically really effective for general users for only a year or two, whereas for computer vision it's been quite a while. So you'll find that a lot of things aren't quite as well bedded down. I don't have a picture to show you of which models are the best or the fastest or the most accurate. A lot of this stuff is stuff that we're figuring out as a community, using competitions like this, in fact. This is one of the first NLP competitions in the modern NLP era. We've been studying these competitions closely, and I can tell you that DeBERTa is actually a really good starting point for a lot of things. So that's why we've picked it.

So we pick our model, and just like in timm for images, our models are often going to come in a small, a medium, and a large. And of course we should start with small, because small is going to be faster to train, we're going to be able to do more iterations, and so forth. So at this point, remember, the only reason we picked our model is because we have to make sure we tokenize in the same way. To tell Transformers that we want to tokenize the same way that the people who built the model did, we use something called AutoTokenizer. It's nothing fancy.
It's basically just a dictionary which says which model uses which tokenizer. So when we say AutoTokenizer.from_pretrained, it will download the vocabulary and the details about how this particular model tokenized its dataset.

So at this point we can take that tokenizer and pass a string to it. If I pass the string "G'day folks, I'm Jeremy from fast.ai", you'll see it's putting it into words, kind of, and kind of not. So if you've ever wondered whether "G'day" is one word or two, it's actually three tokens according to this tokenizer. And "I'm" is three tokens, and "fast.ai" is three tokens. This punctuation is a token. So you get the idea. These underscores here represent the start of a word. So there's this concept that the start of a word is part of the token. If you see a capital I in the middle of a word versus at the start of a word, that means a different thing. So this is what happens when we tokenize this sentence using the tokenizer that the DeBERTa v3 developers used.

So here's a less common sentence, unless you're a big platypus fan like me: "A platypus is an ornithorhynchus anatinus." And so, okay, in this particular vocabulary, platypus got its own word.
Or rather its own token, but ornithorhynchus didn't. I still remember that in grade one, for some reason, our teacher got us all to learn how to spell ornithorhynchus, so it's one of my favorite words. You can see here it's been split into or-ni-tho-rhynch-us. Every one of these tokens you see here is going to be in the vocabulary, the list of unique tokens that was created when this particular pre-trained model was first trained. So somewhere in that list we'll find underscore capital A, and it'll have a number, and that's how we'll be able to turn these into numbers. So this first process is called tokenization, and then the thing where we take these tokens and turn them into numbers is called numericalization.

So, our dataset: remember we put our string into the input field. Here's a function that takes a document, grabs its input, and tokenizes it. We'll call this our tokenization function. Tokenization can take a minute or two, so we may as well get all of our processors doing it at the same time to save some time. If you use the Dataset's map method, it will parallelize that process; just pass in your function. Make sure you pass batched=True so it can do a bunch at a time. Behind the scenes this is going through something called the tokenizers library, which is a pretty optimized Rust library that uses SIMD and parallel processing and so forth, so with batched=True it'll be able to do more stuff at once. So look, it only took six seconds.
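The two steps just described, tokenization followed by numericalization, applied to a batch of documents at a time, can be sketched in plain Python. This is a toy under loud assumptions: it splits on whitespace and builds its own vocabulary, whereas the real workflow uses the subword tokenizer and the fixed vocabulary that come with the pre-trained model. The names tokenize, build_vocab, and tok_func are made up for this example.

```python
def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace (not subwords)."""
    return text.lower().split()


def build_vocab(texts):
    """Give each unique token an integer ID in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def tok_func(batch, vocab):
    """Batched step: take a dict holding a list of input strings and
    return a dict holding one list of token IDs per string."""
    return {"input_ids": [[vocab[t] for t in tokenize(s)]
                          for s in batch["input"]]}


docs = {"input": ["abatement of pollution", "act of abating"]}
vocab = build_vocab(docs["input"])
encoded = tok_func(docs, vocab)

print(vocab)
print(encoded["input_ids"])
```

Note that "of" gets the same ID in both documents, because numericalization is just a dictionary lookup. Processing a whole batch per call is what lets a fast tokenizer implementation do many strings at once.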
So tokenization is pretty fast. Now when we look at a row of our tokenized dataset, it's going to contain exactly the same input as our original dataset, and it's also going to contain a bunch of numbers. These numbers are the position in the vocabulary of each of the tokens in the string. So we've now successfully turned a string into a list of numbers, and that is a great first step. We can see how this works. For example, we've got "of" as a separate word, so that's going to be underscore "of" in the vocabulary. We can grab the vocabulary, look up "of", find that it's 265, and check here: yep, here it is, 265. Okay, so it's not rocket science, right? It's just looking stuff up in a dictionary to get the numbers. So that is the tokenization and numericalization necessary in NLP to turn our documents into numbers, to allow us to put them into our model. Any questions so far, John? Yeah, thanks Jeremy. There's a couple, and this seems like a good time to throw them out. It's related to how you've formatted your input data into these sentences that you've just tokenized. One question was really about how you choose those keywords and the order of the fields. So I guess people are just interested in an explanation: is it more art or science? No, it's arbitrary. I tried a few things. I tried putting them backwards; it doesn't matter. We just want something that it can learn from, right? If I just concatenated it without these headers before each one, it wouldn't know where "abatement of pollution" ended and where "abatement" started. So I just did something that it can learn from. This is a nice thing about neural nets.
They're so flexible. As long as you give it the information somehow, it doesn't really matter how you give it the information, as long as it's there. I could have used punctuation. I could have put, I don't know, one semicolon here and two here and three here. It's not a big deal. At the level where you're trying to get an extra half a percent to get up the leaderboard in a Kaggle competition, you may find tweaking these things makes tiny differences, but in practice you won't generally find it matters too much. Right, thank you. And I guess the second part of that: somebody's asking, if one of their fields was particularly long, say a thousand characters, is there any special handling required? Do you need to re-inject those special marker tokens? Does it change if you've got much bigger fields that you're trying to learn and query? Yeah. Long documents in ULMFiT require no special consideration. So IMDb
in fact has multi-thousand-word movie reviews, and it works great. To this day ULMFiT is probably the best approach for reasonably quickly and easily using large documents. Otherwise, if you use transformer-based approaches, large documents are challenging. Specifically, transformers basically have to do the whole document at once, whereas ULMFiT can split it into multiple pieces and read it gradually. That means you'll find that people trying to work with large documents tend to spend a lot of money on GPUs, because they need the big fancy ones with lots of memory. So generally speaking, I would say if you're trying to do stuff with documents of over 2000 words, you might want to look at ULMFiT. Try transformers and see if it works for you, but I'd certainly try both. For under 2000 words, transformers should be fine, unless you've got nothing but a laptop GPU or something with not much memory. So, Hugging Face Transformers has these expectations about your data which, as I say this right now, I find somewhat obscure and not particularly well documented, and which you kind of have to figure out. One of those is that it expects your target to be a column called labels. Once I figured that out, I just took our tokenized dataset and renamed our score column to labels, and everything started working. I don't know if at some point they'll make this a bit more flexible, but it's probably best to just call your target labels, and life will be easy. You might have seen, back when I listed the files in path, that there was another dataset there called test.csv. If you look at it, it looks a lot like our training set, train.csv, the other CSV that we've been working with, but it's missing the score, the labels. This is called a test set. We're going to talk a little bit about that now, because my claim here is that perhaps the most important idea in
machine learning is the idea of having separate training, validation, and test datasets. Test and validation sets are all about identifying and controlling for something called overfitting, and we're going to try and learn about this through example. This is the same information that's in that Kaggle notebook; I've just put it on some slides here. I'm going to create a function here called plot_poly, and I'm actually going to use the same data that, I don't know if you remember, we used earlier for trying to fit this quadratic. We created some x and some y data. This is the data we're going to use to look at overfitting. The details of this function don't matter too much. What matters is what we do with it, which is that it allows us to pass in the degree of a polynomial. For those of you that remember, a first-degree polynomial is just a line, y = ax + b. A second-degree polynomial will be y = ax² + bx + c. A third-degree polynomial will have a cubic, a fourth-degree a quartic, and so forth. What I've done here is plotted what happens if we try to fit a line to our data. It doesn't fit very well. So what happened here is we did a linear regression, and what we're using here is a very cool library called scikit-learn. I think it'd be fair to say scikit-learn is mainly designed for classic machine learning methods, like linear regression and stuff like that (I mean, very advanced versions of these things), but it's also great for doing these quick and dirty things. In this case I wanted to do what's called a polynomial regression, which is fitting a polynomial to data, and it's just these two lines of code.
scikit-learn is a super nice library. In this case a degree-one polynomial is just a line, so I fit it and then show it with the data, and there it is. Now that's what we call underfit, which is to say there's not enough complexity in this model I fit to match the data that's there. An underfit model is a problem. It's going to be systematically biased: all the stuff up here we're going to be predicting too low, all the stuff down here we're predicting too low, and all the stuff in the middle we'll be predicting too high. A common misunderstanding is that simpler models are more reliable in some way, but models that are too simple will be systematically incorrect, as you see here. What happens if we fit a 10-degree polynomial? That's not great either. In this case it's not really showing us what the actual function is. Remember, it's originally a quadratic that this is meant to match. And particularly at the ends here, it's predicting things that are way above what we would expect in real life. It's trying really hard to get through this point, but clearly this point was just some noise. So this is what we call overfit. It's done a good job of fitting to our exact data points, but if we sample some more data points from this distribution, honestly, we would suspect they're not going to be very close to this, particularly if they're a bit beyond the edges. So that's what overfitting looks like.
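To make the underfit and overfit cases concrete, here is a minimal sketch. Note the assumptions: it swaps in NumPy's `polyfit` rather than the scikit-learn code used in the lesson, and the quadratic `true_f`, the noise level, and the 20 sample points are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    # Hypothetical "real" quadratic that generated the data.
    return 3 * x**2 + 2 * x + 1

x = np.linspace(-2, 2, 20)
y = true_f(x) + rng.normal(scale=1.5, size=x.shape)  # add some noise

def train_mse(degree):
    """Fit a polynomial of the given degree and return its error on the SAME points."""
    coeffs = np.polyfit(x, y, degree)
    preds = np.polyval(coeffs, x)
    return float(np.mean((preds - y) ** 2))

errors = {d: train_mse(d) for d in (1, 2, 10)}
for d, e in errors.items():
    print(f"degree {d:2d}: training MSE = {e:.3f}")
```

The training error can only go down as the degree goes up, so the degree-10 fit looks best by this measure even though it is chasing the noise. That is why training error alone cannot tell us when we have overfit.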
We don't want underfitting or overfitting. Now, underfitting is actually pretty easy to recognize, because we can look at our training data and see that the fit is not very close. Overfitting is a bit harder to recognize, because the fit to the training data is actually very close. On the other hand, here's what happens if we fit the quadratic. Here I've got both the real line and the fit line, and you can see they're pretty close, and that's of course what we actually want. So how do we tell whether we have something more like this or something more like this? Well, what we do is pretty straightforward: we take our original dataset, these points, and we remove a few of them, let's say 20 percent of them. We then fit our model using only those points we haven't removed, and then we measure how good it is by looking at only the points we removed. So in this case, let's say we had removed this point here. Then the curve might have gone off down over here, and when we look at how well it fits, we would say, oh, this one's miles away. The data that we take away and don't let the model see when it's training is called the validation set. In fastai we've seen splitters before; the splitters are the things that separate out the validation set. fastai won't let you train a model without a validation set. fastai always shows you your metrics, things like accuracy, measured only on the validation set. This is really unusual. Most libraries make it really easy to shoot yourself in the foot by not having a validation set, or accidentally not using it correctly. fastai won't even let you do that, so you've got to be particularly careful when using other libraries. Hugging Face Transformers is good about this.
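The hold-out procedure described above (remove 20 percent of the points, fit on the rest, measure only on the removed points) might be sketched like this. It is an illustration under assumptions: the data is synthetic, the split is a plain random one, and NumPy's `polyfit` stands in for whatever model you are actually training.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 40)
y = 3 * x**2 + 2 * x + 1 + rng.normal(scale=1.5, size=x.shape)  # made-up noisy quadratic

# Randomly hold out 20% of the points as the validation set.
idx = rng.permutation(len(x))
n_valid = len(x) // 5
valid_idx, train_idx = idx[:n_valid], idx[n_valid:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

results = {}
for degree in (1, 2, 10):
    # Fit using ONLY the training points; the validation points stay unseen.
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    results[degree] = (mse(coeffs, x[train_idx], y[train_idx]),
                       mse(coeffs, x[valid_idx], y[valid_idx]))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.3f}, "
          f"valid MSE = {results[degree][1]:.3f}")
```

The number to compare across models is the validation error, not the training error. A random split is used here only because the synthetic points have no time order or grouping.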
That library makes sure to show you your metrics on a validation set. Now, creating a good validation set is not generally as simple as just randomly pulling some of your data out of the data that you train your model with. The reason why is this: imagine that this was the data you were trying to fit something to, and you randomly remove some, so it looks like this. That looks very easy, doesn't it? Because you've still got all the data you would want around the missing points. In a time series like this, with dates and sales, in real life you're probably going to want to predict future dates. So if you created your validation set by randomly removing stuff from the middle, it's not really a good indication of how you're going to be using this model. Instead you should truncate and remove the last couple of weeks. If this was your validation set and this is your training set, that's actually going to be testing whether you can use this to predict the future, rather than using it to predict the past. Kaggle competitions are a fantastic way to test your ability to create a good validation set, because Kaggle competitions generally only allow you to submit a couple of times a day. The dataset that you are scored on in the leaderboard during that time is actually only a small subset; in fact, it's a totally separate subset from the one you'll be scored on at the end of the competition. And so most beginners on Kaggle overfit, and it's not until you've done it that you will get that visceral feeling of, oh my god, I overfit. In the real world, outside of Kaggle, you will often not even know that you overfit. You just destroy value for your organization silently. So it's a really good idea to do this kind of stuff on Kaggle a few times first, in real competitions, to really make sure that you are confident you know how to avoid overfitting, how to find a good validation set, and how to interpret it correctly. And you really don't get that
until you screw it up a few times. A good example of this was a distracted driver competition on Kaggle. There were these pictures from inside a car, and the idea was that you had to try and predict whether somebody was driving in a distracted way or not. On Kaggle they did something pretty smart: the test set, the thing that they scored you on on the leaderboard, contained people that didn't exist at all in the competition data that you train the model with. So if you wanted to create an effective validation set in this competition, you would have to make sure that you separated the photos so that your validation set contained photos of people that aren't in the data you're training your model on. There was another one like that, the Kaggle fisheries competition, which had boats in the test set that didn't appear in the training data. There were basically pictures of boats, and you were meant to predict what fish were in the pictures. It turned out that a lot of people accidentally figured out what the fish were by looking at the boat, because certain boats tended to catch certain kinds of fish. And so by messing up their validation set, they were really overconfident about the accuracy of their model. I'll mention in passing: if you've been around Kaggle a bit, you'll see people talk about cross-validation a lot. I'm just going to say, be very, very careful. Cross-validation is explicitly not about building a good validation set, so you've got to be super careful if you ever do that. Another thing I'll mention is that scikit-learn conveniently offers something called train_test_split, as does Hugging Face Datasets, as does fastai, where we have something called RandomSplitter.
It can almost feel like these libraries are encouraging you to use a randomized validation set, because there are these methods that do it for you. But be very, very careful, because very often that's not what you want. Okay, so we've learned what a validation set is: it's the bit that you pull out of your data that you don't train with, but you do measure your accuracy with. So what's a test set? It's basically another validation set, but you don't even use it for tracking your accuracy while you build your model. Why not? Well, imagine you tried two new models every day for three months; that's how long a Kaggle competition goes for. You would have tried 180 models, and then you look at the accuracy on the validation set for each one. Some of those models would have got a good accuracy on the validation set potentially because of pure chance, just a coincidence. Then you get all excited and you submit that to Kaggle, and you think you're going to win the competition, and you mess it up. That's because you actually overfit using the validation set. So you actually want to know whether you have really found a good model or not. In fact, on Kaggle they have two test sets: the one that gives you feedback on the leaderboard during the competition, and a second test set which you don't get to see until after the competition is finished. So in real life you've got to be very careful about this, not to try so many models during your model-building process that you accidentally find one that's good by coincidence. And only if you have a test set that you've held out will you know that. Now, that leads to the obvious question, which is very challenging: what if you spent three months working on a model, it worked well on your validation set, you did a good job of locking that test set away in a safe so you weren't allowed to use it, and at the end of the three months you finally checked it on the test set, and it's terrible? What do you do?
Honestly, you have to go back to square one. There really isn't any choice other than starting again. This is tough, but it's better to know than to not know. So that's what a test set is for. So you've got a validation set. What are you going to do with it? What you're going to do with a validation set is measure some metrics. A metric is something like accuracy: it's a number that tells you how good your model is. Now, on Kaggle this is very easy. What metric should we use? Well, they tell us. Go to Overview, then Evaluation, and find out. It says, oh, we will evaluate on the Pearson correlation coefficient. Therefore this is the metric you care about. So one obvious question is: is this the same as the loss function? Is this the thing that we will take the derivative of, to find the gradient and use that to improve our parameters during training? And the answer is maybe, sometimes, but probably not. For example, consider accuracy. If we were using accuracy to calculate our derivative and get the gradient, you could have a model that's actually slightly better, doing a slightly better job of recognizing dogs and cats, but not so much better that it's actually caused any incorrectly classified cat to become a dog. So the accuracy doesn't change at all, and the gradient is zero. You don't want stuff like that. You don't want bumpy functions, because they don't have nice gradients. Often they don't have gradients at all.
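Here is a small sketch of why accuracy makes a poor loss. The labels and predictions are made up, the 0.5 threshold is an assumption, and mean absolute error is used as one example of a smooth alternative. Every prediction in `after` has been nudged slightly toward its target compared with `before`.

```python
targets = [1, 0, 1, 0]  # made-up labels: 1 = dog, 0 = cat

def accuracy(preds, targets, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum((p > threshold) == bool(t) for p, t in zip(preds, targets))
    return hits / len(targets)

def mean_abs_error(preds, targets):
    """Average absolute distance between each prediction and its target."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

before = [0.60, 0.40, 0.45, 0.70]
after = [0.65, 0.35, 0.48, 0.65]  # every prediction moved a little toward its target

print("accuracy:", accuracy(before, targets), "->", accuracy(after, targets))
print("MAE:     ", mean_abs_error(before, targets), "->", mean_abs_error(after, targets))
```

The model got slightly better everywhere, yet no prediction crossed the threshold, so accuracy is unchanged and gives the optimizer nothing to follow. The smoother measure does register the improvement.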
The gradients are basically zero nearly everywhere. You want a function that's nice and smooth, something like, for instance, the average absolute error, or mean absolute error, which we've used before. So that's the difference between your metrics and your loss. Now be careful, because when you're training, your model is spending all of its time trying to improve the loss, and most of the time that's not the same as the thing you actually care about, which is your metric. So you've got to keep those two different things in mind. The other thing to keep in mind is that in real life you can't go to a website and be told what metric to use. In real life, for the model that you choose, there isn't one number that tells you whether it's good or bad, and even if there was, you wouldn't be able to find it out ahead of time. In real life the model you use is part of a complex process, often involving humans, both as users or customers and as people involved as part of the process. There are all kinds of things that are changing over time, and there are lots and lots of outcomes of decisions that are made. One metric is not enough to capture all of that. Unfortunately, because it's so convenient to pick one metric and use that to say "I've got a good model", that very often finds its way into industry and into government, where people roll out these things that are good on the one metric that happened to be easy to measure. And again and again we've found people's lives turned upside down because of how badly they get screwed up by models that have been incorrectly measured using a single metric. My partner Rachel Thomas has written this article, which I recommend you read, about how the problem with metrics is a big problem for AI. It's not just an AI thing. There's actually this thing called Goodhart's law, which states that when a measure becomes a target, it ceases to be a good measure. When I was a management consultant 20 years ago, we were always part of these strategic
things, trying to find key performance indicators and ways to set commission rates for salespeople, and we were really doing a lot of this stuff, which is basically about picking metrics. We see that go wrong in industry all the time. AI is dramatically worse, because AI is so good at optimizing metrics. That's why you have to be extra, extra careful about metrics when you are trying to use a model in real life. Anyway, as I said, in Kaggle we don't have to worry about any of that. We are just going to use the Pearson correlation coefficient, which is all very well, as long as you know what the hell the Pearson correlation coefficient is. If you don't, let's learn about it. The Pearson correlation coefficient is usually abbreviated using the letter r, and it's the most widely used measure of how similar two variables are. If your predictions are very similar to the real values, then the Pearson correlation coefficient will be high, and that's what you want. r can be between minus one and one. Minus one means you predicted exactly the wrong answer, which in a Kaggle competition would be great, because then you can just reverse all of your answers and you'll be perfect. Plus one means you got everything exactly correct. Generally speaking, in courses or textbooks, when they teach you about the Pearson correlation coefficient, at this point they will show you a mathematical function. I'm not going to do that, because that tells you nothing about the Pearson correlation coefficient. What we actually care about is not the mathematical function but how it behaves, and I find most people, even those who work in data science, have not actually looked at a bunch of datasets to understand how r behaves. So let's do that right now, so that you're not one of those people. The best way I find to understand how data behaves in real life is to look at real-life data. scikit-learn comes with a number of datasets, and one of
them is called California Housing. It's a dataset where each row is a district, and it has some demographic information about different districts and about the value of houses in each district. I'm not going to try to plot the whole thing; it's too big. This is a very common question I get from people: how do I plot datasets with far too many points? The answer is very simple: get fewer points. So I just randomly grab a thousand points. Whatever you see with a thousand points is going to be the same as what you see with a million points. There's generally no reason to plot huge amounts of data; just grab a random sample. Now, NumPy has something called corrcoef to get the correlation coefficient between every variable and every other variable, and it returns a matrix. So I can look down here, and for example, here is the correlation coefficient between variable one and variable one, which of course is exactly 1.0, because variable one is the same as variable one. Here is the small inverse correlation between variable one and variable two, and a medium-sized positive correlation between variable one and variable three, and so forth. This is symmetric about the diagonal, because the correlation between variable one and variable eight is the same as the correlation between variable eight and variable one. So this is a correlation coefficient matrix. That's great when we want to get a bunch of values all at once. For the Kaggle competition we don't want that; we just want a single correlation number. If we just pass in a pair of variables, we still get a matrix, which is kind of weird. Well, it's not weird, it's just not what we want.
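The matrix behavior of `np.corrcoef` is easy to check on a small example. This sketch uses random made-up data rather than the California Housing set, with one column deliberately built to track another.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))  # 1000 rows, 3 made-up variables
# Make column 2 follow column 0 plus some noise, so the two are correlated.
data[:, 2] = data[:, 0] + 0.5 * rng.normal(size=1000)

# rowvar=False tells NumPy that each COLUMN is a variable.
full = np.corrcoef(data, rowvar=False)
print(full.shape)  # one row and one column per variable

# Passing just a pair of variables still returns a matrix.
pair = np.corrcoef(data[:, 0], data[:, 2])
print(pair.shape)

# The single number we care about is an off-diagonal entry.
r = pair[0, 1]
print(r)
```

The diagonal is all ones (each variable against itself) and the matrix is symmetric, so for a pair of variables either off-diagonal entry gives the one correlation value.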
So we should grab one of these. When I want to grab a single correlation coefficient, I'll just return the zeroth row, first column. So that's what corr is; that's going to be our single correlation coefficient. Let's look at the correlation between two things, for example median income and median house value: 0.67. Okay, is that high, medium, low? How big is that? What does it look like? The main thing we need to understand is what these things look like. So what I suggest we do is take a 10 minute break, well, a nine minute break. We'll come back at half past, and then we're going to look at some examples of correlation coefficients. Okay, welcome back. What I've done here is create a little function to show correlations. I pass in a data frame and a couple of columns as strings, grab each of those columns as a series, do a scatter plot, and then show the correlation. We already mentioned median income and median house value, with 0.68. So here it is; here's what 0.68 looks like. I don't know if you had some intuition about what you expected, but as you can see there's still plenty of variation, even at that reasonably high correlation. You can also see here that visualizing your data is very important if you're working with this dataset, because you can immediately see all these dots along here. That's clearly truncation, right?
It's not until you look at pictures like this that you're going to pick up on stuff like this. Pictures are great. Oh, a little trick: on the scatter plot I set alpha to 0.5, which creates some transparency. For these kinds of scatter plots that really helps, because it creates darker areas in places where there are lots of dots. So alpha in scatter plots is nice. Okay, here's another pair. This one's gone down from 0.68 to 0.43: median income versus the number of rooms per house. As you'd expect, more rooms means more income. But this is a very weird-looking thing. You'll find that a lot of these statistical measures, like correlation, rely on the square of the difference, and when you have big outliers like this, the square of the difference goes crazy. So this is another place where we want to look at the data first and say, oh, that's going to be a bit of an issue. There's probably more correlation here, but there are a few examples of houses with lots and lots of rooms where people who aren't very rich live. Maybe these are some kind of shared accommodation or something. So r is very sensitive to outliers. Let's get rid of the houses with 15 rooms or more. Now you can see it's gone up from 0.43 to 0.68, even though we probably only got rid of one, two, three, four, five, six, maybe seven data points. So we've got to be very careful of outliers, and that means if you're trying to win a Kaggle competition where the metric is correlation, and you get just a couple of rows really badly wrong, then that's going to be a disaster for your score, right?
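To get a feel for how much a single bad row can hurt, here is a small sketch in plain Python. The data is made up, and `pearson_r` is a hand-written helper for illustration (it computes the standard Pearson formula, which the lesson deliberately does not show).

```python
def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Twenty made-up points lying almost exactly on a straight line.
xs = list(range(1, 21))
ys = [2 * x + (-1) ** x for x in xs]
r_clean = pearson_r(xs, ys)

# Add ONE row that is badly wrong: a large x with a tiny y.
r_outlier = pearson_r(xs + [60], ys + [5])

print(f"without the outlier: r = {r_clean:.3f}")
print(f"with one outlier:    r = {r_outlier:.3f}")
```

Because the formula is built from squared differences, one extreme row out of twenty-one is enough to drag a near-perfect correlation down sharply.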
So you've got to make sure that you do a pretty good job on every row. So that's what a correlation of 0.68 looks like. Okay, here's a correlation of 0.34, and this is kind of interesting, isn't it? Because 0.34 sounds like quite a good relationship, but you almost can't see it. So this is something I strongly suggest: if you're working with a new metric, draw some pictures of a few different levels of that metric to try to get a feel for what it means. What does 0.6 look like? What does 0.3 look like? And so forth. And here's an example of a correlation of minus 0.2: a very slight negative slope. So that's just more of a general tip, something I like to do when playing with a new metric, and I recommend you do as well. I think we've now got a sense of what the correlation feels like. You can go look up the equation on Wikipedia if you're into that kind of thing. We need to report the correlation after each epoch, because we want to know how our training is going. Hugging Face expects you to return a dictionary, because it's going to use the keys of the dictionary to label each metric. So here's something that gets the correlation and returns it as a dictionary with the label pearson. Okay, so we've done metrics.
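A metric function of the kind just described might look roughly like the sketch below. The name `corr_d` and the fake inputs are assumptions for illustration, as is the shape handling: the code assumes it receives a (predictions, labels) pair and flattens both to one dimension before computing the correlation.

```python
import numpy as np

def corr(x, y):
    # np.corrcoef returns a 2x2 matrix for a pair; take the off-diagonal entry.
    return np.corrcoef(x, y)[0][1]

def corr_d(eval_pred):
    """Return the metric as a dict; the key becomes the label shown during training."""
    preds, labels = eval_pred
    # Flatten both to 1-D so a column of predictions lines up with the labels.
    preds = np.asarray(preds, dtype=float).reshape(-1)
    labels = np.asarray(labels, dtype=float).reshape(-1)
    return {"pearson": corr(preds, labels)}

# Made-up predictions (one column) and labels, just to exercise the function.
fake_preds = np.array([[0.1], [0.4], [0.35], [0.8]])
fake_labels = np.array([0.0, 0.5, 0.25, 1.0])
metrics = corr_d((fake_preds, fake_labels))
print(metrics)
```

Returning a dictionary rather than a bare number is what lets the training loop print a named column for each metric.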
We've done our training and validation split. Oh, we might have actually skipped over the bit where we did the split today. So, to actually do the split: for this Kaggle competition I've got another notebook, which we'll look at later, where we split this properly, but here we're just going to do a random split, just to keep things simple for now, where 25 percent of the data will be a validation set. If we call train_test_split on the dataset, it returns a DatasetDict, which has a train and a test. That looks a lot like a Datasets object in fastai; it's a very similar idea. So this will be the thing that we'll be able to train with. It's going to train with this dataset and return the metrics on this dataset. This is really a validation set, but Hugging Face Datasets calls it test. Okay, we're now ready to train our model. In fastai we use something called a Learner; the equivalent in Hugging Face Transformers is called Trainer, so we'll bring that in. Something we'll learn about quite shortly is the idea of mini-batches and batch sizes. In short, each time we pass some data to our model for training, it's going to send through a few rows at a time to the GPU, so that it can calculate those in parallel. That bunch of rows is called a batch or a mini-batch, and the number of rows is called the batch size. Here we're going to set the batch size to 128. Generally speaking, the larger your batch size, the more it can do in parallel at once, and the faster it'll be. But if you make it too big, you're going to get an out-of-memory error on your GPU. So it's a bit of trial and error to find a batch size that works. Epochs we've seen before. Then we've got the learning rate. We'll talk in the next lesson, unless we get to it in this lesson, about a technique to semi-automatically find a good learning rate. We already know what a learning rate is from the last lesson. I played around and found one that seems to train quite quickly
without falling apart. I just tried a few. Hugging Face Transformers doesn't have something to help you find the learning rate. The integration we're doing in fastai will let you do that. But if you're using a framework that doesn't have that, you can just start with a really low learning rate, then double it, and keep doubling it until it falls apart. Hugging Face Transformers uses this thing called TrainingArguments, which is a class where we just provide all of the configuration. So you have to tell it what your learning rate is. This stuff here is basically the same as what we call fit_one_cycle in fastai. You always want this to be true, because it's going to be faster, pretty much. And then this stuff here you can probably use exactly the same every time. There's a lot of boilerplate compared to fastai, as you see. Okay, so we now need to create our model. This is the equivalent of the vision_learner function that we've used to automatically create a reasonable vision model. In Hugging Face Transformers they've got lots of different ones depending on what you're trying to do. We're trying to do classification of sequences, as we've discussed. So if we call AutoModelForSequenceClassification, it will create a model that is appropriate for classifying sequences from a pre-trained model, and this is the name of the model that we used earlier, DeBERTa v3. It has to know, when it adds that random matrix to the end, how many outputs it needs to have. We have one label, which is the score. So that's going to create our model, and then this is the equivalent of creating a Learner. It contains a model and the data: the training data and the test data. Again, there's a lot more boilerplate here than in fastai, but you can kind of see the same basic steps here.
We just have to do a little bit more manually, but there's nothing too crazy. It's going to tokenize it for us using that function, and then these are the metrics that it will print out each time. That's that little function we created, which returns a dictionary. At the moment I find Hugging Face Transformers very verbose. It spits out lots and lots of text, which you can ignore. And we can finally call train, which will spit out much more text again, which you can ignore. As you can see, as it trains it's printing out the loss, and here's our Pearson correlation coefficient. So it's training, and we've got a 0.834 correlation. That's pretty cool, right? It took, what does it actually say, here we are, five minutes to run (or maybe that's five minutes per epoch) on Kaggle, which doesn't have particularly great GPUs, but they're good for free. And we've got something that has a very high level of correlation in assessing how similar the two columns are. The only reason it could do that is because it used a pre-trained model, right? There's no way you could just take that tiny amount of information and figure out whether those two columns are very similar. This pre-trained model already knows a lot about language. It already has a good sense of whether two phrases are similar or not, and we've just fine-tuned it. You can see, given that after one epoch it was already at 0.8, this was a model that already did something pretty close to what we needed. It didn't really need that much extra tuning for this particular task. Have we got any questions there, John? Yeah, we do. It's actually a bit back on the topic before, where you were showing us the visual interpretation of the Pearson coefficient and you were talking about outliers.
Yeah, and we've got a question here from Kevin asking: how do you decide when it's okay to remove outliers? You pointed out some in that data set, and clearly your model is going to train a lot better if you clean that up. But I think Kevin's point here is that those kinds of outliers will probably exist in the test set as well. So I think he's just looking for some practical advice on how you handle that in a more general sense.
So outliers should never just be removed for modeling. If we take the example of the California housing data set, if I was really working with that data set in real life, I would be saying, oh, that's interesting, it seems like there's a separate group of districts with a different kind of behavior. My guess is that they're going to be dorms or something like that, probably low-income housing. And so I would be saying, clearly, from looking at this data set, these two different groups can't be treated the same way. They have very different behaviors, and I would probably split them into two separate analyses. The word outlier kind of exists in a statistical sense, right? There can be things that are well outside our normal distribution and mess up our metrics and things. It doesn't exist in a real sense. It doesn't exist in the sense of, oh, things that we should ignore or throw away. Some of the most useful insights I've had in my life in data projects have come from digging into so-called outliers and understanding: well, what are they, and where did they come from?
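To make the point about outliers messing up metrics concrete, here is a small sketch of the Pearson correlation coefficient computed by hand, wrapped in a dictionary-returning metric function like the one mentioned earlier. The data is made up for illustration, and the names pearson and metrics_dict are hypothetical rather than taken from the lesson's notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def metrics_dict(preds, labels):
    """Return the metric in a dictionary keyed by name."""
    return {"pearson": pearson(preds, labels)}


# Made-up data: an almost perfectly linear relationship.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9]
r_clean = pearson(xs, ys)

# The same data plus one extreme point far from the trend.
r_outlier = pearson(xs + [9], ys + [-20.0])

print(metrics_dict(xs, ys))
print(r_clean, r_outlier)
```

One extreme point is enough to drag the coefficient from close to 1 down to a negative value, which is a reason to investigate such points rather than quietly delete them.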
And it's often in those edge cases that you discover really important things about where processes go wrong, or about kinds of behaviors you didn't even know existed, or indeed about labeling problems or process problems, which you really want to fix at the source, because otherwise when you go into production you're going to have more of those so-called outliers. So, yeah, I'd say never delete outliers without investigating them and having a strategy for understanding where they came from and what you should do about them.
All right, so now that we've got a trained model, you'll see that it actually behaves really a lot like a fastai learner. And hopefully the impression you'll get from going through this process is largely a sense of familiarity. It's like, oh yeah, this looks like stuff I've seen before, a bit more wordy and with some slight changes, but it really is very, very similar to the way we've done it before. Because now that we've got a trained trainer, rather than a learner, we can call predict, and now we're going to pass in our data set from the Kaggle test file. And so that's going to give us our predictions, which we can cast to float. And here they are. So here are the predictions we made of similarity.
Now again, not just your inputs but also your outputs: always look at them. Always. And interestingly, I looked at quite a few Kaggle notebooks from other people for this competition, and nearly all of them had the problem we have right now, which is negative predictions and predictions over one. So I'll be showing you how to fix this in a more proper way, hopefully in the next lesson. But for now, we could at least just round these off, right?
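Rounding off here means clamping each prediction into the valid score range from 0 to 1. A minimal plain-Python sketch, with made-up prediction values and a hypothetical helper named clip:

```python
def clip(values, lo=0.0, hi=1.0):
    """Clamp every value into the closed range [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]


# Made-up raw predictions, including one below 0 and one above 1.
raw_preds = [-0.12, 0.03, 0.48, 0.97, 1.08]
clipped = clip(raw_preds)
print(clipped)  # [0.0, 0.03, 0.48, 0.97, 1.0]
```

On arrays and tensors, NumPy's np.clip and PyTorch's torch.clip do the same job in one call.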
Because we know that none of the scores are going to be bigger than one or smaller than zero, our correlation coefficient will definitely improve if we at least round the negative ones up to zero and round the ones over one down to one. As I say, there are better ways to do this, but that's certainly better than nothing. So in PyTorch, you might remember from when we looked at ReLU, there's a thing called clip, and that will clip everything under zero to zero and everything over one to one. And so now that looks much better. So here are our predictions.
Kaggle expects submissions to generally be in a CSV file, and Hugging Face Datasets kind of looks a lot like pandas, really. We can create our submission file with our two columns, call to_csv, and there we go. That's basically it.
So, yeah, it's kind of nice to see how far deep learning has come since we started this course a few years ago. Nowadays there are multiple libraries around to do the same thing. We can use them in multiple application areas. They all look pretty familiar. They're reasonably beginner friendly. And NLP, because it's the most recent area that's really become effective, in the last year or two, is probably where the biggest opportunities are for big wins, both in research and commercialization. And so if you're looking to build a startup, for example, one of the key things that VCs look for, that they'll ask, is: well, why now? Why would you build this company now? And of course, with NLP the answer is really simple.
It can often be: well, until last year this wasn't possible, or it took 10 times more time, or it took 10 times more money, or whatever. So I think NLP is a huge opportunity area.
Okay, so it's worth thinking about both use and misuse of modern NLP. And I want to show you a subreddit. Here is a conversation on a subreddit from a couple of years ago. I'll let you have a quick read of it. So the question I want you to be thinking about is: what subreddit do you think this comes from, this debate about military spending? And the answer is, it comes from a subreddit that posts automatically generated conversations between GPT-2 models. Now, this is a totally previous generation of model. They're much, much better now. So even then you could see these models were generating context-appropriate, believable prose. I would strongly believe that any of our upper tier of competent fastai alumni would be fairly easily able to create a bot which could create context-appropriate prose on Twitter or Facebook groups or whatever, arguing for a side of an argument. And you could scale that up such that 99 percent of Twitter was these bots, and nobody would know. And that's very worrying to me, because a lot of the way people see the world is now really coming out of their social media conversations, which at this point are controllable. It would not be that hard to create something that's optimized towards moving a point of view amongst a billion people, in a very subtle way, very gradually, over a long period of time, by multiple bots each pretending to argue with each other, and one of them getting the upper hand, and so forth.
Here is the start of an article in The Guardian, which I'll let you read. This article was quite long.
This is just the first few paragraphs. And at the end it explains that this article was written by GPT-3. It was given the instruction: "Please write a short op-ed, around 500 words. Keep the language simple and concise. Focus on why humans have nothing to fear from AI." GPT-3 produced eight outputs, and then they say, basically, the editors at The Guardian did about the same level of editing that they would do for humans. In fact, they found it required a bit less editing than humans. So again, you can create longer pieces of context-appropriate prose designed to argue a particular point of view.
What kind of things might this be used for? We won't know, probably for decades, if ever, but sometimes we get a clue based on older technology. Here's something from back in 2017, in the pre-deep-learning NLP days. There were millions of submissions to the FCC about the net neutrality situation in America, very, very heavily biased towards the point of view of saying we want to get rid of net neutrality. An analysis by Jeff Kao showed that something like 99% of them, and in particular nearly all of the ones which were pro removal of net neutrality, were clearly auto-generated. Basically, if you look at the green, it's like selecting from a menu. So we've got "Americans, as opposed to Washington bureaucrats, deserve to enjoy the services they desire", "Individuals, as opposed to Washington bureaucrats, should be...", "People like me, as opposed to so-called experts, should be...", and you get the idea, right?
This is an example of a very, very simple approach to auto-generating huge amounts of text. We don't know for sure, but it looks like this might have been successful, because this went through. Despite what seems to be actually overwhelming disagreement from the public, in that almost everybody likes net neutrality, the FCC got rid of it. And a big part of the basis was: oh, we got all these comments from the public, and everybody said they don't want net neutrality. So imagine a similar thing where you absolutely couldn't do this, you couldn't figure it out, because every one was really very compelling and very different. It's kind of worrying to think about how we deal with that.
I will say, when I talk about this stuff, often people say, oh, no worries, we'll build a model to recognize bot-generated content. But if I put my black hat on, I'm like, no, that's not going to work, right? If you told me to build something that beats the bot classifiers,
I'd say, no worries, easy. I will take the code or the service or whatever that does the bot classifying, and I will include beating that in my loss function, and I will fine-tune my model until it beats the bot classifier. When I used to run an email company, we had a similar problem with spam prevention. Spammers could always take a spam prevention algorithm and change their emails until they didn't get caught by the spam prevention algorithm anymore, for example.
So, yeah, I'm really excited about the opportunities for students in this course to build, I think, very valuable businesses or really cool research and so forth, using these pretty new NLP techniques that are now pretty accessible. And I'm also really worried about the things that might go wrong. I do think, though, that the more people that understand these capabilities, the less chance they'll go wrong.
John, were there some questions?
Yeah, I mean, it's a throwback to the workbook that you had before. Yeah, that's the one. The question Menekundan is asking: shouldn't num_labels be 5 (0, 0.25, 0.5, 0.75, 1) instead of 1? Isn't the target categorical, or are we considering this as a regression problem?
Yeah, it's a good question. So there's one label because there's one column. Even if this was being treated as a categorical problem with five categories, it's still considered one label. In this case, though, we're actually treating it as a regression problem. It's just one of the things that's a bit tricky. I was trying to figure this out just the other day. It's not documented, as far as I can tell, on the Hugging Face Transformers website, but this is what happens if you pass in one label to AutoModelForSequenceClassification.
It turns it into a regression problem, which is actually why we ended up with predictions that were less than zero and bigger than one. So we'll be learning next time about the use of sigmoid functions to resolve this problem, and that should fix it up for us.
Okay, great. Well, thanks, everybody. I hope you enjoyed learning about NLP as much as I enjoyed putting this together. I'm really excited about it, and I can't wait for next week's lesson. See you
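For reference ahead of the next lesson, a sigmoid squashes any real number into the open interval between 0 and 1, which is why it can keep regression outputs inside the valid score range. The sketch below only shows the function itself on made-up raw outputs. How the next lesson actually wires it into the model is not covered here, so treat that part as an open assumption.

```python
import math


def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Made-up raw model outputs, some outside the 0 to 1 score range.
for raw in [-3.0, 0.0, 0.5, 4.0]:
    print(raw, round(sigmoid(raw), 3))
```

Unlike clipping, which flattens everything outside the range onto the boundary, the sigmoid is smooth and keeps the ordering of the raw outputs.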