Yeah. So before I dive into the talk, I want to mention that you can ask questions during the talk, I'm happy to take them, and there should also be a lot of time at the end for more questions. Cool.

So before going into embeddings, I want to give a sense of how things looked before we started doing embeddings at OpenAI. A lot of our big releases used generative models. Generative models are trained to maximize the likelihood of data, and they can generate realistic content. And apart from just generating realistic content, when a model is trained to maximize likelihood, it usually starts building representations that are useful for many downstream tasks.

For example, one of our first releases in the generative modeling paradigm was the GPT class of models. Here the first two or three lines are the prompt given to the model, and the model writes the rest of the poem. And apart from writing poems, it's really good at many things, like summarization, machine translation, and so on: basically anywhere you want text to be generated.

It also turned out that a very good generative model trained on lots of unsupervised data was good at solving many tasks in what people refer to as few-shot learning. Here we give the model a few examples of the desired behavior: each example starts with a poor-English input followed by the corrected version, as shown here. At the end, you give it a new poor-English input, like "I'd be more than happy to work with you in another project," and the model generates the corrected version. This was a big deal. Typically machine learning models work by having lots of task-specific data, and here you had a model that was trained to maximize the likelihood of internet data, basically, and this new behavior emerged from that training. This was one of the reasons GPT-3 got so much attention.

Similar to text generation, we also started doing work on code generation. Here you give the model your helper function and a prompt, which is the comment in the first line, and the model fills out the code for you. So we have things like Copilot, which was done in collaboration with GitHub; you can play with these models if you install it. These models are pretty good at generating code given a text description of the task you want.

Cool. This same principle of generative models went from text to code, and then recently we put out DALL-E 2, where you give it a text description, as shown on the right, and the model generates the image for you. This is a pretty cool application of, again, the high-level idea of training generative models to maximize the likelihood of your data. The same principle works across modalities, and you don't really need a lot of task-specific data: you can train these models on internet data, and they are starting to work really well.

To give you a little more detail on how these generative models work, here is an example for text generation. Let's say it's a character-level model.
So let's say you have the word "hello." Since it's a character-level language model, the model reads one character at a time. That's the input: H, followed by E, and so on. And at every time step, the model has to predict the next character in the sequence, so at time step one it has to predict E, then L, and so on. How does it go about doing that? It first has a unique slot for every character, and then there's a lookup table that produces what are called embeddings. These are just vector representations that the model slowly gets better at over training, and they are optimized to predict the next item in the sequence. These models are deep, which means there are multiple layers of vector representations before the model predicts the output. So this is a standard deep learning setup.

The thing with generative models is that, among all the intermediate vector representations or embeddings, the model is not explicitly optimized to produce a single vector that captures the whole input. Because the training objective is to predict the next character, after training there isn't one place in the model you can point to and say: this vector has the most information about the input. And that property, having a single vector representation of an input, would actually be useful for many tasks.

Think about an application like search. You want to search over a million documents, a billion documents, whatever. In those cases you need to process the documents, typically offline, the way Google does, compute some kind of representation, and build an index. That's how search systems work, and there's no natural way to do this with the output of a generative model: you can't really use it to build vector indices over large collections. Then there's data visualization: for clustering or any kind of visual analysis of data, again it would be very useful to have a single vector that represents the whole input. And finally, there's what people refer to as linear probe classification. The idea is quite simple: you have a powerful model that gives you a vector representation, or embedding, of the input, and then you train a simple linear model using those embeddings as features. This is a very common setup, both in research and in industry. So there are a ton of applications where having a single embedding of the input is quite important, and the goal of our work is: can we get unsupervised models that are good at producing this kind of single embedding for a given input?

Okay. So, alongside generative models, there's a class of models called contrastive models. They take paired data as input, the model produces a single embedding for each side, and the similarity score is the cosine similarity between those two vectors. I'll get into more details about this setup, but for now the most important takeaway is that contrastive models are explicitly optimized to learn a single embedding of the input.
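To make that scoring step concrete, here's a minimal sketch in Python with NumPy: each input is mapped to a single vector, and the similarity score is the cosine of the angle between the two vectors. The random vectors below are placeholders standing in for the outputs of a real embedding model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity score between two embedding vectors: cosine of the angle between them."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy demo: random vectors stand in for embeddings produced by the model.
rng = np.random.default_rng(0)
x_emb = rng.normal(size=768)   # embedding of input X
y_emb = rng.normal(size=768)   # embedding of input Y
print(cosine_similarity(x_emb, y_emb))
```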
And the goal of this work is: can we learn high-quality text and code embeddings with contrastive models, using just unsupervised data? Cool. So this is the high-level outline of things I'm going to discuss today. Okay. I'll go slightly more into detail about the contrastive models. Any questions so far? Okay. Cool. I will keep going.

Cool. So for an embedding model, the basic building block is: you're given an input X, you add start-of-sequence and end-of-sequence tokens at the beginning and the end, you encode that with a neural network, and you get back a single vector representation of the whole input. One way to do this, and this is how we do it in our work, is to use the Transformer architecture. If you're familiar with it, the Transformer is like any other deep learning model in that it has multiple layers and takes the input and gradually builds vector representations at multiple levels of abstraction. In our work, we say the embedding of the input is the last vector, last in terms of the time step and from the last layer, and this is the embedding we use for computing similarity scores.

The second building block is computing similarity between two sequences; this is how the model is trained. Let's say you're given two inputs X and Y. You encode them, you get their embeddings, and the similarity is the cosine similarity between those two vectors. So it's quite simple.

To train these models, you need paired data. So for the X and Y you saw before, you need some way to get similar pairs. And similarity, if you think about it for text or code, is hard to pin down as a single notion. For example, a common thing that's hard to label is whether a thing and its negation are similar, and it kind of depends on the application. You'll see this come back: similarity is not completely well defined, and that leads to some issues. But apart from that, the idea is that the pair data you supply during training is how you define similarity, and the model learns to put two things that you say are similar close to each other in the vector space.

In a bit more detail, the input is a list of paired examples, and when you do gradient descent, you take a batch of your training data, and each single (X, Y) pair acts as a positive example. There's no explicit negative example provided; this is done because it makes it easier to consume unsupervised data. Instead, if you have an (X, Y) pair, all the other examples in the batch act as negatives for that example. So if you have a batch size of n, there are n minus one negatives for each example. It's just an assumption that these are negative examples, and it's motivated by the fact that if you construct negatives this way, you can reuse a lot of the computation and training becomes very efficient. A rough sketch of this setup is shown below.
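Here's a rough sketch of that in-batch negatives objective, assuming PyTorch: for a batch of n pairs, you compute all n-by-n cosine similarities in one matrix product, treat the diagonal entries as the positives, and apply a cross-entropy loss over each row. The temperature and dimensions are illustrative values, not the ones used for the actual models.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(x_emb, y_emb, temperature=0.05):
    """x_emb, y_emb: [n, d] embeddings of the two sides of n positive pairs.
    Each (x_i, y_i) is a positive; every other y_j in the batch acts as a negative."""
    x = F.normalize(x_emb, dim=-1)
    y = F.normalize(y_emb, dim=-1)
    logits = x @ y.T / temperature            # [n, n] cosine similarities
    targets = torch.arange(x.size(0))         # positives lie on the diagonal
    # Symmetric loss: match x -> y and y -> x.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random "embeddings": a batch of 8 pairs gives each example
# 7 in-batch negatives; a larger batch gives more (and harder) negatives.
x = torch.randn(8, 256)
y = torch.randn(8, 256)
print(in_batch_contrastive_loss(x, y))
```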
Cool. So the next thing I want to discuss is the source of the paired data. Again, our goal is to train these models with unsupervised data. For text, we just use neighboring pieces of text: neighboring paragraphs, a sentence in one paragraph paired with its adjacent paragraph, and things like that. Those are the positive examples. For code, it's the top comment of a function paired with the code itself. So that's where the paired data comes from.

Before we dive into more details about the exact recipe, I want to share a quick early experiment that we did. The batch size turns out to be really crucial with these models. By increasing it by an order of magnitude, the performance, for example on the search task shown here, goes up quite a lot. Intuitively, this happens because, as I said, if you have n examples in your batch, every example has n minus one negatives. By simply increasing the batch size, the chances that the model sees a hard negative example it has to contrast against go up, and that makes the model perform much better. This was one of the crucial insights coming out of this work, and a lot of our research later on was about how to scale batch size efficiently without blowing up the number of GPUs you would need to train these models.

Cool. So I'll now describe the high-level recipe to train these models, which is actually pretty simple. You collect pair data, which defines the notion of similarity you want from these models, and then you train with a sufficiently large batch size. One more detail is that the encoder is initialized from a pre-trained model: we found that the contrastive loss on its own is not a good enough training signal to learn high-quality embeddings. So, taking text embeddings as the example, we initialize the encoder with GPT, the text generation model, we collect text pairs from the internet, and then we train with a sufficiently large batch size. For the code embedding models, we initialize with Codex, the code generation model, collect comment-code pairs from open-source code, and again train with a sufficiently large batch size. So the same recipe works pretty well for producing these high-quality embeddings.

And before we dive into the results, I want to make a point about why our work is different from previous work. Almost all previous work trains separate embedding models for every task. If you want embeddings as classification features, you have one embedding model; if you want something for search, you have a completely different model, completely different in the sense that the architecture, the training data, the loss function, everything changes. For us, we wanted to see if we could get all of these kinds of desired downstream behavior from a single unsupervised model. That was the big motivation for the work, and next I'll discuss some experiments which seem to indicate that we do have a single, powerful, unsupervised embedding model.

Cool. So this is a simple experiment on data visualization. We took Amazon reviews, you get vectors on the order of a thousand dimensions from the embedding model, and then you do a low-dimensional projection; here we reduce it to two dimensions.
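As an illustration of that kind of projection, here's a minimal sketch assuming scikit-learn and matplotlib: reduce the embedding vectors to two dimensions with PCA and scatter-plot them, colored by a label such as review sentiment. The arrays below are random placeholders, not the actual Amazon review embeddings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholder data: in practice `X` would hold the ~1,000-dimensional review
# embeddings and `labels` the sentiment (which the model never saw in training).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1024))
labels = rng.integers(0, 2, size=500)  # 0 = negative review, 1 = positive review

coords = PCA(n_components=2).fit_transform(X)  # project embeddings to two dimensions
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="RdYlGn", s=8)
plt.title("Embeddings projected to 2D, colored by sentiment")
plt.show()
```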
And basically the reviews line up according to their sentiment, even though the model never saw the sentiment labels in the dataset; it inferred the property just from the text. So yeah, we project the text into two dimensions and then color the points based on whether the review was positive or negative. The negative ones are in shades of red and the positive ones in shades of green, and as you can see, there's a clear separation between positive and negative that the model has picked up without any labeled data.

As I said, these models are trained with a very large batch size. We trained four different models with different numbers of parameters, and the embedding dimensions go from 1,000 all the way up to 12,000. Okay, so those are the details of the setup.

The first experiment is a classification experiment. We tested on seven datasets; these are tasks like movie review classification, which is similar to the sentiment classification we just talked about, plus datasets on topic classification and things like that. The setup is: you're given input text and you want to assign it one of several labels. To evaluate the models, you take the embeddings from the unsupervised model and train a simple linear classifier with them as features (this linear-probe setup is sketched below). The idea is that if your model produces better features, that should lead to better downstream performance using those features. The first thing to note is that we do quite well even with the small model, and as we scale up, the performance goes up almost uniformly across tasks. We start doing much better than previous work on unsupervised models used as feature extractors.

Next, the task is the same, but I've added a set of experiment results in the bottom half. Here, the setting is that we take the unsupervised model and fine-tune it further on what is called NLI data. NLI is short for natural language inference: you're given pairs of sentences with human annotations on whether the relationship between them is entailment, contradiction, or neutral. So the model is further fine-tuned on this human-annotated data. And again, in this setting we do better than previous work, and we see a small improvement over just using the unsupervised model. But the bigger takeaway is probably that the big unsupervised model is actually better than previous approaches that use human-annotated data, which was the main focus of this research work.
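Here's a minimal sketch of that linear-probe evaluation, assuming scikit-learn: the embeddings stay frozen and only a logistic-regression classifier is fit on top of them as features. Random vectors and labels stand in for real embeddings and task labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: in practice `X` would be embeddings from the frozen model
# and `y` the task labels (e.g. positive/negative movie review).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The embedding model stays frozen; only this simple linear classifier is trained.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", probe.score(X_test, y_test))
```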
Cool. So the next set of experiments are on text search, and all of these are zero-shot evals: the model did not see any training data for the particular domain. There are a bunch of different text search datasets, and again, we get pretty good results with just the unsupervised models. And then, just as with classification, we fine-tuned on a small number of examples from MS MARCO, which is a public search dataset from Microsoft, and once we do that, we start doing much better than previous embedding methods.

I mean, it was common in the literature to not even evaluate unsupervised models on these tasks because they performed so poorly, and I think with our approach we were able to get these models to do non-trivial things, both for the classification we discussed before and now for search. With further fine-tuning, we get the best results. The last section of results from previous work are not embedding-based methods; they use more computation at query time, and those methods are more competitive with the results we show here.

Cool. Okay, so our models perform quite well on search and classification, and then we found this very interesting behavior, which comes back to what I was telling you about how it's hard to define a notion of similarity that is consistent across applications. There are academic datasets on sentence similarity, and even though our unsupervised model does really well on search and classification, it actually performs quite poorly on these sentence similarity tasks. We don't have a very concrete understanding of why this is happening, but our intuition is that similarity can mean different things in different contexts. I think there's a famous quote from a linguist along the lines of: there are infinitely many notions of similarity, so which one should be annotated? I think this is an instance of that. As I said, there are tasks where you want a sentence and its negation to be close to each other, and I feel like the search and classification datasets were annotated in a way where a thing and its negation being close is still fine, whereas for these sentence similarity tasks the model has to explicitly put them far apart, and the model did not learn that concept from unsupervised data. I think that's why the models don't perform that well there. Rough evidence for this is that as we train for more steps, from 2,000 steps to 50,000 steps, the model's performance on search and classification goes up while the performance on the sentence similarity tasks actually goes down. If you look at internet text, there's a lot of debate and argument going back and forth, and that back-and-forth ends up as neighboring pieces of text in the training pairs, so the model picks those up as being the same kind of thing, whereas these tasks want them to be separated. So that's our current intuition on why this is happening. But it was a cool thing that came out of this work: these models are pretty good for search and classification, but quite bad on another set of tasks.

Okay, so all the previous experiments used the text embedding model; now I'll move on to code search. The code search task is: given a natural-language query and a bunch of code blocks, pick the most relevant one. It's the same embed-and-rank pattern, which is sketched below. And here we do really well compared to previous approaches, as shown by the results: the model is just really good at retrieving the code function or code block that captures what the text query needs. We also did a much larger-scale experiment: instead of picking the correct code among 1,000 candidates, can it do the harder task of picking the code among 10,000 candidates? The performance drops a bit compared to the previous setting, but it's still pretty good at this task.
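Both the text search and code search setups come down to the same embed-and-rank pattern: embed the query, embed every candidate document or code block, and sort the candidates by cosine similarity to the query. Here's a minimal sketch of that pattern; the `encode` function is a hypothetical placeholder that returns random vectors just so the snippet runs on its own.

```python
import numpy as np

def encode(text: str) -> np.ndarray:
    """Hypothetical stand-in for the embedding model: returns a random vector.
    In practice this would be a call to the trained text or code embedding model."""
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).normal(size=256)

def rank(query: str, candidates: list) -> list:
    """Rank candidates (documents or code blocks) by cosine similarity to the query."""
    q = encode(query)
    q /= np.linalg.norm(q)
    scored = []
    for cand in candidates:
        v = encode(cand)
        scored.append((float(q @ v / np.linalg.norm(v)), cand))
    return sorted(scored, reverse=True)

code_blocks = [
    "def add(a, b):\n    return a + b",
    "def mean(xs):\n    return sum(xs) / len(xs)",
    "def read_file(path):\n    return open(path).read()",
]
# In a real system the candidate embeddings would be computed once, offline,
# and stored in an index; only the query is embedded at query time.
# (With the random placeholder encoder the ranking is arbitrary; with real
# embeddings the mean() function should score highest for this query.)
print(rank("compute the average of a list of numbers", code_blocks)[0])
```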
What I found interesting was that even the text embedding model, especially if you look at the Python results, is actually quite good at this. There's probably just a lot of code in internet text, and the model has picked up Python purely from large-scale unsupervised learning. It was quite cool that the text embedding model was quite decent for code.

Cool. So we've discussed these embedding models, a high-level view of how to train them, and a lot of experimental results. We also have these models in the OpenAI API now: you can call the API and get back vector representations for text and code (a small sketch of calling the endpoint is included below). And this is being used quite a lot. Here's an example from one of our customers who uses the embeddings endpoint. With the previous embeddings they were using, clusters would contain things you don't want grouped together. What's shown here is the contents of a single cluster: you take the embedding of a lot of text pieces and then you cluster them. You can see that with the previous embeddings, issues with the app and good things about the app end up in the same cluster, while with our embeddings the customer was able to get better clustering output. So that's an example of the older embeddings not working well and the new ones working better.

The second example is zero-shot classification: given a text data point, assign a label to it. On this plot the Y axis is accuracy and the X axis is the number of attempts: you take the top-K predictions, and if any one of them is right, the model gets credit. The best performing are subject matter experts, that is, trained human editors. The next four lines are the four different API models I talked about, and the ones below that are embeddings from previous work. So our model does this task pretty well too. I think it's quite cool that an unsupervised model trained on lots of internet text does really well on many academic datasets, and now the same model is doing well on downstream applications in the real world. I think this is kind of a paradigm shift for machine learning: moving from task-specific supervised data to relying almost entirely on unsupervised learning.
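Here's a hedged sketch of using the embeddings endpoint for that kind of zero-shot classification: embed the text and a short description of each label, then pick the label whose embedding is closest by cosine similarity. It assumes the current `openai` Python client with an API key in the environment; the model name is a placeholder, and this label-embedding approach is one simple way to do zero-shot classification, not necessarily the exact setup behind the plot.

```python
import numpy as np
from openai import OpenAI  # assumes the official `openai` Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts, model="text-embedding-3-small"):
    """Call the embeddings endpoint; the model name here is a placeholder choice."""
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in resp.data])

def zero_shot_classify(text, label_descriptions):
    """Pick the label whose description embedding is closest (by cosine) to the text embedding."""
    vecs = embed([text] + label_descriptions)
    t, labels = vecs[0], vecs[1:]
    sims = labels @ t / (np.linalg.norm(labels, axis=1) * np.linalg.norm(t))
    return label_descriptions[int(np.argmax(sims))]

print(zero_shot_classify(
    "The battery dies after an hour of use.",
    ["a positive review of the app", "a negative review of the app"],
))
```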
Cool. So we talked about contrastive models, how they're different from generative models, how to train them, the importance of batch size, the high-level training recipe, and the comparison with previous work. Then I talked about experiments on text classification, text search, and code search, and we also talked about how these models are available in the API. So to conclude: we have an extremely simple recipe to train these embedding models. They are unsupervised models with a very broad range of capabilities, trained just on internet data, and the same model performs really well on a very broad set of tasks. The models are available in our API, so if you're interested, you can play with them there. That's all I have. Thank you for listening.

Thank you so much, Arvind.