So, I believe this might be one of the more technical talks of the conference, and I hope you're in for it. I'll also try to make it short, given that this is the last talk of the conference. The original title was a bit different: retrieval augmentation, and then semantic search at scale. Chronologically that makes more sense, since semantic search was the earlier problem, and now we all talk about RAG. I'm not going to define any of those terms; you've probably heard 30 talks about this over the course of two days. If I were to summarize everything, in case you really need to rush to Coltrane or just leave at this point, the short version is: make search faster and LLMs smaller. That's the end goal. But the way we get there is a bit different from most other companies.

I run a project called Unum. I've spent a lot of years on it, and I have a couple of teams working with me, designing some of the fastest libraries for data processing. Those libraries are not super well known, but at the same time they are some of the most deployed libraries out there, and the reason they're not well known is that they're generally used to power other people's or other companies' projects. We are the people who write the assembly, whether for x86 or Arm, or the C, or the C++, or the CUDA itself, and then it goes into databases or cloud providers. But if you really want to understand how all of this works, or you want to avoid paying all the extra costs for the services on top, welcome.

The core ideas we're going to cover: first and foremost, retrieval-augmented generation is one of the approaches to improving the quality of model output. It's not the only one; you can obviously fine-tune your model for custom output, or you can use a model with a much longer context window. Retrieval is cheaper, and it's also far more scalable. In my case, I generally work with projects where at least tens of billions of vectors fit into one machine, which is generally a few orders of magnitude more than most companies will have across an entire cluster. As for vector search, indexes and databases are predominantly used at scales below one million entries, and I would say that at that level you probably don't need any of them. You just need to do the math really fast: even on a single CPU core, you can do the search in less than 100 milliseconds. And if you're really serious about scale, same as in my title, then you probably just need the raw index, because using a database at billion or trillion scale can be very expensive. And then, on the other side, I'll show some of the interesting models that you can tune to build a mixture-of-experts system.

The underlying story behind my company, this project, and the eight years of work is the idea that modern hardware is far more capable than modern software. If you buy a decent GPU, you end up with well over a petaflop of computational power, especially in dense matrix-matrix multiplications, or under structured sparsity where up to half of the cells are zeros. Meanwhile, using modern software still sucks: typing a character can take over 100 milliseconds, and there's an amazing, quite popular blog post by Dan Luu that shows how computer latency has changed over the last 50 years. If you're too bored to read it, the sneak peek is that it's not getting better.
The amazing quote I really love from that blog post (sorry for reading it from the slide) is: it's a bit absurd that a modern gaming machine running at 4,000 times the speed of an Apple II from 1977, with a CPU that has 500,000 times as many transistors, can maybe manage the same latency as an Apple II in very carefully coded applications, if we have a monitor with nearly three times the refresh rate. Which is kind of insane.

So let's uncover the first part of the story, the brute-force part of search. When you have a lot of vectors, the similarity metric people typically use to compare those vectors is cosine similarity, also called the angular distance or angular similarity. It's essentially the dot product of two vectors normalized by their lengths, so you're only comparing the angle between them, not the magnitude. In idiomatic Python you would write it like this; you can also write it in a slightly more efficient manner, with just one for loop iterating over the two arrays simultaneously, and the math equation for it is about as simple as it gets (there's a sketch of roughly what was on that slide a bit further down). The solution you just saw is really slow, though; a better one is on the slide.

If you open up the specs of most modern CPUs, in this case an Intel Xeon Sapphire Rapids CPU with 52 cores and 104 threads, you'll see a bunch of cool features. In the bottom left corner there's a features table, and it lists all the accomplishments semiconductor companies like Intel have made over the last two decades. Most of the software you run on your computer uses none of them; at best, compilers generate the appropriate instructions about 1% of the time, from the research I've done. In this specific case, to compute the cosine distance between two vectors on this kind of machine, I used AVX-512, specifically the VNNI subset of extensions, which lets you compute the dot product in mixed precision: taking 8-bit integers, multiplying them as 16-bit integers, and accumulating into 32-bit integers. Modern CPUs have these very specialized, tailored instructions that allow people like us to implement search at a very large scale.

On the right side you see a snippet from one of my blog posts, one of six I wrote over the last two months, on how search can be optimized and done efficiently. It's annotated, not pure assembly, so you can see my comments in it. Supplementary to the assembly code in the article, I also provided a few benchmarks, comparing a library implemented with all those nifty features to SciPy and NumPy. SciPy and NumPy are some of the most widely used Python libraries out there. They're not Python libraries in the pure sense, though, because NumPy is mostly a Pythonic wrapper around BLAS, the Basic Linear Algebra Subprograms, which are essentially some of the most optimized libraries in history. So numpy.inner or numpy.dot, the function calls that compute the dot product between two vectors, just redirect down to BLAS, and BLAS is a C or assembly library; the original standard was implemented in Fortran decades ago.
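Since the slides aren't visible here, below is a minimal sketch of what those Python versions look like, plus the brute-force "no index" scan mentioned earlier: cosine similarity as a single loop, the NumPy one-liner that forwards to BLAS, and a top-k scan over a whole matrix. The function names are mine, not from the slide, and the simsimd call at the end is left as a comment because the exact Python API of that package is an assumption on my part.

```python
import numpy as np

# cos(a, b) = (a . b) / (|a| * |b|)
def cosine_similarity(a, b):
    """Pure-Python version: one loop iterating over both arrays at once."""
    dot = norm_a = norm_b = 0.0
    for x, y in zip(a, b):
        dot += x * y
        norm_a += x * x
        norm_b += y * y
    return dot / ((norm_a ** 0.5) * (norm_b ** 0.5))

def cosine_similarity_np(a, b):
    """NumPy version: the dot products are dispatched down to BLAS."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def top_k(query, matrix, k=10):
    """Brute-force 'no index' search: score the query against every row, keep the top-k."""
    scores = (matrix @ query) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    best = np.argpartition(-scores, k)[:k]
    return best[np.argsort(-scores[best])]

# SIMD-accelerated kernel, assuming the `simsimd` package exposes `cosine`
# (which returns a distance, i.e. 1 - similarity):
#   import simsimd
#   distance = simsimd.cosine(a.astype(np.float32), b.astype(np.float32))
```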
But even comparing that non-pure-Python path, the actual C, BLAS, and assembly code underneath NumPy, to the variant I just showed with my assembly, we get the performance difference shown in the bottom right corner: 18x on an Apple M2 chip, and in the table on the left, 10x on an Intel CPU, where the comparison is against MKL, Intel's Math Kernel Library. The same applies to every kind of distance metric out there; for cosine similarity it gets as high as 189x on Apple, third row over there. All of this is available for free, for commercial use, under the Apache 2.0 license. Most likely you are already using it without knowing it, because some of the big cloud providers rely on these libraries to implement the math efficiently. Needless to say, the same applies to databases.

And if you have 1 million vectors, going back to this slide, SimSIMD is capable of doing up to 17 million operations per second. So if your data set is 1 million entries, it takes roughly 70 milliseconds to compute all the distances on one CPU core. Most CPUs these days have a lot more than one core, so the latency of computing against the entire data set will probably be lower than just the network call to some other index structure or vector database.

But what if you are actually beyond 1 million entries, or you want to get there and be future-proof? For that, you most likely need an approximate search solution, and approximate search solutions come in different shapes and sizes; I've implemented at least 15 of them. The family of algorithms I like the most are the proximity-graph-based ones, known as HNSW. They are widely benchmarked, and the chart you see is familiar to anyone working in search: it's a plot produced by ANN-Benchmarks, where ANN stands for Approximate Nearest Neighbors. There is a bigger version of this called Big-ANN. Honestly, both of those benchmarks are fairly small: the ANN-Benchmarks data sets are generally under 1 million entries, they run search on just a single CPU core, and the vectors are not very representative of modern vector sizes. Big-ANN is a competition; last year I think it was a billion entries, and this year they shrank it down to just a couple of tens of millions. In those benchmarks you generally produce curves and align the curves of different engines against each other, to see how the ratio of recall versus queries per second compares: at the same level of recall, how fast they work.

In reality, a big part of those solutions are based on HNSW, Hierarchical Navigable Small World graph algorithms. These algorithms build proximity graphs, multiple layers of them, and during lookups they traverse the smallest graph first, find the most appropriate entry point, and then dive deeper, repeating the process recursively until they reach the bottom layer. Because this is a graph algorithm, when you start benchmarking it with default Linux tools, the output looks identical to any other graph-processing library: 94% of the CPU cycles are wasted idle, with the CPU just waiting for memory to be fetched. Yes, we often think memory is fast, but it's not; accessing memory can cost a couple hundred CPU cycles. That's cheap compared to an SSD, but compared to CPU caches it's super expensive. And then somewhere in the middle, third row from the bottom, you'll see that 2.4% of branches were mispredicted.
2.4% may not sound like much, but it's a lot, because the CPU's execution pipeline is something like 20 to 30 operations deep. Due to speculative execution, when a branch comes up the CPU picks the path it thinks is more likely, and when it mispredicts, up to 20 in-flight operations have to be invalidated, which is quite a lot.

Aside from those common bottlenecks, and as part of highlighting how important the memory efficiency of your systems is, when I started looking into the open-source vector search libraries, I noticed a lot of the problems common to almost every piece of object-oriented software I've seen in my life. I've been writing C and C++ for the last 20 years or so, and those languages, including the object-oriented ones, are super powerful if you use them the right way. But when performance is really critical, the abstractions common to object-oriented programs are quite expensive. The three snippets on the slide are from two different vector search engines, two of the most widely used ones, and on every single function call, on every single search, on every single layer of the graph, they allocate multiple dynamic structures, multiple priority queues, with very irregular memory allocation patterns. Every one of those memory allocations is potentially tens of thousands of CPU cycles wasted.

So I wanted to do something nicer. This year I open-sourced a vector search library that doesn't have these problems, and I compared it to FAISS, which is Facebook's vector search library. On very small data sets the problem is not obvious: up to a million entries, FAISS may actually work faster, but that's exactly the scale where you don't need the index. The index matters when the number of entries becomes large, and in the case of USearch, after about 30 million entries the gap in performance can be tens or hundreds of times. It's the difference between indexing a data set within a day or within three months, which is a huge gap. You don't want to pay for another three months of DGX costs, because at current AWS rates that would probably be around 150 grand, certainly over 100 grand. Same for search: even at 100 million entries we sustain about 100,000 operations per second, and that's not the limit, you can actually do more. In a series of articles I compared our system to some of the commercial solutions that are being very actively promoted in the ecosystem. I won't put the name here, but you can probably guess which one. Our solution might be tens, hundreds, maybe 1,000 times more cost-efficient; it's really hard to tell when the gap is that big.

But aside from the benefits for business, the part that I'm personally passionate about is the benefit for science. One of the most exciting side projects I had this year: I spent some time and some compute resources to build the largest public data set in chemistry. I indexed 7 billion molecules, produced 28 billion representations, and then we invited AWS to participate as well. So now AWS hosts the largest data set of embeddings ever produced, at least the largest public one, with 28 billion embeddings. Feel free to use it; it's free for anyone, free for all the scientists around the world. The library itself is also available under the Apache 2.0 license, and out of the open-source databases it's currently used by a bunch of different companies like ClickHouse.
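For context, here is a minimal sketch of what using USearch from Python looks like, assuming the usearch package and its Index API roughly as the documentation describes them; the dimensionality, the dtype, and the attribute names on the result object are illustrative, so check the package docs rather than taking them verbatim.

```python
import numpy as np
from usearch.index import Index  # pip install usearch

ndim = 256
index = Index(
    ndim=ndim,     # vector dimensionality
    metric="cos",  # cosine / angular distance
    dtype="f16",   # store vectors in half precision to halve the memory footprint
)

# Insert a batch of vectors under integer keys.
vectors = np.random.rand(100_000, ndim).astype(np.float32)
keys = np.arange(len(vectors))
index.add(keys, vectors)

# Approximate top-10 lookup: the layered proximity graph is traversed top-down.
matches = index.search(vectors[0], 10)
print(matches.keys, matches.distances)
```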
Beyond the databases, it's used by different cloud providers, and last week Google published a really cool open-source project that uses this library internally as a dependency. One of the cool features it has: you can also JIT-compile the metrics. So if you have custom vectors assembled from different models, you can have a very custom ranking system that you define in code. You don't need a DSL, a domain-specific language, or a custom JSON query like Mongo's or any other system's. You just write the code, it gets compiled somewhere underneath, and it's executed with all those nice assembly instructions available.

The last, but not least, part is smaller generative models. When you train a model, at least a few years ago, people would train it with single-precision floating-point numbers: float32, 32 bits, four bytes. When some of the first large transformers appeared, their papers suggested the models were actually trained in 16-bit precision, which just a couple of years ago wasn't really true, or was true only in part. The reason is that most frameworks, at least the frameworks at the time, didn't have good support for optimizers in anything other than single-precision floating-point numbers. So everything related to quantization and its availability in software is literally happening as we speak.

Today, at least from a systems engineer's perspective, and one who loves implementing AI models in assembly, the way I see the ecosystem is that 16-bit floating-point numbers are kind of the default. If you use PyTorch or any other major framework, 16-bit floats are supported everywhere. Outside of AI, though, most other places don't support 16-bit floats. I noticed this recently: USearch, the vector search engine, has bindings for 10 programming languages, and many of them I implemented myself. Every time you implement a binding for a new programming language, you discover some of the limitations of that ecosystem. Neither in C++, nor in C, nor in Java, nor in almost any other language are half-precision floating-point numbers really standardized, and compilers are also very bad at generating code for them.

Then we get to 8-bit numbers. 8-bit is available in the form of integers and has been for the last 40 years; int8 is one of the most common data types everywhere. But on the GPUs, where a big part of inference is happening, the evolution went in a different direction. On CPUs we started with small integers and slowly got to bigger integers and then to floating-point numbers; with GPUs the evolution is going exactly the opposite way. On NVIDIA GPUs, both FP8, which is the 8-bit floating-point format, and int8 are currently supported on their recent generations; the last three years of GPUs support them. On the CPU side, yes, int8 is supported everywhere, but the question is whether it's supported for matrix multiplications. Most modern CPUs are starting to introduce specialized assembly instructions and specialized pieces of hardware, essentially a part of your silicon dedicated just to tiled matrix multiplication, and having int8 support on the CPU is not the same thing as having int8 support within those tiles. Intel CPUs have only just started getting those. 4-bit and 2-bit are not available on the CPU side yet, and even on the GPU side you have to be quite hacky and involved to get your model down to 2 bits without losing like 90% of your accuracy.
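Coming back to the int8 point for a second, here is a tiny NumPy sketch of why accumulation width matters, and why VNNI-style instructions multiply 8-bit values into wider intermediates and accumulate in 32 bits. The vector length and values are arbitrary; the point is just that keeping everything in 8 bits silently overflows.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=1536, dtype=np.int8)
b = rng.integers(-128, 128, size=1536, dtype=np.int8)

# Naive: products and the running sum stay in int8 and wrap around silently.
wrong = np.sum(a * b, dtype=np.int8)

# What VNNI-style hardware effectively does: widen, then accumulate in 32 bits.
right = np.dot(a.astype(np.int32), b.astype(np.int32))

print(wrong, right)  # the int8 accumulation is garbage, the int32 one is exact
```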
The next question is: assuming different levels of quantization, from 16-bit to 8-bit down to 4- and 2-bit, what kind of models can you use, and how big can they be? The numbers here are not exact, and sometimes you'll see repetition. That comes from the fact that in AI research and pre-training we don't train every possible size; we have certain thresholds that were preferred at a specific point in time, like 1.5 billion parameter models, 7 billion, 13 billion, 70 billion. We don't really train 24 billion or 50 billion parameter models; that's just uncommon. So out of the ones that perform well on current leaderboards, these are the sizes you can expect to fit on modern GPUs. H100s and A100s are the data center GPUs, available in the SXM form factor as well; those are the most expensive ones, with 80 gigabytes of HBM, high-bandwidth memory. RTX 3060s, 3070s, 4060s, the ones that are quite common in laptops, only have around eight gigabytes of VRAM, so you can only fit models that are generally around three billion parameters, sometimes seven billion if you're really lucky. I would also keep in mind the numbers for the T4 GPU, which is fairly cost-efficient in the cloud, and the A10, its newer cost-efficient counterpart. The arithmetic behind what fits where is sketched a bit further down.

On this slide we compare the inference engines, because once you go this way and quantize your model, the question is which piece of software can run it. There's always the option of vanilla PyTorch, vanilla TensorFlow, vanilla JAX; those are the frameworks most often used for training. But when you want to do inference, there are a few other, less known options as well. One of them is GGML, which is more on the library side of things; that's not strictly true, but when I look at it, it seems more like a mixed-precision math library than an AI framework. Then there is a set of AI frameworks, where the most trending one these days would be tinygrad. And out of the compilers, there's Modular, the company well known for the Mojo programming language, MLIR, and their relationship to LLVM. In the bottom two rows I've shown what those projects are known for: GGML is mostly known for llama.cpp and whisper.cpp, quite popular projects on GitHub if you want to run inference on some of those models on your laptop; tinygrad is mostly famous for the company that backs it, comma.ai, and the openpilot self-driving system they're designing; and Modular, again, is well known for Mojo, the programming language. Using those will give you some advantage, though generally the gap is not too big; between frameworks like these the performance difference can be something like 2x or 3x, which can be crucial for some businesses but is not always important to tinkerers.

This is a chart I found on Databricks's website, by Mosaic, and it shows how the throughput differs between different sets of GPUs. We may not be in a position to dive too deep into it, but there is a point I want to make in this section: small models perform surprisingly well if you tune them the same way the big models are tuned, in a reinforcement learning setup, as well as with DPO and some of the other modern approaches.
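Circling back to that sizing question, the napkin math is just parameters times bits per weight, divided by eight, plus some headroom for the KV cache and activations. This is my own rough sketch; in particular, the 20% headroom is a placeholder assumption, not a number from the slides.

```python
def model_memory_gb(params_billion, bits_per_weight, overhead=0.20):
    """Rough memory estimate: weights plus a placeholder for KV cache and activations."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

for params in (1.5, 7, 13, 70):
    for bits in (16, 8, 4):
        print(f"{params:>4}B @ {bits:>2}-bit ~ {model_memory_gb(params, bits):6.1f} GB")

# Roughly: a 7B model is about 14 GB of raw weights at 16-bit (no chance on an
# 8 GB laptop GPU), about 7 GB at 8-bit, about 3.5 GB at 4-bit; a 70B model
# needs an 80 GB H100/A100 even at 8-bit.
```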
And small models really do hold up. If you go to LMSYS, the well-known online arena that compares the performance of different AI models and assigns them Elo ratings, similar to chess players, you'll see that among the best-performing models there are Starling-7B and OpenChat 3.5. Both of those are 7 billion parameter models, and in some cases they outperform some of the largest, most expensive models out there, including GPT-4 and Claude 2. This doesn't include GPT-4 Turbo, but Starling-7B outperforms Claude on MT-Bench; sorry, it does not outperform it on MMLU, but there it's surprisingly close. MMLU is a very broad set of benchmarks, more than 50 of them, that compares the language understanding of different models across dozens of different tasks.

If we dive into MMLU specifically, there are two families of models that I personally love, and we use them a lot in my company and in some of the projects we're building. One is Microsoft's Phi: they previously released the Phi-1 and Phi-1.5 models, and yesterday they released Phi-2. Luckily I didn't prepare the slides until today, otherwise the content would have been outdated. And as always, our world is not unipolar and progress propagates through it quite fast: there is also an interesting model from Alibaba, Qwen, at 1.8 billion parameters. So neither of those is even a seven billion parameter model: Phi-1.5 and Qwen-1.8B run at around one and a half billion parameters, and Phi-2 is closer to three billion. There will soon be a model by Unum, by my company, in the same range: super easy to run, and you can also run it on mobile devices. USearch right now might even be installed on your mobile device, I'm not sure. But there are very few models that can fit on mobile devices, and the models we will hopefully release in the generative space will be of that nature.

Previously, last year, we released a family of models for search. We're not going to cover this topic deeply, but when you do retrieval augmentation and you have a model that you feed the retrieved information into, one question that arises is: where do you get your embeddings? This question may not be as crucial as it seems if your generative model itself is a billion parameters or so; you can just take the hidden state of the generative model and use it as the embedding for search. That's actually a surprisingly good approach, but fairly uncommon. If you just need a very lightweight, efficient model for the embeddings themselves, there are a few models that I like and use extensively. One of them is by Microsoft again, who have open-sourced a lot of really high-quality models; it's called E5. It's available in different sizes, small and big, English-oriented and multilingual.

But even though there is no shortage of models for the textual setting, there are very, very few models that are multimodal. If you need embeddings that map text and images, for example, into the same vector space, you are generally out of luck. The only common model in that space is OpenCLIP, and it's fairly beefy and quite outdated, a few years old. About a year ago we published our model under the title "Beating OpenAI CLIP with 100x less data and compute," which is only partly true, because we don't really beat the OpenAI CLIP on all workloads. But frankly, we don't care about all workloads; we only care about search.
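As a concrete example of the lightweight text-embedding route, here is a minimal sketch of using an E5 checkpoint through the sentence-transformers library. The model id and the "query:"/"passage:" prefixes are how I recall the E5 release describing its usage; treat them as assumptions and double-check the model card.

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# A small English E5 checkpoint; swap in a multilingual variant if you need one.
model = SentenceTransformer("intfloat/e5-small-v2")

# E5 expects asymmetric prefixes: "query: " for questions, "passage: " for documents.
docs = [
    "passage: HNSW builds layered proximity graphs and searches them top-down.",
    "passage: BLAS is a decades-old standard for dense linear algebra routines.",
]
query = "query: how does approximate nearest neighbor search work?"

doc_vectors = model.encode(docs, normalize_embeddings=True)
query_vector = model.encode(query, normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
print(doc_vectors @ query_vector)
```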
Coming back to that CLIP comparison: the title is also not entirely fair on the compute side, because it's not 100 times less compute, it was 300 times less compute. So our models, aside from being smaller and demanding fewer compute resources to train, produce embeddings that are three times smaller in terms of their dimensionality than BERT's and CLIP's, obviously: only 256 dimensions. So you can build search systems that scale properly. And because we have an international background and an international team, we wanted every language to be equally represented. Most other companies work either on the indexing side, or the embedding side, or the dataset curation side, without seeing the whole picture. But once you dive into a dataset such as Common Crawl, which is widely used for those LLM pre-training tasks, you notice that Common Crawl is about 300 terabytes of data, 52% of it is English, 6.4% is Russian, and every other language is successively smaller. So if you want your customers to be satisfied regardless of where they're coming from, be it China, Korea, Armenia, Israel, France, Germany, Portugal, Thailand, any country, any language, you should look into those custom models that are trained on much more balanced datasets. And in our case, we also want to be diverse in more ways than one: aside from CPUs and GPUs, we added Graphcore IPUs to the mix, benchmarked those, and added a few other uncommon hardware vendors to the list.

With this, I'm going to end. Here are some of the projects that I really love tinkering with. Some of them are my own, so feel free to open pull requests; I generally reply within a day. A few of them are by Meta and Microsoft, also open source, under open licenses, so definitely worth trying out. Thank you. Any questions?

[In response to an audience question] I think you can get a lot done with less than a billion parameters. I think we will be slightly bigger than a billion, at least for the next checkpoint that will come out. But yeah, I'm pretty sure that we don't need this many parameters. As a society, as an AI research community as a whole, we are lazy and somewhat stupid, so we compensate for our inability to innovate with bigger models. It's a decent approach; I'm doing it myself. Anyways, everyone wants to be the first, everyone wants to have the best model. But I'm pretty sure that very small models can be as efficient as the big ones, as accurate as the big ones. Yeah. And you don't want to have a DGX running for every wearable device connected to the internet; that's a very easy way to run out of energy. And GPUs. Any more questions? Okay, feel free to grab me after this. Thank you.