Hello, everyone. Can you hear me all right? Thanks for coming. I'm Montana. I'm the co-founder and CEO of PostgresML, and I'm here to talk to you today about doing machine learning inside the database: some context on why I believe the database is the right place to push this computation, and why you should add GPUs to an already overloaded instance, potentially the hottest, most resource-contended place in all of your infrastructure.

I'll tell you the story of what I've been doing for the last 20 years in software engineering, machine learning, natural language processing, and company building. I open sourced Instacart's machine learning platform, called Lore, about six years ago. We were running deep learning models in production on TensorFlow 0.4. Maybe not the best idea, but we learned a lot, and a lot of those learnings from those early days have informed how I build machine learning infrastructure now. Many of them came from the scalability issues we hit when we went into COVID in 2020. We were already a multi-billion dollar revenue company, and we started doubling on a weekly basis in terms of traffic. You can imagine that every infrastructure resource we had started straining, and our machine learning infrastructure, which I had helped build out a lot of, was one of the worst offenders.

Naturally, I'm a perfectionist. I like systems to be optimal; I think that's why machine learning and software engineering appeal to me personally. But I've also learned a healthy dose of pragmatism: you need to ship solutions as quickly as possible, with the least effort possible, to get close to that ideal end state, and then you can iterate over time. In software engineering and machine learning especially, these things are never done. They're evolving systems, living systems, and the maintainability and manageability of these systems is critical for success.

Just to give you the meat of the presentation upfront: are we totally crazy for doing machine learning inside the database? Is Postgres fast? Is a relational database capable of any of these workloads? If you actually compare the end-to-end performance of a PostgresML system against what I would call a typical Python or TypeScript microservices architecture, just calling out to OpenAI to generate your embeddings for your RAG application at query time, when you're going to generate that chat prompt, it can be 10 times faster if you do it inside the database, with an open source model that you control. You don't have to wait on your OpenAI quota limits, and you don't have to give up control over your prompt quality. And these open source models are actually higher quality: they've been beating OpenAI's text-embedding-ada-002 for the last six months, if you look at all of the quality rankings for these models.

Similarly, look at industry leaders like Hugging Face for inference and Pinecone for a vector DB. When you want to generate your embedding with a Hugging Face endpoint call and then fetch a bunch of nearest neighbors from a Pinecone database, you're making two remote, cross-data-center network calls, and if you're doing that over and over in your application, it's significantly slower. It's funny to me when people put DB in the name of their product and they're not actually a database. We've seen several of these lately, and they are sort of thin wrappers around databases.
But overall, Python microservices have these problems even when they're built on top of highly specialized databases, whether that's Redis, which is an incredibly respectable key value store, or anything else. I have a lot of love for Redis and what they do. I have a lot of love for Cassandra and Memcached and all of these other databases, and we were running all of them as various feature stores at Instacart when COVID hit and our growth started exploding. Inevitably, what we found is that most people can just deploy Redis, but few people know whether or not they're running Redis in a persistent, safe failover mode, or what's going to happen when their Redis cluster actually hits 100% CPU usage or 100% RAM utilization. The short answer is it will crash. You will lose all of your data. You will have to backfill your data from your primary Postgres instance that was actually the authoritative backing store, or Druid or Snowflake or whatever, and that service will then be down for however many hours or potentially days it takes you to refetch and reload that data.

So one thing we learned is that just because you can stand up a database does not mean you can operate a database, especially under duress. That's one of the important things I think about these days when building large scale systems: do we actually have the engineering resources on our team to manage more than one database? Do we have the resources to manage any databases at all? This is a pretty common thing to outsource. There's a reason Amazon RDS is what it is. It's very common to do this.

But for machine learning applications in particular, I think there are a few considerations we need to make when we choose a database. Obviously you want high client concurrency, which really just means you're going to need horizontal scalability at the outset. Cassandra is a standout example of horizontal scalability for everybody at this conference, which means ultimately you do need to consider sharding your data. A lot of people think sharding is hard. It's really not, especially if you understand your data model. Most people have a user ID in their applications. You can generally shard by user ID, and that's the only consideration; that's the end of the discussion. It's not always true. There are follow-up considerations, like what you do with cross-shard joins, or how you replicate data in your cluster. Those are thorny problems that you can solve at scale once you're making a billion dollars in revenue and can hire the engineering team to solve them. They're not necessarily things you need to consider upfront.

But you do want millisecond read times for any interactive application. Data warehouses like Snowflake and Redshift are off the table because of this. Those systems generally can't support high concurrency, and they generally can't support low latency. And finally, if you really want to do online machine learning, you need session level data streaming to your primary feature store. Again, this takes data warehouses off the table for the most part, because they suffer really bad performance under incremental writes. So almost certainly you want to be in the OLTP world, not the OLAP world: a transactional database, not an analytical database, which is backwards for a lot of organizations. A lot of organizations originally institute machine learning under the data science team.
The data science team is the one that has created this analytical database, this data warehouse setup, and then they're tasked with, hey, can we bring ML online into the application? A lot of chaos and mayhem ensues as they try to actually make those things fast, because it's just not what those systems are designed to do. If you're doing machine learning and somebody ever asks you to build a feature store, all they're asking for is a database. For a lot of them, all they need is key value access, so Memcached is fine, Redis is fine, as long as you know what happens in a failure scenario. But more recently, people want these hybrid search apps so that they can do keyword search and they can do vector search. People like JSON document filtering and metadata access, so having a document type with some extended capabilities is nice. That will make your application developers' lives a lot easier, and it'll make your applications a lot more efficient: the more data manipulation, transformation, and filtering that people can do in the data layer without having to pull all those documents back to the application layer, the better. And finally, it is actually pretty nice if you can generate embeddings in the database. It is really nice if you can run LLMs in the database. There are only a couple of databases I know of that can do this so far, Postgres and Elasticsearch, and I'll talk about that a little bit later. That's really the whole point of PostgresML: adding these capabilities to Postgres.

Finally, I think I've touched on what it means to be a proven database. There are a lot of brand new vector databases, much newer than Redis, much newer than Cassandra, much newer than Postgres. I think people are currently learning a lot of operational lessons about data loss with those kinds of databases, and it's gonna be very interesting to see how that ecosystem evolves as a single index type gets added to every other database that has already been proven out at scale in the engineering community.

So I've made a quick grid here. All of these grids that you'll ever see are kind of rigged by the creator or the presenter. You can ignore the right-hand side of this grid; that's your vector, ML, embedding, and LLM operations. We've specifically added those operations inside Postgres, so it's taken the lead there. But we could have added those to Elasticsearch. We could have added those to Cassandra. These are all open source databases, and I'm a big proponent of open source. I'll get to the reason why I chose Postgres specifically to add this functionality to, because I believed it was the best platform for a machine learning database. The answer to that really is the two left-hand columns. If you're a database person at a database conference, I think these are the most important considerations for all of you long term, much more so than vectors. Like, does your database do vector search? Every database is gonna do vector search in the next year. I mean, I say every; there are gonna be like three or four holdouts out of the 400 on the market that don't implement it, because they're not continuing development or something. But I even think that most people will eventually do embedded ML. Most people will do embedded embeddings. Most people will do embedded LLMs. It makes a lot of sense when we get down into it. But which databases support sharding well? Which databases support joins? These two questions become really important for machine learning workloads.
To explain that a little bit, I'll create a toy example, and this is an example that's actually at the heart of Instacart's business model. It's at the heart of Amazon's business model. It's at the heart of Shopify's business model. You have a bunch of retailers. You have a bunch of products that are sold at various retailers, and we call that an offer. So an offer relates to a product, it relates to a retailer, it has a specific price, and it has some other metadata, like how many are in stock at that particular retailer. You keep a count there, you decrement it every time you sell one, and then you mark it out of stock, et cetera, et cetera. This is a normal form of a relational data model. It's an interesting, non-trivial data model. You'll notice that the product has some nested JSON document as its machine learning features. Those are actually the problem. They become a really big problem, and I'll get into why in a bit.

But we can compare that normalized form to this denormalized form. This is the NoSQL form of the same schema. What you do is you copy the product and you copy the retailer into the offer, and that way you don't need to join when you're reading the data. You can just select the offer by ID. You can get the retailer information from that offer. You can get the product information from that offer. You can get the machine learning features from that offer. All of this nested data is right there. You don't have to worry about the cost of joins anymore, and that's great because, A, it saves you compute at read time, but B, it makes it easier to shard. Joins make sharding hard. So this is actually really cool, and this is why NoSQL is better than SQL.

But there's one problem here, which you'll notice if you look at what those machine learning features are. Machine learning features are typically statistics that are recomputed periodically, perhaps after every single purchase of a product: I need to recompute what search terms converted for that product. In this case, the product is Coca-Cola, and maybe somebody searched for Coke and then they clicked on this product and bought it. So now the conversion rate for the keyword Coke on this product has gone up some fraction of a percent, and I have to actually go update the global conversion rate and the term conversion rates for all of the retailers that sell Coke. Now remember, this isn't one document in normal form anymore. This is every single offer of Coke. At Instacart, for example, there may be 100,000 different retailers selling Coca-Cola with this same offer, and if that one checkout happens, that one purchase happens, we have to go update 100,000 copies of the search term statistics for Coca-Cola, because they've been denormalized into our Cassandra or our Elasticsearch clusters, where we were originally keeping data like this. Those technologies are amazing. They scale incredibly well. They scale incredibly far, up to the point where your CFO eventually notices, or your board, or, in the case of Instacart's S-1 IPO prospectus, all of the public markets notice how much you're spending on IT infrastructure. That's not a good thing. So if you wanna do real-time machine learning like this, you have to normalize your machine learning data. You have to pull it out of this nested form and you have to join to it at read time. It's really the only sustainable, scalable way to keep this up to date.
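As a rough sketch of what that normalized, join-at-read-time shape could look like, here's a minimal version in plain SQL. The table and column names are illustrative stand-ins, not Instacart's actual schema:

```sql
-- Normalized form: ML feature statistics live in exactly one place per product,
-- instead of being copied into every offer. (Illustrative schema, not a real one.)
CREATE TABLE products (
    id          BIGINT PRIMARY KEY,
    name        TEXT NOT NULL,
    -- e.g. {"term_conversion": {"coke": 0.031}, "global_conversion": 0.012}
    ml_features JSONB NOT NULL DEFAULT '{}'
);

CREATE TABLE retailers (
    id   BIGINT PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE offers (
    id          BIGINT PRIMARY KEY,
    product_id  BIGINT NOT NULL REFERENCES products(id),
    retailer_id BIGINT NOT NULL REFERENCES retailers(id),
    price       NUMERIC NOT NULL,
    in_stock    INTEGER NOT NULL DEFAULT 0
);

-- A purchase updates one row, not 100,000 denormalized copies.
UPDATE products
SET ml_features = jsonb_set(ml_features, '{term_conversion,coke}', '0.032')
WHERE id = 42;

-- Read time: join to pick up the freshest features along with the offer.
SELECT o.id, o.price, r.name AS retailer, p.name AS product, p.ml_features
FROM offers o
JOIN products p ON p.id = o.product_id
JOIN retailers r ON r.id = o.retailer_id
WHERE o.id = 12345;
```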
Otherwise, what you'll end up doing, and what we ended up doing as a compromise before we had a better way of doing things, is telling the machine learning engineers: you know what? We're not gonna update statistics in real time. We're gonna have a weekly batch job, we're just gonna recompute the statistics once a week, and you're just gonna deal with stale data. And then they're gonna say, well, every time we add a new product to the shelves, it takes a week for us to get any statistical information on it, and you say, well, okay, we just won't sell that product for the first week. And then some product manager says, well, this is on special, it's only offered for one week a year, and it's now out of season. There are lots of compromises that get made to keep machine learning models running in production, over and over again, and all of those inefficiencies really become business inefficiencies: they either cost you top-line opportunity or they impact your bottom line through data infrastructure costs. And those conversations are really unpleasant to have. If you're the data scientist advocating for your model, you're saying, I need fresh data. I can increase the business value of this data to the company if you can just get me this data sooner. I've worked hard doing all this research. That's really frustrating. At the same time, if you're the machine learning engineer, you're saying, well, our Cassandra cluster is costing $100,000 a month, we have a directive from the CFO to eliminate usage, and you are the biggest consumer. So that makes nobody happy, and finding ways around that is really important.

If you think I'm exaggerating, and if you think that the world is gonna get better because Moore's Law is eventually gonna save us, I have really bad news for you. These models are getting a lot more data hungry with this vector data. The models themselves are getting bigger, and so is the amount of data that they want and need access to and want to consider. You show a machine learning engineer some vectors and they're like, can I rank 10,000 of them? And this is 10 kilobytes of UTF-8 data for one single vector right here. Then all of a sudden they want 10 kilobytes times 10,000 pulled into their Python pandas data frame, which is about 100 megabytes of data, and then pandas is gonna double that because it does an inefficient copy. So you need something like 200 megabytes in your Python microservice process that you're gonna have to load, and that's gonna be really slow, and they're gonna say, oh, I've just blown my latency budget. How about 100? Okay, maybe 10. I just need 10 vectors now. And again, it's the same conversation, the same problem over and over again, where you end up cutting the efficiency of these systems at multiple levels.

So we ended up going a different route at Instacart, and we have, I think, several really innovative pieces of technology that have come out of our research there. One of them is PgCat, created by my co-founder. PgCat is a proxy pooler that sits in front of a sharded cluster of Postgres databases. If you accept that you're gonna do more work in the database, you absolutely have to be able to scale the database horizontally, and what that means is you put a proxy in front, and then sharding gets really easy. You can shard, you can replicate, and you can vertically scale these machines. Commodity hardware now offers 256 cores and nearly a terabyte of RAM in a single machine.
This is actually a pretty enormous piece of hardware, so vertical scalability goes a long way for a lot of companies. There are multiple reports of companies getting to billions of dollars in revenue on a single Postgres instance. But long term, when you're a multi-billion dollar company, you will want multiple Postgres instances. You will want sharding for your application. So PgCat makes this transparent. You can put it in front of your single Postgres primary, you can start adding shards and replicas behind it, and your application engineers no longer have to worry about that big mess.

Similarly, the way Postgres works internally, for any single one of these instances, is that every connection to Postgres gets its own backend process, forked from the postmaster. This forking to establish a new connection is a relatively expensive operation. So PgCat, sitting in front of the database when a new connection comes in, can reuse existing connections, and it can actually alleviate a lot of load on your primary by managing the connections for it, keeping long-lived connections open in the background and just reusing them for short-lived front-end clients. This is even more important when you get into the serverless world of JavaScript applications, where they're constantly spinning up new Lambda processes and shutting them down. You don't wanna have to create a new backend Postgres connection every single time. Finally, what this allows us to do is keep a long-lived GPU cache on the Postgres server, where the models and features that we're storing in Postgres can actually be moved to a shared pool by the single postmaster, so you have hundreds of clients able to access these things concurrently.

So you have a mature, robust serving infrastructure that's been battle-tested by hundreds of organizations. It's an efficient protocol. It's an efficient memory layout. Everything we do in PostgresML is written in Rust, including our extension for Postgres. We do call into Python for the latest Hugging Face Transformers, where the research papers have only released Python code to make these models work, but by and large, our data operations and manipulations are as efficient as possible in memory. We never have to go over the wire in PostgresML. Everything can happen on a single server, on a single shard, in a single memory space, including the GPU inference.

So I wanna talk about what we actually mean when we say end-to-end machine learning. This is a diagram put together by Andreessen Horowitz, a famous venture capital group here in Silicon Valley. Each one of these boxes represents a function or operation or server that you need to be running and considering when you're actually doing end-to-end machine learning in real time in production. You'll recognize lots of names in lots of these boxes, but there are some obvious ones, like a feature store and a feature server. There's nuance here. Not all of these terms are entirely well-defined. Some people at conferences speak about them as if they are, but when you actually get into any particular organization, you'll realize that there is some ambiguity. But it is nice to have a roadmap laid out like this. The only takeaway I have from this slide is that PostgresML needs to do all of these things, and does all of these things.
So while many solutions are point solutions, our goal at PostgresML is to take raw data from data sources, store it in a single database or a sharded cluster, and then be able to connect clients. And the great thing about connecting clients is we don't have to support every single language and every single connector, because it's Postgres. It's already done. When you have a 35-year-old database, we don't have to worry about integrating with every piece of technology out there, because it's already been integrated. And there's a whole host of other solutions that operate on top of Postgres that provide good, solid workflows. If you want data pipeline management, which we don't do, well, there are great tools for Postgres that do. You can use Airflow, you can use dbt, you can use any of the solutions out there. I generally prefer open source ones, but lots of proprietary vendors also support and connect to Postgres.

If I haven't convinced you from a strictly database perspective that these capabilities of Postgres are interesting and powerful, I think there's an interesting machine learning perspective here too: all machine learning models are a generalization of the underlying data. They should ultimately be a compression artifact of all of the data that they're trained on. Useful models that serve some purpose in the world should be small relative to the data that they were trained on, and they should be modified infrequently compared to the data, which means that the data itself is bigger and more dynamic. And when you have something that's bigger and more dynamic, there's this concept called data gravity: you don't wanna be moving the ever-changing mountain of data to your application layer. You wanna move the small little ML application instead, and these small little ML applications are really just a few function calls, for the most part. Now, what's going on at Meta, what's going on inside Google, what's going on inside these FAANG companies producing foundation models? Yes, there are research teams, and yes, they have several hundred lines of code that they publish with their paper for their innovation. But those several hundred lines of code are wrapped up and encapsulated behind a single entry point function call: you pass it your natural language string and out comes an answer. So it's much easier to take that model that comes out, and now they're coming out once a week with a new foundation model, but take that once a week and put it in your database, rather than take all of the data that's constantly changing and bring it up to your model on every user request.

So I wanna talk about RAG, since it's a hot topic here today. A lot of people are talking about how to do RAG. Most of the ways to do RAG, whether you're using LangChain or LlamaIndex or any of the other homegrown ways, look about the same. You start with a user query that's coming in from the left-hand side of this diagram. You'll wanna do a SQL query against your data store; I have Mongo in this slide deck because it's popular, but Cassandra is good too. You'll also wanna send that incoming natural language query to Hugging Face, have it generate your embedding, so that you can send your embedding vector to your Pinecone vector database. You'll do your nearest neighbor lookup and get those back in LangChain. Hopefully you've gotten your answers back from Mongo at the same time.
Then you take that and you generate your prompt, using your PyTorch pruning model, because prompts can only carry a certain amount of context. You get that back into LangChain, and you send it over to OpenAI to generate your final response. This whole mess takes a second, or four, to run. We support all of the equivalent open source models and functionality inside PostgresML. You can do all of this in a single query in PostgresML, and it's much faster and much simpler. It's a lot less infrastructure to manage, and a lot less surface area and networking calls for things to go wrong.

So the way we do this: PostgresML provides three functions that cover classical machine learning. There's a training function, there's a deploy function, and there's a prediction function. I'm gonna zip through this because we've only got a few minutes left and I wanna leave time for questions, but there's lots of documentation on these functions. These functions take a lot of arguments. You can train with over 50 different algorithms. Deployment helps you manage the lifecycle: as you train new models and your data changes, you need to be able to control which model is active for any particular project. And prediction is how you use a model.

I do wanna cover these, though: the hot new LLM functions. Anything that you would call Hugging Face Transformers for, you can call inside of your database. When you call pgml.transform inside of your database, you pass it a Hugging Face transformer model string. Your database will go download that model from Hugging Face, cache it in RAM, and move it into the GPU if you have a GPU in your database. It will then pass any arguments from any tables or any other inputs in your query to that model, and give you the model's output back. You can also do fine-tuning in the database and, of course, generate embeddings in the database for your RAG applications.

I'm gonna hop over real quick. I just wanna show you what this looks like in practice. This is our homepage, and this is a little SQL query. Let's see if I can blow this up so you can see it. In this case, we're using the transform_stream version of the call; it's a modification of transform. The web has had WebSockets to let us do these dynamic updates for the past five years or so. Postgres has had cursors for the past 20 years or so. They do basically the same thing. As you can see, we're using a model from TheBloke, if you're familiar with the current LLM ecosystem. This is one of the fun new Mistral models. We can pass it an input with just the phrase "AI is going to", and we can tell it how many tokens we want back. We can hit run, and pretty quickly Mistral generates a bunch of output here. The output seems pretty reasonable to me. Obviously we can change the string; these are just inputs. I like open source AI better than I like plain AI, and I like that its response to "open source AI" is more positive, which is even cooler. So this is really fun. We have a bunch of these different generation tasks that you can do. Again, we support pretty much everything Hugging Face supports out of the box, as a wrapper on top of it or an integration point.

Another thing that we do, if I sign in real quick... let's see, I might actually be signed in in another tab. Yeah, so I've got a database running here. I'll show you real quick some embedding stuff, some RAG application stuff. What I did before this talk is I downloaded all the Amazon reviews.
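For reference, the three classical functions and a transform call along the lines of that demo look roughly like the following in SQL. This is a minimal sketch based on the PostgresML documentation; the project, table, and model names are placeholders, and exact argument names and options vary by version, so check the current docs rather than treating these as authoritative signatures:

```sql
-- Train a model on a table (project name, task, and algorithm are illustrative choices).
SELECT * FROM pgml.train(
    project_name  => 'purchase_propensity',
    task          => 'classification',
    relation_name => 'training_data',
    y_column_name => 'purchased',
    algorithm     => 'xgboost'
);

-- Deploy whichever trained model scored best for the project.
SELECT * FROM pgml.deploy('purchase_propensity', strategy => 'best_score');

-- Predict with the active model for the project, passing features inline.
SELECT pgml.predict('purchase_propensity', ARRAY[0.1, 2.0, 5.0]);

-- LLM inference: download, cache, and run a Hugging Face model in the database.
-- The model string is an example; any text-generation model should work the same way.
SELECT pgml.transform(
    task   => '{"task": "text-generation",
                "model": "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"}'::JSONB,
    inputs => ARRAY['AI is going to'],
    args   => '{"max_new_tokens": 100}'::JSONB
);
```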
There are something like 5 million movie reviews in that Amazon dataset. I can click run here on this little notebook; you'll notice that our notebooks are SQL. Oh man, with this exploded font these rows are not looking good, but I'll try and find one here. Here we go. This is an example of the Amazon data. I'll scroll on past that and get down to one of the more interesting things. This is an example call to pgml.embed. You can see that we can pass it a transformer model to actually perform the embedding with, and we can pass it some input text. We can run this, and we get back an embedding for that text from the database. So in this case, you don't have to worry about your inference service. You don't have to worry about your model store. You don't have to worry about any of those other components that I showed you in that Andreessen Horowitz diagram. They're all right there inside your Postgres database, so you can access them all with a simple query.

And once you can generate embeddings inside the database like this, if you're familiar with Postgres, well, you can see how big these embeddings are. But if you're familiar with common table expressions and the real power behind composable queries, then you can start to build a full program in a single SQL statement. So we'll take that embedding that we generated in the previous example. We're not actually going to return that embedding, because that's 10 kilobytes of data that's useless to a human being. We'll leave it in the database as a common table expression. But then we can do a nearest neighbor lookup against five million other embeddings that we've already computed in the database for all the other Amazon reviews. And what we want is the Amazon movie reviews that match closest to "best 1980s sci-fi movie". So we can run this query, and you can see it takes about 70 milliseconds to return the top five movie reviews out of five million Amazon movie reviews. And you can see the review body: best 80s sci-fi movie, the best 80s sci-fi horror movie is The Blob. So you can actually see that RAG applications are pretty simple to build with just a SQL query and a database, if you just add two or three function calls to them. It's super fast, it's super efficient, and it's all yours. This is 100% open source. You can take it and do whatever you want with it. It's MIT licensed. We also have hosted versions if you're not as comfortable hosting your own database. But I think I'm at time now, so I'll stop there. If anybody has any questions, I'm happy to step over to the side and answer them after this talk.
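For reference, the shape of that combined embedding-plus-nearest-neighbor query from the demo is roughly the following. The table, column, and model names are placeholders rather than the exact ones used on stage, and the distance operator assumes the pgvector extension is installed:

```sql
-- Generate the query embedding and search the stored review embeddings
-- in one statement, all inside the database.
WITH query AS (
    SELECT pgml.embed(
        'intfloat/e5-small-v2',          -- stand-in for any Hugging Face embedding model
        'best 1980s sci-fi movie'
    )::vector AS embedding
)
SELECT
    r.review_body,
    r.review_embedding <=> q.embedding AS distance   -- pgvector cosine distance
FROM amazon_movie_reviews r, query q
ORDER BY r.review_embedding <=> q.embedding
LIMIT 5;
```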