Thanks for coming. Today I'll be talking about the evolution of Milvus, a cloud-native vector database. For those of you who are unsure of what a vector database is, no worries; I'll get into that a little later in the presentation. To introduce myself real quick: my name is Frank, and I'm currently an architect at Zilliz, a startup based out of Silicon Valley. We build, as you could probably guess, a vector database, as well as the broader vector database ecosystem. A quick overview of what I'll be talking about today: I'm first going to go over what unstructured data is, which ties in very deeply with the concept of a vector database and what it's meant to do. I'll also talk a bit about embeddings; vector databases really lie at the intersection of data infrastructure and AI/ML, and embeddings are central to what we're doing here. Then I'll give an overview of vector databases, talk about the Milvus architecture and the differences between the generations of Milvus, and finish with some real-world use cases. So without further ado, let's dive right in. All right, so what is unstructured data? Unstructured data is essentially any data that does not conform to a predefined data model. It's always helpful to give examples, so: pretty much all human-generated data (images, video, audio, text, natural language) is unstructured. It can't really be stored in a table-based format, or in, say, an object database or a wide-column store like Cassandra. It can vary in size, shape, and structure. Imagine an image versus a paragraph or a document: both are large binary blobs, but they are read very differently from disk.
There are also more interesting examples of unstructured data. Some of the ones I like to bring up are user profiles, geospatial data, map data, even protein and molecular structures. If you expand your horizons a little and think about the different types of data we store today, you'll find that most of it is unstructured. And this really gets into the evolution of data itself. When computers first came to be, databases and data storage solutions were a big part of a lot of engineering challenges, and most of those solutions started off with structured data: numbers, characters, then maybe some text. Think of an employee database storing date of birth, name, title, role, and so forth. Then, as the industry evolved and we entered the era of mobile computing, as devices became smaller and capable of storing and transmitting more and more data, we ended up with more complex databases and more complex types of data as well. In the middle here is an example of a JSON document, something you might store in MongoDB. But over on the right is unstructured data, and that raises the question: how do we store, search, and index all of this unstructured data when there is no single model we can use to represent it? This is very much an open question. The way it's typically done today, and this ties in closely with the deep learning boom, is that we take unstructured data and, through deep learning models, turn it into vectors called embeddings. These embeddings are very powerful semantic representations of your input data.
So you can imagine: if I have an image, whether it's a 4K image or a very small one, regardless of its size or contents, I can distill that information into a fixed-length vector called an embedding. This is a very powerful way to work with a variety of different types of unstructured data. Here's an example that really shows the power of embeddings and how they tie into the idea of vector databases. This is a visualization I did a while ago with the TensorFlow Embedding Projector: word embeddings projected into three-dimensional space. At the very top there are some scientific words (chloride, glucose, ions, proteins); at the bottom you see a lot of names (Phil, Gary, Tom, Chris, Frank); on the right you see a lot of functions and processes, and so on. This goes to show that taking these very high-dimensional embeddings and projecting them into 3D space reveals just how powerful they are for representing data that would otherwise be pretty difficult for a computer to understand. All right, now that we've covered unstructured data and given a very brief introduction to embeddings, I want to talk about how they're used for semantic analysis, which ties in closely with what a vector database is and what it's meant to do. The definition of an embedding is a vector representation that encodes the underlying meaning behind unstructured data, and we saw that in the previous slide. But embeddings are much more powerful than just taking something, turning it into a vector, and leaving it at that. Here we see an example of multimodal embeddings.
The idea is that I can take natural language and an image and embed them into the same space. So on one side, in natural language, I have the sentence "a hawk perched on a tree branch," and next to it an image of a hawk perched on a tree branch. For those of you who stay in touch with the advancements in AI and ML, some of the more recent models, Stable Diffusion for example, are diffusion models inspired in part by thermodynamics, and they're a great example of embeddings in action. Embeddings really form the core of machine learning models and of unstructured data analysis. One probably lesser-known way embeddings can be used is what I like to call embedding arithmetic. It's difficult to describe in words alone, but if you look at the relation between man and woman versus king and queen, the directional change between the pairs is itself part of the semantics of the embedding. These are really powerful ways to represent data. Likewise, if I want to find the nearest neighbors of a particular input piece of unstructured data, I can do so with embeddings. In the top row, given a query image, I can search for its nearest neighbors; in this case, they're all plants. The middle row is mountain landscapes, and the bottom row is forests. The way we do this kind of nearest-neighbor search is through something called vector indices. Just as traditional relational databases or object databases have their own ways of indexing data, vectors are indexed in their own unique way as well. That's how we're able to do very fast search across massive quantities of vectors: billions, maybe one day even trillions.
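To make the embedding-arithmetic idea concrete, here is a minimal sketch with hand-crafted toy vectors. Real word embeddings from models like word2vec or GloVe live in hundreds of dimensions; the 3-d values below are illustrative, not from any actual model, but the arithmetic works the same way.

```python
import numpy as np

# Toy 3-d "embeddings" crafted so the man->woman direction matches the
# king->queen direction (a hypothetical example, not real model output).
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.2]),
    "woman": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen.
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
nearest = max(embeddings, key=lambda w: cosine(embeddings[w], target))
print(nearest)  # queen
```

In a real pipeline you would also exclude the query words themselves from the candidate set; here the result is "queen" regardless.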
You can generally think of vector indexes as falling into four different categories: hash-based, tree-based, inverted-file-based, and graph-based. I won't go into the nitty-gritty of these; each individual index type probably warrants its own half-hour to one-hour talk. Just know that there are many different ways to index embeddings and use them in production systems. All right, now that we've gotten unstructured data and embeddings out of the way, let's talk about vector databases; now we're really into the exciting part. So what is a vector database? Given what we were talking about earlier, the definition should be clear: a vector database is a database purpose-built to store, index, and search large quantities of embeddings, and by extension, large quantities of unstructured data through their embedding representations. Now, I do want to talk very briefly about vector databases in production before moving on to the architecture and evolution of Milvus itself. This is a very simplified model of how MLOps and data pipelines are used in production, but I think it gives a general idea of where vector databases fit into the ecosystem, especially as a key piece of underlying data infrastructure. Up here in light blue are the training procedures, things that happen in the background; in green are the things you'd want to serve in production. Data flows in, I have a way to serve the models, and once those models are in production, they produce embeddings, and those embeddings go into a vector database for further processing. That allows me to do large-scale semantic analysis of unstructured data. Okay, with that out of the way, I want to talk a little bit about Milvus itself.
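Of the four index families mentioned a moment ago, the inverted-file (IVF) approach is the easiest to sketch: partition the vectors into buckets around centroids, then at query time visit only the few closest buckets instead of scanning everything. Below is a toy NumPy sketch; real IVF implementations train centroids with k-means and tune the list count and `nprobe` carefully, whereas the numbers here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8))  # 1000 random 8-d vectors to index

# "Train" the index: pick centroids. Real IVF uses k-means; sampling
# data points keeps this sketch short.
n_lists = 16
centroids = data[rng.choice(len(data), n_lists, replace=False)]
buckets = [[] for _ in range(n_lists)]
for i, v in enumerate(data):
    buckets[int(np.argmin(np.linalg.norm(centroids - v, axis=1)))].append(i)

def ivf_search(query, nprobe=4):
    # Visit only the nprobe closest buckets instead of all 1000 vectors.
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = [i for b in order for i in buckets[b]]
    return min(candidates, key=lambda i: np.linalg.norm(data[i] - query))

# A query very close to a known vector should recover that vector.
query = data[42] + 0.01 * rng.normal(size=8)
print(ivf_search(query))
```

The speed/recall trade-off lives in `nprobe`: probing more buckets raises recall toward the exact answer at the cost of scanning more candidates.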
So, Milvus supports a number of hardware accelerators. Taking a step back: because Milvus is very much a database centered around AI, ML, and embeddings, it is inherently compute-intensive. Support for SIMD instruction sets such as SSE, AVX, and AVX-512 on CPUs, in addition to accelerators (GPUs, FPGAs, and later on TPUs, NPUs, and so forth), is a very important feature that I think all vector databases should have, and Milvus has this kind of support. We also support a lot of key database functions, functionality you would want in a traditional database, such as caching, replication, and horizontal scalability; these are all really key components of Milvus. There are also multiple options for indexes and similarity metrics. We support Faiss-based indexes, Annoy, HNSW, and so on. We also support binary vector indexes, which is a fairly unique feature of Milvus. And on top of that, we support many distance metrics as well. Going back to the earlier example: we have these indexes, but when we're trying to find the nearest neighbors of a particular vector, we need a distance metric to filter out those nearest neighbors. So, coming back to the slide: we have multiple options for similarity or distance metrics, namely Euclidean distance, cosine similarity, and dot product. If you have normalized embeddings and you select dot product as your similarity metric, that is actually equivalent to cosine similarity. And because we have binary vector indexes, we have binary distance metrics as well, such as Hamming and Jaccard. We also have a number of SDKs to round out the Milvus ecosystem: Python, Go, Node.js, Java, and I believe a C++ SDK is being worked on right now as well.
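That equivalence between the dot product on normalized vectors and cosine similarity is easy to verify numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=8), rng.normal(size=8)

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# L2-normalize both vectors to unit length.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

# On unit-length vectors the inner product IS the cosine similarity,
# which is why deployments often normalize embeddings once at write time
# and then use the cheaper inner-product metric at query time.
print(np.isclose(np.dot(a_hat, b_hat), cosine(a, b)))  # True
```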
Another key feature of Milvus, and I'll talk about this more when I get to the evolution of Milvus itself, is that it's cloud-native. It's Kubernetes-native, with deployment through Helm. We have native S3 (or S3-compatible) support using MinIO. A lot of cloud vendors these days will actually use S3 as the underlying object storage endpoint, and Milvus takes great advantage of that. Milvus is also fully distributed, so it is highly elastic and horizontally scalable. It achieves this by disaggregating read, write, and background services into separate planes, in addition to disaggregating storage from compute and stateless from stateful components. I'll get into exactly what that means in the next couple of slides. I also want to round out the picture of the vector database ecosystem a bit: there is vector data ETL as well. This isn't the meat of this particular talk, but it's worth mentioning because it is such a core part of the ecosystem. We have a younger sister project to Milvus called Towhee, which is essentially about turning unstructured data into embeddings. It's built on top of a variety of open source libraries; timm, TorchVision, and Hugging Face Transformers are some of them. Towhee and Milvus are both open source. Towhee also provides over 400 of what we call pipelines, centered around embeddings. These include image, video, audio, and text embedding pipelines, but there are also image tagging, face landmark detection, and some other non-embedding pipelines in there as well. I won't dive too much into this, but it shows that Towhee maintains a degree of flexibility there.
There's also a training/fine-tuning subframework, and one thing I think is really cool is something called DataCollection, a method-chaining API that lets you build entire applications or data pipelines in just one line of code. You can essentially chain a lot of operators together to create a pipeline that turns unstructured data, say natural language, into an embedding you can use in Milvus. There are also administrative and visualization tools; for that set of tools we have what's called Attu. Again, I won't talk too much about it, but if you're interested in Milvus I definitely recommend checking out Attu as well. All right, let's talk a little bit about the Milvus architecture now. This is a very high-level overview of what Milvus is as a cloud-native vector database. Before I dive too deep, I want to mention that this is Milvus 2.x; we are currently on version 2.2, and Milvus 2 is meant to be a fully cloud-native version of Milvus. Milvus 1.0 is somewhat akin to MySQL, in that it's really meant to be standalone: you can do sharding, but it wouldn't really scale beyond, let's say, 50 nodes. Milvus 2.0 is meant to be fully cloud-native; it takes advantage of cloud architecture such as object storage, and of the patterns for running an application at scale in the cloud. Before diving deeper into the architecture itself, I also want to mention that we have a paper published this year in VLDB 2022, one of the top database conferences, which describes this cloud-native vector database in much more detail, everything from the nitty-gritty, such as the timestamp oracle and the log broker, all the way up to the big picture of why we decided to architect Milvus this way.
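Before moving on to the architecture, the method-chaining style that DataCollection provides can be mimicked in a few lines of plain Python. This toy `Pipe` class is purely illustrative and is not Towhee's actual API; the string operators stand in for decode/embed/normalize operators.

```python
class Pipe:
    """Toy method-chaining pipeline -- illustrative only, not Towhee's API."""
    def __init__(self, items):
        self.items = list(items)

    def map(self, fn):
        # Apply fn to every item and return a new chainable Pipe.
        return Pipe(fn(x) for x in self.items)

    def to_list(self):
        return self.items

# An entire "pipeline" as one chained expression: lowercase, then tokenize.
result = (Pipe(["The HAWK", "a Branch"])
          .map(str.lower)
          .map(str.split)
          .to_list())
print(result)  # [['the', 'hawk'], ['a', 'branch']]
```

The appeal of the pattern is that each stage returns the pipeline object itself, so an arbitrarily long sequence of operators reads as a single expression.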
All right, diving into the architecture itself: Milvus is composed of four layers, which you can think of as planes or services. The first is the access layer. The access layer is really multiple proxy nodes, and each proxy node routes the individual types of requests to the right location. In this case, it points data definition language and data control language instructions to the coordinator service, and data manipulation language instructions to the message storage layer, the log broker. Now, I won't dive too deep into these data sublanguages; it's probably not too interesting for folks who aren't hugely into databases, but just so we've covered them: DDL, or data definition language, is any command that allows you to define or modify the database schema. DML, data manipulation language, is any command that lets you store, modify, or retrieve data. And DCL, data control language, helps you define user rights and permissions to access the database itself. So the access layer takes all these individual commands and routes them to the proper location. Now, talking a little bit about the coordinator layer (my webcam is actually blocking this a little): each coordinator is associated with a single type of worker node cluster. I'll get to that in an upcoming slide, but essentially these coordinators are all stateful services, and there is one instance of each coordinator to manage its corresponding cluster. The root coordinator handles DDL and DCL requests, whereas the query coordinator manages the query node cluster. The data coordinator manages the data node cluster, and it is also responsible for triggering background operations.
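As a toy model of the access layer's dispatch decision (this is not actual Milvus proxy code, and the verb sets are simplified stand-ins), the routing logic reduces to classifying the statement type:

```python
# Simplified statement classes -- illustrative, not Milvus's real grammar.
DDL = {"CREATE", "DROP", "ALTER"}              # schema definition
DCL = {"GRANT", "REVOKE"}                      # user rights and permissions
DML = {"INSERT", "DELETE", "SEARCH", "QUERY"}  # data manipulation

def route(request: str) -> str:
    verb = request.split()[0].upper()
    if verb in DDL or verb in DCL:
        return "coordinator service"   # DDL/DCL go to the coordinators
    if verb in DML:
        return "log broker"            # DML is written to message storage
    raise ValueError(f"unknown request: {request}")

print(route("CREATE collection docs"))  # coordinator service
print(route("INSERT into docs"))        # log broker
```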
So in Milvus, as a vector database, once data comes into the database, it creates what are called delta files. These delta files are essentially diffs between the latest data or modifications that have come in and what already exists in the database. Once these delta files reach a certain size, they trigger background data operations such as flush and compact, and that ties in closely with what the data coordinator is meant to do. The index coordinator manages the cluster of index nodes, and it determines when indexes should be built. Building on the idea of delta files: once I have, say, some number of inserts, which I can define via a Milvus parameter, it will trigger a rebuild of the index. Okay, now a bit about the worker layer itself. All workers are stateless, and this allows Milvus to scale each of the worker node clusters up and down as the user or the application sees fit. The data nodes retrieve incremental log data, then pack and store it as log snapshots in object storage, in S3. They also process mutation requests. I won't dive too deep into that, but just know that the data nodes are really responsible for maintaining a lot of log data. The query nodes, as the name suggests, load indexes and then run queries at scale. And the index nodes build indexes on the inserted data. One thing I do want to note is that each of these individual clusters can be scaled up and down as your application sees fit.
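The size-triggered flush behavior described above can be sketched as a toy buffer: accumulate incoming deltas and pack them into an immutable segment once a threshold is crossed. The threshold value here is arbitrary, not a real Milvus default, and this is a conceptual sketch rather than Milvus's implementation.

```python
class DeltaBuffer:
    """Toy sketch of threshold-triggered flushing, not Milvus internals."""
    def __init__(self, flush_threshold=4):
        self.pending = []            # in-memory delta rows
        self.flushed_segments = []   # immutable "segments" in storage
        self.flush_threshold = flush_threshold

    def insert(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.flush_threshold:
            # Pack pending deltas into an immutable segment ("flush").
            self.flushed_segments.append(tuple(self.pending))
            self.pending = []

buf = DeltaBuffer()
for i in range(10):
    buf.insert(i)
print(len(buf.flushed_segments), len(buf.pending))  # 2 2
```

With ten inserts and a threshold of four, the buffer flushes twice (at the fourth and eighth inserts) and two rows remain pending, waiting for the next trigger.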
So for example, if I have an application that requires lots of queries but not many modifications or inserts, I can keep my query node cluster fairly large while keeping the data node cluster and the index node cluster a little smaller, because I don't have to rebuild indexes or maintain incoming data very often. All right, the final piece of the puzzle here is the storage layer. I won't go too much into metadata storage or the log broker; there's a lot of detail about those in the Milvus 2.0 paper. But object storage is essentially there for three different purposes: to store snapshot files of the logs themselves (one thing I forgot to mention earlier is that Milvus uses a concept called log as data, where incoming data is packed and stored in these logs); to store index files for scalar and vector data; and to store intermediate query results as well. So object storage really is a pure storage layer, separate from all of the stateless and stateful nodes and from the access layer as well. All right, that was probably quite a bit to unpack, and for those of you who are not as interested in systems or in how databases really work, I wanted to condense a lot of this information into some key takeaways on this particular slide. There's a single coordinator instance per service type, and each of these coordinator services (except for the root coordinator) is responsible for managing the corresponding worker node cluster. In Milvus, data is stored in collections, which are akin to collections in MongoDB or tables in relational databases.
What this architecture provides is disaggregation of querying, indexing, and data management, in addition to disaggregation of the different planes, and that gives us the ability to scale horizontally while minimizing Milvus's compute requirements. There's also (in hindsight, I probably should have put this bit in an earlier slide) the concept of log as data, where operations are centralized around the log broker, and CRUD operations are done by subscribing to, writing to, and consuming logs. All right. Now that we've gotten that out of the way, I want to talk about some real-world use cases for vector databases. In this particular section I'm only going to go over four of them; there are a wide variety of use cases out there, hundreds, but I decided to focus on four I really like to talk about. The first two are probably lesser-known use cases, but I think they show the power of vector databases and how they can be used in a wide variety of applications; the later two are more popular use cases for vector databases today. The first is the art gallery experience, something we did with the Cleveland Museum of Art. The idea here is really reverse image search: given a picture, say of someone down here doing a bit of a pose, you can search within the Cleveland Museum of Art's database for artworks that correspond to your input image. While the application itself may not sound super exciting, what really stood out about it to me is the fact that the Cleveland Museum of Art was able to bring this application, called ArtLens AI, up and running within a week.
When folks talk about democratizing X or democratizing Y, I think being able to bring this type of data infrastructure to scale, and to have really small engineering teams embrace the power of vector search, is one of the things Milvus does really well: democratizing data infrastructure for AI and ML. The art gallery experience use case is a great example of that. There's also threat detection, something we did with Trend Micro. The idea is that there are so many APKs, these Android application packages, being generated per day, and to understand which ones might be malware or a virus, what Trend Micro did in conjunction with Milvus is turn these APKs into vectors using feature engineering. For example, by looking at the number of disk reads and writes an APK performs per second, or its network traffic patterns, they distill all this information into a feature vector and search for it in Milvus. We were able to generate a great result for them there, completely opening up new ways of detecting threats. Next is property search, or rather personalized search, excuse me; now we're getting into some of the more common use cases. The idea here is that if you've ever tried to look at different houses, instead of searching for property only by, say, number of bathrooms and number of bedrooms, these labels or tags, you can take an entire floor plan and distill that knowledge into a vector, with the area, outline, and orientation in other collections, and then recommend a variety of different types of housing based on what a particular user likes or on their history.
This is an example that really shows how powerful vector search is in general and how Milvus can be used in this type of application. And the last one, probably one of the most well-known uses of vector databases today, is product recommendation. If I can embed user profiles and products into the same space, or take descriptions and images of products and search for similar products based on those semantics, that is one way Milvus can be used to generate completely new revenue streams for e-commerce sites, or for applications that rely on targeted advertising. So yeah, that actually brings me to the end of the presentation today; I know it's a bit abrupt, and I apologize for that. For those of you who would like to know a little more about the vector database ecosystem, on the left I have all of our open source repos: Milvus, Towhee, and Attu. And on the right is my personal contact information, so feel free to get in touch with me if you have any questions, comments, or concerns about vector databases, vector search, or AI/ML. If you want to talk about the weather or anything like that, my figurative door is always open. With that being said, I'd like to bring it to Q&A. Thank you for listening to the presentation, thank you for coming today, and I hope this was informative.