Hi everyone, thank you for joining me for this presentation, whether you're online or in person. Today I'll be going over Milvus, our vector database, built to accelerate similarity searches on large-scale vector datasets. A little bit about myself: my name is Philip Holtmeyer and I'm a data engineer at Zilliz. As background, Zilliz is a data company that was started in 2017, and at this point we're 80 people strong. Zilliz is the creator of and the main contributor to the Milvus project. Here's my info; if you have any questions afterwards, send me a message.

As a quick rundown of what the project has been through so far: the idea started back in 2018, we open sourced it in 2019, joined the LF AI & Data Foundation in 2020, and in 2021 we released our 1.0 and 2.0 versions and also graduated from an incubation project to a graduated project in the LF AI & Data Foundation.

Here are the contents for today's talk. First I'll go over the problem of unstructured data and what Milvus is trying to solve. Then I'll cover what Milvus is, where it fits into people's pipelines, and what features it offers. After that I'll go over the architecture, a mid-level overview, nothing too complicated. And finally some real-world use cases, to see where it's being used and how cool the technology is.

So first, let's go over the problem of unstructured data and why we need Milvus. There are three types of data.
There's structured data, unstructured data, and semi-structured data.

With structured data, the data has a defined pattern and a solid structure, and it fits in tabular form. A good way to think of this is something you might write into an Excel document: strings, numbers, dates, things that can easily be compared to each other.

Then there's unstructured data, which accounts for roughly 80% of all the data we're currently holding. It has no fixed structure, and it's really hard for machines to understand. Things that fall into this category are audio, images, videos, and language: stuff that needs more processing before machines can understand it.

Lastly there's semi-structured data, which is kind of in between. A good example, I believe, is an email. All emails have a header, a subject, a body, and maybe some attachments, so overall there is a structure, but what's inside those areas is unstructured; an email body is just language, and machines will have a hard time understanding it. A few other examples are XML and JSON.

So why is unstructured data difficult?
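Before getting into that, the email example above can be made concrete with a quick sketch. The record below is hypothetical, invented just for illustration: the outer fields give it a predictable schema, while the body stays as free text that a machine can't directly compare.

```python
import json

# A hypothetical email as semi-structured data: the outer fields form
# a predictable schema, but the body is unstructured free text.
email = {
    "header": {"from": "alice@example.com", "to": "bob@example.com"},
    "subject": "Quarterly report",
    "body": "Hi Bob, the numbers look better than last quarter.",
    "attachments": ["report.pdf"],
}

# The structured parts are trivially queryable...
print(email["subject"])  # Quarterly report

# ...but the body is just language: there's no built-in way to ask
# "how similar is this body to that one?" without further processing.

# The whole record round-trips through JSON, the classic
# semi-structured format.
assert json.loads(json.dumps(email)) == email
```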
The big reason is that with structured data, you can rely on traditional parsing methods and relational databases to store and search through your data. That couldn't really be done with unstructured data until now. With the rise of deep learning methods and neural nets, we finally have the key to unlock this data and actually make it machine readable. The process has turned from figuring out what to do with these things into figuring out what kind of vector computations we have to do, and by vector computations I mean that when you throw data through a neural net, it usually comes out as a vector.

Now I'll go over vectors a bit and how they're different from numbers. Numbers have the basic arithmetic operations: addition, subtraction, multiplication, and division. They're easily comparable; there are pretty much three states, you can be greater than, less than, or equal to, and this results in some really easy indexing methods. One of the most common is the B-tree. For those of you who aren't familiar with indexing, indexing is a way to rewrite your data so that you can have quicker access speeds, usually at the cost of write operations and storage space.

Vectors, on the other hand, don't really have these simple arithmetic operations. The main operations you use with vectors are similarity calculations, and two of the main ones are Euclidean distance (the L2 norm) and cosine distance. These calculations can be a little difficult, and the result is that you don't have a direct way of comparing vectors. That's where approximate nearest neighbor search comes into play, and we can see two examples of it here: on the bottom left we have clustering, and on the bottom right we have a graph-based method. These are just ways to index vectors even though they can't be directly compared to each other, and I'll go into a little more detail on the next slide.

So these are the four main vector index types.

First, you have hash-based indexes, for which the main library is FALCONN, using locality-sensitive hashing. It's kind of the opposite of a regular hash: with a regular hash you're aiming to minimize the number of collisions, while with locality-sensitive hashing you want collisions, you're maximizing them. To index your data, you first run it through the hashing algorithm to get your buckets, and then when you're searching for something, you run the query through the same hash to figure out which buckets to look in for the most similar results.

The next type is tree-based, and for tree-based we have Spotify's Annoy algorithm. The way to think about Annoy is that it's a binary tree. You take two random points and create a hyperplane between them to split the data, and this is your first split in the tree. You keep going down the line, selecting two random points on each side of the tree and creating another split, until you get to a specified number; with Annoy you can specify how many leaves you want at the bottom. This is useful because searching just turns into running down the binary tree: you take the first split, see which side your query falls on, and keep following that line until you get to the closest results at the bottom. To speed things up and make things more accurate, since it's randomized at the start, you can create a forest of trees, throw them into a priority queue, and you end up with a pretty fast and reliable method of searching all this data.

The next type is inverted-file based, and this is the most popular type. This is where Facebook's Faiss library comes into play, and it works through clustering: the key for every inverted file is the centroid of a cluster, and the data in that inverted file is just the raw vectors that fall in that cluster. When you go to search, you compare your search vector to the centroids, find which centroids are closest, and then just look through all the data in those inverted files. This is a really good one because it allows for expanding data: with a few of the other types you hit a limit at some point and can't add more to the index, or if you do, it's a long process of building a new one, while with the inverted-file approach you can just create some more inverted files.

Next we have the graph-based type, where the main library is HNSW. With this you build a multi-layer graph where the top layers are pretty sparse and they get denser as you go down. When searching, you look for the closest node on the top layer, find it, go down a layer, and keep repeating that process. As you go further down, the results get closer and closer because the graph gets denser. It's a pretty cool way of searching vectors. So those are the four main index types.

So, what is Milvus and where does it come into play? This is where I'll talk about what we offer and how it fits in. This is a pretty standard ML data pipeline; I'm not sure how many people are familiar with it. We can ignore the training part of the pipeline, because Milvus doesn't really have much to do with it. What we'll look at is the production part, and there are two major steps before Milvus. The first box is the data flow: this is where we're getting data into our pipeline. Think of it as, say, a user uploading an image from their phone, or you pulling data from the stock market, something like that. It's your unstructured data coming into the system. Next you have your model.
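Going back to the index types for a second, the inverted-file idea can be sketched in a few lines of Python. This is a toy version under stated assumptions (a random sample instead of learned k-means centroids, plain lists instead of Milvus's index files), just to show the two-step search: compare against centroids first, then scan only the closest list.

```python
import math, random

random.seed(0)

def euclidean(a, b):
    # Euclidean (L2) distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy dataset of 2-D vectors, plus a few "centroids" (a real system
# would learn these with k-means; random picks keep the sketch short).
data = [[random.uniform(0, 10), random.uniform(0, 10)] for _ in range(200)]
centroids = random.sample(data, 4)

# Build the inverted files: each centroid keys the list of raw
# vectors that fall closest to it.
inverted_files = {i: [] for i in range(len(centroids))}
for v in data:
    nearest = min(range(len(centroids)), key=lambda i: euclidean(v, centroids[i]))
    inverted_files[nearest].append(v)

def ivf_search(query, k=3):
    # Step 1: compare the query against the centroids only.
    bucket = min(range(len(centroids)), key=lambda i: euclidean(query, centroids[i]))
    # Step 2: exhaustively scan just that one inverted file.
    candidates = inverted_files[bucket]
    return sorted(candidates, key=lambda v: euclidean(query, v))[:k]

results = ivf_search([5.0, 5.0])
print(results)  # the nearest vectors to (5, 5) within the closest cluster
```

The payoff is that step 2 scans only one cluster's worth of data instead of all 200 vectors, which is exactly why this scales so well.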
This is where you decide which kind of model you're going to use for your data. Images have convolutional neural nets, language has transformers, and then there are LSTMs for, say, stock market data, which is more for sequential, time-sensitive data. After you throw your data through those models, you end up with a vector, and this vector is what Milvus deals with and what we store. You insert it into Milvus, store it, and index it; for inserting, that's where the pipeline ends.

When you're searching, you pretty much take the first two steps as well: your data comes in and you encode it. You need to use the same encoder to get the same type of vectors with the same parameters. Once you have your search vector, you throw it into Milvus and perform a search. You can say how many results you want and which parameters to use; every type of indexing method has different parameters. With this you get back the closest vectors, and from there it's up to you to decide what to do. Some use cases want to rearrange or reorder the results and do some more post-processing, and after that you give them to your customer. Then, ultimately, based on the customer's feedback, if the results are similar, if they're good, that means your encoder is working well; if not, you can take that feedback and train your model up to do a little bit better.
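To make that flow concrete, here is a deliberately tiny in-memory sketch of it. This is not Milvus's actual API (the real Python client is pymilvus and talks to a running server), and `fake_encoder` is an invented stand-in for a real model; the point is only the shape of the pipeline: encode, insert, then search with the same encoder and keep the top-k closest vectors.

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def fake_encoder(text):
    # Stand-in for a real model (CNN, transformer, ...): maps text to
    # a fixed-dimension vector. Hypothetical, just to keep this runnable.
    dims = 8
    vec = [0.0] * dims
    for i, ch in enumerate(text.lower()):
        vec[i % dims] += ord(ch)
    return vec

store = []  # (id, vector) pairs; Milvus would index these instead

def insert(item_id, text):
    store.append((item_id, fake_encoder(text)))

def search(text, limit=2):
    query = fake_encoder(text)  # must use the same encoder as inserts
    scored = [(cosine_distance(query, vec), item_id) for item_id, vec in store]
    return sorted(scored)[:limit]  # smallest distance = most similar

insert("a", "golden retriever puppy")
insert("b", "stock market prices")
insert("c", "golden retriever dog")
hits = search("golden retriever")
print(hits)  # the two dog entries rank above the unrelated one
```

Post-processing, reordering, or feeding the results back into training would all happen on the returned `hits` list, just as described above.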
That's the general pipeline of where Milvus fits in. Now I'll go over some of the features we support in Milvus.

First of all, it supports heterogeneous computing. We support many x86 instruction sets, including SSE, AVX2, and AVX-512. We also support NVIDIA GPUs, and we're currently working on support for certain FPGAs, MLUs, and other AI processors, but those are taking a bit of time.

Secondly, we support many key database functions: things like data partitioning, data sharding, all the CRUD operations (create, read, update, delete; you can't really be a database without those), and also filtered queries and searches.

Next, we provide top-of-the-line search performance based on those libraries you saw before (the tree-based, inverted-file based, and graph-based ones, plus a few others), which we implemented and improved on to speed up our system. And lastly, we support many application development environments: Go, Java, C++, Python, and a RESTful API, and we're working on a few more.

Next comes cloud nativity. With 1.0 we designed Milvus around being run locally, and then we realized that a lot of people want to run it on the cloud, to speed things up and make it more reliable. The issue was that we were using a shared storage system, which, as everyone knows, can bottleneck on a lot of search requests, and we really didn't like that. So with 2.0 we aim to be fully cloud native. We made it Kubernetes-native, and we're using Helm to deploy; Helm allows for pretty easy deployment and changing of the cluster: scaling up, scaling down, and a lot of other things. We based our storage design on MinIO, and for those of you who don't know, MinIO is a way to simulate S3 on your local storage, which is really useful because it lets you test everything out. MinIO also offers gateways for the other cloud services.
So if you want to use Azure Blob Storage or Google Cloud Storage, you just use this little gateway, and it translates those requests into their native API calls. We're going to work on supporting everything natively; it's just that S3 was the most used and the easiest to implement first.

And lastly, it's easily scalable and highly elastic. We disaggregated storage and compute, which lets you scale everything easily and gives you easy data recovery, and we also separated the reads, the writes, and the background services.

So now let's quickly go over a mid-level architecture design. You can get a lot more detail elsewhere, but I'm going to try to keep it fairly simple for now. This is our general Milvus architecture. The main idea is that we use a log as our backbone, which you can see in the center in the message storage. The reason we use a log is that it lets you split everything up into this kind of disaggregated system; before, we were using shared storage, and that didn't really work. In addition to the log system, we designed a unique time schema, which allows for accepting streamed data and batched insert data in a unified way. That's really good because it keeps things consistent across the system whether you're streaming or batch inserting. What you can see here is that there are mainly four parts of the system.
You have the access layer, the coordinator services, the workers, and the storage. I'll go into all of these a little later, but the main idea is that all of these layers are mutually independent, which allows easy scaling and easy data recovery.

Let's go over the access layer first. Before we do, there are a few definitions we should probably cover, because I'm not sure who's familiar with database systems. With databases, there are three main types of commands. First are data definition language (DDL) requests: commands to define and modify things like your table schemas, the names, what you're storing, and how your whole table looks. These don't actually deal with the data you're inserting; they create the frame for the data. Then you have data manipulation language (DML) requests: these are the ones actually inserting, storing, modifying, and retrieving the data. And lastly you have data control language (DCL) requests: commands that define rights and permissions through the system, so more administrative things.

So let's start with the proxy node. The proxy node is fully stateless, so it's easily scalable and its crash recovery is pretty simple. It's the user endpoint; before it you do have Kubernetes' simple load balancer, but this is where everyone accesses the system. What the proxy does is pre-process some of the requests. For example, it will check things like: say you're trying to create a table, does that table already exist? If so, instead of going through the entire system, it can tell you right away that you can't do that. It will also tell you if you're inserting data that doesn't fit the schema for the table. Say your table is designed to store vectors of dimension 512 and you try to insert a 400-dimensional vector: it will tell you right away that it won't work. That speeds things up, because it avoids going through the entire system just to throw an error at the end. All the DDL and DCL requests are routed to the coordinators, and everything dealing with data, the DML requests, is routed to the log, which is ultimately consumed by the worker nodes.

The next layer is the coordinator layer. These ones are not stateless, so they're a little bit important, and we're working on making them highly available and fixing all that up. The first and most important one is the root coordinator. This guy is pretty much in charge of everything: he handles all the DDL and DCL requests, and he's also the time oracle. We want one centralized time source here because we're working with timestamps for consistency, so he's in charge of that.

Next we have the data coordinator node. It's involved in triggering all the background data operations: things like flushing data to storage, compacting the data if it's very segmented, and so on. These coordinators are in charge of the background operations; they don't really affect how your queries run, it's more about keeping everything up and running smoothly. The data coordinator also maintains the metadata of the inserted data: things like how much data you're inserting and where it came from, information you might need later. And it manages the topology of its worker nodes, the data nodes: it controls them, tells them where to look, and handles when they disconnect and reconnect, those types of things.

Next we have the query coordinator. It coordinates all the searches and queries and manages load balancing for the searches, because the worker nodes are stateless; again, they don't know about anyone else. So it tells them which segments of your data to look at and what to search through. And it also manages the topology of its respective worker nodes, as they all do.

Lastly there's the index coordinator. It's in charge of assigning index building to the index nodes: each node needs to build an index on some part of the data, and this is the one that decides which data goes to which worker node. It also holds the index metadata for every index. If you're using Faiss or Annoy or one of these, they all have their own parameters, and those really change how the indexes work; there are also different parameters for searching, so you need somewhere to store all that. And it, too, manages the topology of the index worker nodes.

The next layer is the worker layer. This is, I would say, one of the most important layers; it does all the work. Overall, the workers are the ones that handle all the DML requests: all the data mutation, all the inserts, deletes, and updates. They're all stateless again, so they're easily scalable.

The data node is the one that mainly deals with inserts. What it does is retrieve the incremental log: all your inserts are put into this log, and the data nodes are the ones reading it. They each have their own channel, and they each read data from that channel. The data node takes a request, say "insert this vector," and packs it into a log snapshot. The way we store data is we just say, OK, you inserted this vector, and this vector, and this vector; you compress that, save it as a file, and push it to our storage. The data node also processes the data-mutation requests: if you're deleting or updating, it's the one that pulls up the past data file and changes it.

Next we have the index nodes, which are pretty much just in charge of indexing. They're the workhorses; indexing is usually the most processing-heavy part of this whole pipeline, so those are the ones that need a lot of power. And last, we have the query nodes: they load in the indexes and the raw data and perform the searches. That's pretty much how the system functions.
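The data-node behaviour just described (read inserts from a log channel, buffer them, and seal them into a snapshot file pushed to storage) can be sketched like this. Everything here is invented for illustration: the threshold, the JSON "snapshot" format, the plain list as a log. Real Milvus seals segments by size and time and writes binlogs to object storage.

```python
import json

class DataNode:
    """Toy data node: consumes inserts from its log channel, buffers
    them, and seals the buffer into a snapshot once it is full."""
    def __init__(self, channel, flush_threshold=3):
        self.channel = channel          # the shared insert log
        self.cursor = 0                 # how far into the log we've read
        self.buffer = []
        self.flush_threshold = flush_threshold
        self.storage = []               # stands in for MinIO/S3

    def consume(self):
        # Read everything new on our channel, in log order.
        for entry in self.channel[self.cursor:]:
            self.buffer.append(entry)
            if len(self.buffer) >= self.flush_threshold:
                self.flush()
        self.cursor = len(self.channel)

    def flush(self):
        # Pack the buffered inserts into one serialized snapshot file.
        self.storage.append(json.dumps({"rows": self.buffer}))
        self.buffer = []

log_channel = []            # writers (the proxy) append inserts here
node = DataNode(log_channel, flush_threshold=3)

for vector in ([1, 2], [3, 4], [5, 6], [7, 8]):
    log_channel.append(vector)
node.consume()

print(len(node.storage))    # 1 sealed snapshot holding the first three
print(node.buffer)          # [[7, 8]] still waiting for the next flush
```

Because the node only tracks a cursor into a shared log, crash recovery is just "re-subscribe and keep reading," which is the point of making the workers stateless.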
Those are the ones doing the most work. Then our storage layer is the last layer, and it's pretty much all composed of third-party open source software.

The first piece is the log broker, for which we're using Pulsar; we're planning on supporting Kafka in the future as well. It's in charge of all the data streaming and keeping it consistent in time. It works on a subscription basis: your workers subscribe to a log, and whenever Pulsar pushes a new insert or a new change, they read it from that log. It also guarantees reliable asynchronous queries and provides event notifications, so everyone is listening to Pulsar to know if something happens in the system. It's the central backbone of the entire system.

Then we have etcd, which I believe is also a Linux Foundation open source project, and we're using it for metadata storage. In terms of metadata, we use it for service registration: whenever you scale or start up the system, everyone reports to etcd so that everyone can learn about everything going on. It also runs checks to see if everything's still alive; it's kind of our heartbeat for the system. And it creates checkpoints within the entire system, so if the system crashes you have recovery points: it saves timestamps into the log, so you know where to come back to.

The last one is MinIO, which I went over a little bit already; MinIO is our object storage. We use MinIO because you can run it as a node in a local system, it's pretty easy to use, and it transfers easily to S3, because S3 and MinIO share the same API. It's used for storing the log snapshots, the index files, and also intermediate query results, if there are too many for the worker nodes to handle.

Now for real-world use cases; this might be the area that's a little more interesting. First, a quick rundown of our project. As of today, since making these slides, we have over 8,000 commits and 8,000 stars, 148 active contributors, 24 releases, almost half a million Docker Hub downloads, and over a thousand users, and we've been to a lot of meetups and events, over a hundred.

About the users: a thousand enterprise users around the globe, in a lot of different areas, because this can apply in a lot of places. It's used in major banks, online marketplaces, pharmaceutical research companies, real estate agencies, and productivity software companies, just to name a few; I picked these because they cover a lot of the technology landscape. The common themes we see with everyone are: they have large-scale datasets, usually 10 million entities and above; their existing solutions are too slow, often because they're using relational databases for something you shouldn't really be using a relational database for; and they've decided to reduce hardware costs, because saving money is always nice.

The first case I'm going to go over is smart property search, done by Compass, a major US real estate company. They deal with renting, selling, and buying real estate assets, and they wanted a better recommendation system, because the one they were using was really slow (they didn't tell us which one, but I think it was just a relational database), and they wanted to speed it up for user satisfaction and to get better results. They were previously using just traditional search engines, which were too slow.

To go over what they did: they took all their data, say an image of a floor plan, the exact outlines of the floor plan, and embedded it into vectors using their own neural nets. There was the area, the outline of the house, maybe the direction the house was pointing, the location of the house, town-wise, and they embedded all these things into vectors. For each feature they created a collection; in our terminology, think of a collection as a table. They'd each be a different collection because these features might be different lengths: one could be 512 dimensions and another 128, and you can't store those in the same collection, because you need vectors of the same size, otherwise there's no way to compare them.

So they stored each feature in its own collection, and then they performed searches. If a user came in wanting an area of 500 square meters, in this location, with this outline (maybe they drew it in), that would all be embedded into vectors, and after searching you'd get results from every collection. Since the results are all distances, they could apply their own ranking algorithm: if someone cared more about one feature, say the area, than another, that result could be weighted higher when combining everything. That's the system they made. It took only six months from the idea to being all the way in production, and they saw higher customer satisfaction, which ultimately means they were making more money, it seems. And for future work, they had a few cool ideas.
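The weighted combination of per-collection results just described might look like the following sketch. All the numbers, feature names, and listing IDs are invented; the point is only that because every collection returns distances, a per-user weighting can re-rank the merged results.

```python
# Each "collection" returns (listing_id, distance) pairs for one feature.
# Smaller distance = more similar. Hypothetical results for one query:
results_by_feature = {
    "floor_plan": {"house_a": 0.10, "house_b": 0.40, "house_c": 0.25},
    "location":   {"house_a": 0.30, "house_b": 0.05, "house_c": 0.20},
    "area":       {"house_a": 0.20, "house_b": 0.15, "house_c": 0.50},
}

# A user who cares most about location weights that feature higher.
weights = {"floor_plan": 0.2, "location": 0.6, "area": 0.2}

def fuse(results_by_feature, weights):
    scores = {}
    for feature, hits in results_by_feature.items():
        for listing, dist in hits.items():
            scores[listing] = scores.get(listing, 0.0) + weights[feature] * dist
    # Rank by combined weighted distance, best (smallest) first.
    return sorted(scores.items(), key=lambda kv: kv[1])

ranking = fuse(results_by_feature, weights)
print(ranking)  # house_b first: its strong location match dominates
```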
Since they're already dealing with unstructured data, the idea was that they could use pictures, videos, voices, and text. One idea was: say you have a nice view. You can take a photo of the view and embed that into a vector, and if someone's looking for a house with a view of the ocean, you could have that in the system and find similar results, or mountains in the distance, or whatever kind of view they were looking for. Or the surrounding sound: if it's loud, you could find similar results for that too. They're also working on an AI chatbot, which doesn't really have to do with smart property search, but it's another use case for Milvus and vector similarity search.

The next case is product recommendation, done by Tokopedia, the largest e-commerce platform in Indonesia. They were previously using Elasticsearch for all their search recommendations, and the way they were using it was raw keyword counts, which nowadays isn't really as good, because we've come a long way with semantic word embeddings. For those of you who aren't familiar, semantic word embeddings are a way of taking your words and grouping words that are similar, but not exactly the same, into a similar category so that they have similar vector values. Think of it as a thesaurus: you look up a word and it gives you words that mean the same thing but are spelled completely differently. That's what they wanted to do. So they took their products, their descriptions, categories, and labels, converted them all to embeddings using word2vec, which I believe is one of the more popular word-embedding encoders, and threw them into Milvus. Then, when someone came in wanting to search for an item, the query would get embedded into a vector, they'd search with it, and you'd get the best results based on the semantics of the words, not just the keyword count: it wasn't just counting how many times the description said "car." The results were pretty great: a 10x improvement in query performance, a 90% reduction in hardware costs, and a 10x improvement in search accuracy. For future work, since they're already almost there with embeddings, it's probably personalized advertising.

The next one is in a little bit of a different area: pharmaceutical molecular analysis, done by a leading international pharmaceutical company. They were researching how drugs interact, trying to find certain interactions, and they had a dataset of 800 million molecules. Say they have a new molecule and want to see if it works in a certain reaction: it could be hard or expensive to test that reaction. So what they wanted to do first was check whether there's a similar molecule they've already run that test on. If it worked there, that gives them a reason to run the test, because there's a higher chance it will work; if not, they can be pretty confident this might not be the solution and skip it, ultimately saving a lot of time. Previously they were using a Spark cluster, and it took around 10 minutes to search through the 800 million molecules using 30 nodes; we can pretty much assume from this that it was just a brute-force search, and nobody ever wants that.

So they decided to use Milvus, and the way they did it is you take the molecular formula (every molecule can be described in a standardized way of writing it out) and run it through a package called RDKit, a chemical analysis package (I think you have to install it using conda). It's pretty cool: it transforms these molecular formulas into binary fingerprints, so binary vectors, and luckily we support binary vectors. They threw those into Milvus, and when dealing with molecules you generally use the Tanimoto similarity. In the previous examples I talked about Euclidean distance and cosine similarity; Tanimoto is another one, equivalent to the Jaccard distance when you're using it on binary values, and that's what they ended up using. It worked pretty well: you'd perform a search and it would show you the closest results. It had a 1,200x speedup, going from 10 minutes to half a second, and this was done on one node, so they could cut a lot of the hardware costs. For future work, they want to work on pharmacological analysis, which is studying how a chemical or drug interacts with your organs and your whole human system, and on molecular activity prediction, which is, instead of having to test the molecule, can they predict what's going to happen? They were thinking they could use Milvus in those use cases too.

Second to last is probably the most popular thing in terms of similarity search; I think everyone knows this one, and it's reverse image search. Google has something similar, where you can plug in a photo, search it, and find photos that are similar. The background for this one: it was the Cleveland Museum of Art. They wanted to put more technology into their exhibits, which I thought was pretty cool. I didn't know it's one of the largest museums in the United States; I believe it's the second largest. They were combining technology and art to increase people's interest, so they're doing a lot of things with virtual reality; they have exhibits where you can go in with virtual reality headsets and walk around. And what they did with Milvus is they wanted physical interactions.
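Stepping back to the molecule example for a moment: the Tanimoto similarity on binary fingerprints is the same as the Jaccard coefficient of the set bits, and it's simple to compute. The fingerprints below are toy 8-bit examples; real RDKit fingerprints are far longer (e.g. 2048 bits).

```python
def tanimoto(a, b):
    """Tanimoto similarity of two equal-length binary vectors:
    |intersection| / |union| of the set bits (Jaccard coefficient)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

# Toy 8-bit "fingerprints" for illustration only.
query     = [1, 0, 1, 1, 0, 0, 1, 0]
molecule1 = [1, 0, 1, 0, 0, 0, 1, 0]  # shares 3 of the query's 4 set bits
molecule2 = [0, 1, 0, 0, 1, 1, 0, 1]  # shares none of them

print(tanimoto(query, molecule1))  # 0.75
print(tanimoto(query, molecule2))  # 0.0
```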
So ultimately what they were doing is they were creating a art search engine So in person you could strike poses right next to like a figure and let's say it was a sculpture And you could tell how similar your poses to that sculpture So this was something that they did in person and then online they put a reverse image search for their entire art collection The ones that they could legally Were allowed to kind of post online in terms of infringing any rules for art But so for this example I kind of pulled up the site and I threw in the Golden Gate and what it gave back was a sketch of a bridge I think in New York But this is the system that they were making and this is a pretty common pipeline for this type of thing The way it works is you take your photo you throw it in our case We recommend people to use two neural nets The first one is a yellow net and what yellow nets really good at is pulling out objects from images Let's say you have a video stream and you want to get all the images of the bikes out of it This is what yellow nets good for because it gives you a bounding box for everything inside the image So what you do is you throw your original image through yellow net It pulls out all the key feet like key images in there or things that are areas of interest And you throw that through resonant for the actual embedding and for resonant and a lot of these models What you have to do is they're usually trained on classification and with classification the last layer you lose all your data Because the last layer you're just combining it all to see if it's a dog or if it's a cat or if it's like some bound object With the layer before that that's where you have your embeddings and that's where all the data is stored So you take off the last layer of resonant and that's how you get your embedding You put that into Milvus and then similar pipeline you throw what you're searching for through those neural nets And you can search for it and give the closest result The 
The last one, as you can see, follows a very similar process for incorporating this: it's always take your data, encode it, and put it into Milvus. But this one was for Huawei. They were working on a music recommendation engine for their phone app. They didn't have an existing system for this use case, they needed something fast that could deal with multi-million-scale data, and what they ended up using was Milvus; it's currently in pre-production. The way they did it, and the way you generally work with music, is you first take the track you want to insert and separate it into the background music and the vocals, which is usually done with audio source-separation tools. This matters because with vocals your neural net might latch onto the sound of the voice, or it might latch onto what the person is singing; you don't really know, and that's an issue because there are a lot of cover songs. So vocals are usually not a good signal for finding similar songs. The background music is more unique to the song and, I believe, easier for the system to understand, but with neural nets you never know, it's a black box. In any case, they got the best results by using the background music.
They were using a convolutional neural net, a one-dimensional one I think; there are a lot of other networks nowadays. From that you get your vector embeddings out of the audio, throw those into Milvus, and then you can perform the same thing again: you search with a song you've been listening to and it finds similar results. The interesting thing about music recommendation is that you probably don't want the exact same song, you want something that's similar but a bit different, because if you just return the list of the most similar songs it's probably going to be the same exact beat; a lot of songs share beats. So what they did was, say, look up the 70 closest results and then reverse that list, because you want something similar but not exactly the same, and that's what they decided to do. And those are pretty much the use cases I wanted to go over, because it reaches into every field: audio, vision, text, pharmaceutical, it covers all the bases. There are a lot of people running different setups of these systems with different pipelines, and they're doing really cool stuff. So, yeah, that's my presentation.
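The reverse-the-top-k trick can be sketched like this; `recommend()` and the toy track vectors are hypothetical, and in the real system the ranking would come from a Milvus similarity search rather than a local scan:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def recommend(tracks, query_vec, k=70):
    # Rank every track by similarity to the query, keep the k closest,
    # then reverse that slice: the goal is "similar, but not the most
    # similar", so near-duplicates and shared beats land at the bottom.
    ranked = sorted(tracks, key=lambda t: cosine(tracks[t], query_vec),
                    reverse=True)
    return list(reversed(ranked[:k]))

# Hypothetical embeddings of three tracks' background music.
tracks = {
    "same_beat_cover": [1.0, 0.0, 0.0],
    "similar_groove":  [0.8, 0.5, 0.1],
    "different_genre": [0.0, 0.1, 1.0],
}
# With k=2, the near-identical track is kept but pushed to the end.
print(recommend(tracks, [1.0, 0.0, 0.0], k=2))
# → ['similar_groove', 'same_beat_cover']
```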
Here are some resources: we have our web page, our GitHub, our Twitter, and our Medium blog. We're also participating in Hacktoberfest, which is a cool program to get people more involved with open source; it encourages new users to pick up easy issues, and we're offering some cool rewards if anyone's interested in that. And yeah, if anyone has any questions, there might be a few, but let me know.

Sure. Yeah, so that's up to the user. If you're running on S3, you pretty much store all your data in S3. You can run a MinIO node in the cluster and give it its own storage, but if you're on S3 it's worth just replacing it, because the API is the same: you get rid of the MinIO node and put the S3 address in the Helm values instead, and that's how you get around it.

So we're working on SSL support, that's one of the things on the docket. Same for the SDKs: we had them in 1.0, but we're adapting them for 2.0 because we had some issues. The whole 2.0 product, the one that's mainly cloud-oriented, is still in release candidates, so we're fixing up some bugs and working out what features to improve and what to implement.

Sure. So the main point is the actual indexing: none of those systems really support this kind of indexing. There is multi-column indexing, but that won't work the same way, because it first has to match on the first column before it considers the rest, and with vectors the first dimension of two similar vectors might not be similar at all. So in a multi-column index, I believe, if you don't match the first column it skips the rest, and that's where a big problem comes in. Elasticsearch is trying to solve that; they're doing some dense vector indexing.
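As a rough illustration of the MinIO-for-S3 swap described in the storage answer above, the Helm override might look something like this. The key names (`minio.enabled`, `externalS3.*`) are my recollection of the milvus-helm chart and may differ between chart versions, and the endpoint and bucket are placeholders, so treat this as a sketch and check the chart's own values.yaml:

```yaml
# values.yaml override for the Milvus Helm chart (key names are
# assumptions -- verify against your chart version before using).
minio:
  enabled: false                       # drop the in-cluster MinIO node

externalS3:
  enabled: true
  host: "s3.us-west-2.amazonaws.com"   # hypothetical region endpoint
  port: 443
  useSSL: true
  bucketName: "my-milvus-bucket"       # hypothetical bucket name
  accessKey: "<AWS_ACCESS_KEY_ID>"
  secretKey: "<AWS_SECRET_ACCESS_KEY>"
```

Applied with something along the lines of `helm install milvus milvus/milvus -f values.yaml`.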
I think they're limited to vectors of size 2048. But that's the big problem with these older databases: you end up resorting to pretty much brute-force searches across the data, and that's never good if you're dealing with multi-million-entry datasets. Any other questions? I guess not. Thanks everyone for joining me on this; it ran a little short, but I think I covered most of the info. Have a good one.