Hi, everyone, welcome to my presentation of the vector search engine Weaviate. I'm Laura Ham, a Community Solution Engineer at SeMI Technologies, and I feel honored to speak today at the Open Source Summit. First of all, I'm a big fan of open source projects and communities, so I'm glad to share with you the project that we're working on. Of course, I would have liked to present this project at the physical event in Seattle, but I'm happy that we at least have the opportunity to do this remotely. I'm looking forward to meeting you all remotely, hearing about your projects, and hearing what you think about Weaviate. You can reach out to me in the online environment of this event, but you can also find me on LinkedIn, send me an email, or join the Weaviate Slack channel by scanning this QR code. In this presentation, I will introduce you to our open source vector search engine Weaviate, and I will structure it as follows. First, I will introduce you to unstructured data and the problems and opportunities that come with it. Second, I will tell you how these problems can be solved with a vector database. A vector database is a relatively new concept, so I will explain in detail what it means. Then, since Weaviate is of course a vector database, I will tell you what Weaviate is and show its features with some live demos. After that, I will talk about the machine learning models that Weaviate can use, which can be any machine learning model; you can also use Weaviate to scale your own machine learning models. Then I will go a little bit in depth into the open source aspect of our project and how you can get involved. And finally, I will briefly say how you can get started with Weaviate and how you can get involved in the community. OK, so first, let's look at data, and particularly at unstructured data.
So unstructured data are forms of data that are not organized in a predefined manner. Take, for example, big pieces of text: documents like scientific papers or news articles, or product descriptions, reviews on your website, and so on. We learned that actually 93% of data is unstructured and that 80% of businesses don't even know how to leverage unstructured data. So what's actually so difficult about unstructured data? One thing we know is that searching through unstructured data is extremely difficult. You can usually only find a piece of text if you search with a specific keyword and it matches that keyword exactly, or you have to define tags, and so on. And we found a source that says that data analysts spend 80% of their working time preparing data to answer business questions. That means they are organizing unstructured data or tagging unstructured texts, for example. For us, that's a shockingly high amount of time that data analysts need to spend before they can actually start answering questions with the data. So let me give you a simple example of searching through unstructured data. If you want information from unstructured text, you will need to use exact matching keywords, as I mentioned before, to find a result. For example, here we have a data set with different kinds of wines. Say this wine has a name and a description, and the description mentions that it is good with fish. Now, if you have a traditional search engine and you want to find a wine that fits with your seafood dinner, this wine will not appear in the results, because seafood is not the same word as fish. There are no exact matches, so the result is that maybe no products are found.
Now, if you have a vector search engine, on the other hand, which has a model behind it that is trained on language and builds a context around language, it can find this wine, because it knows that seafood is actually close to fish. And compare this with Google search. If you ask Google a fairly abstract question, it can still find an answer. The question in this example, what color of wine is Chardonnay, is a bit abstract, and still Google extracts the exact answer from millions or billions of results. It finds the exact node in its huge knowledge graph which has the answer. So now the question is: how did Google actually find this answer, and how does it find it so fast? And then the question that also comes to mind is: how can we do this on our own data? Google is really good at this on open data, of course, but you can't use Google's algorithm on your private data. So the main question is: what if you could do the same thing with your data, in a simple and secure way? The answer that we came up with is Weaviate. Weaviate is a database that uses machine learning to build a context around data and find answers based on the semantics, the context. Weaviate is a cloud-native, modular, real-time vector search engine that is built to scale your machine learning models. So now let's go a bit deeper into Weaviate, and specifically into what vector search and vector storage are. Weaviate stores data as vectors, and these vectors are placed in a space in relation to other data objects. Machine learning models can be used to vectorize data objects: a model computes a vector for each individual data object and places it in context, and this is the actual database. So really, Weaviate tries to understand your data rather than just save it. In more detail, this is how Weaviate works with a text vectorization model.
A pre-trained model, in this case a BERT natural language processing model, a transformer model which you can find for example on the Hugging Face website, can be used to compute vectors from texts. So you can build an index of, for example, all the texts of the English language, which you see on the right here. What you can do next is add your own data objects, which in this case also contain text. We will see later that it also works with images and other types of media, but in this example, we have text. Weaviate uses this same BERT model to place your data object, which contains text, into context. So for example, if we add the Chardonnay we had before, the model knows that it's closely related to wine and white wine and to fish, the food that it fits with. Then, once you have a database in Weaviate with all your data indexed as vectors, you can perform search queries in natural language, which will also be vectorized and placed into context. After that, Weaviate will perform a nearest-neighbor search and return the vectors, thus the objects, that are closest to your search query. So in short, Weaviate has a pre-trained machine learning model attached, which can be anything. You add your own data, which is then processed by the machine learning model and placed into context. And if you perform a search query, for example if you want to find a wine that fits with your seafood, you can do this in natural language, and Weaviate will return a list of results ordered by nearest neighbor. So now that you know the basics of how vectorization works in Weaviate, let's go over some features. We already know that Weaviate is a vector database and a search engine, and it has full CRUD support through its REST API, as well as a GraphQL API to make queries.
So data is stored as vectors, which are long arrays of numbers, also known as coordinates in a high-dimensional space, and this allows for context-based search and also automatic classification. So let's go to a demo, first just to show what I presented in the example, to see it in real life. If we go to our website and the documentation, we have a RESTful API reference and a GraphQL reference, with plenty of examples of how the GraphQL language works with Weaviate. And we always have a running demo data set. For this website, it is a demo data set of news articles; there are a few thousand news articles from the past few months in there, and you can try some queries on that. So over here is our interactive environment, which runs on the demo data set, and you can write GraphQL queries in it. Let's start with a Get query to see what kind of news articles we have in there. Here I'm asking for three different properties of each article that I want to see. Now, here on the right, we have a list of results. Let's check the first one: it is an article with a title, a URL, and the number of words. You can also see that the actual text is in there, so it is a bit bigger data value, and this is basically the piece of unstructured text. That's a very trivial query for now. Next to that, we can see that there's actually a vector stored per data object. Here you see a list with one vector: these are the few hundred coordinates of this data object. So this is a very basic query, and it doesn't really show the vector database functionality or the context-based search. So we can add a filter to find articles by natural language. Let's do that. We call this filter a nearText filter, because we want to find articles that are near a certain text.
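As a sketch, the basic Get query from the demo looks roughly like this, assuming the news demo schema with an `Article` class that has `title`, `url`, and `wordCount` properties (exact syntax can vary between Weaviate versions):

```graphql
{
  Get {
    Article(limit: 1) {
      title
      url
      wordCount
      _additional {
        # the stored embedding for this object: an array of a few hundred floats
        vector
      }
    }
  }
}
```

The `_additional` block is how Weaviate's GraphQL API exposes metadata, such as the vector, alongside the object's own properties.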
And here I can type in my natural language query. So let's find articles that are close to "housing prices". I get back a list of results ordered by how applicable they are to the search query. For example, the first one is "A history of how housing prices became the world's biggest asset class", then "Housing prices are going ballistic", then "Low mortgage rates don't help if houses are too expensive". You can see that, for example, this third result doesn't exactly match the search query, but still, expensive relates to money and prices, and houses is almost the same as housing. So you don't have to search by exact matching keywords; it works based on semantics. And we can ask how certain the machine learning model was that a result actually is an article that fits this search. So we can ask for the certainty, and we can see that for this first article the machine learning model is 87% certain that it matches your search query. You can also filter by this: you can say, OK, I want a minimal certainty of 85%, and then you only get the results that are higher than this percentage. You can make longer queries as well; it's pure natural language. So if we want to go a little bit more specific, and maybe make the certainty a bit lower, we can search for "prices of houses in Greece". We see we have only one result here that matches the certainty, and it says "Athens housing market revival driven by foreign buyers". You can see here that Weaviate kind of knows that Athens is in Greece; otherwise, this result wouldn't be the first one returned. So this is how vector search works within Weaviate. The second feature, and that's quite a unique one if we look at the landscape of vector databases around, is that Weaviate combines vector search and scalar search. What I just showed you was purely a vector search.
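The nearText search with a certainty threshold from this demo can be sketched like this, again assuming the news demo's `Article` class (the `certainty` value and module support depend on which vectorizer module is enabled):

```graphql
{
  Get {
    Article(
      nearText: {
        concepts: ["housing prices"]
        # only return results the model is at least 85% certain about
        certainty: 0.85
      }
      limit: 5
    ) {
      title
      _additional {
        # how closely this article matches the natural-language query
        certainty
      }
    }
  }
}
```

The query text in `concepts` is vectorized by the same model as the stored data, and results come back ordered by nearest neighbor.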
But scalar search, the more traditional way of searching, like where clauses in SQL, can be combined with this vector search. I can quickly show this to you as well if we go back to the demo. Let's get a few more results so we can play around a bit, and then I can add another filter. For example, let's say we want the word count of an article to be higher than a certain number. We can add a where filter in which we check the word count; we want this greater than, let's say, 1,000. So now I have this vector search, the context-based search, combined with a where filter, which is a scalar search. We see, again, a very fast answer. The first result that we had before is now removed, because it had a word count lower than 1,000, and now we see a result matching the search query that we just entered. If we make the number lower, we will see the previous result again. All right, that was the demo. Then the third feature is that you can also make graph connections between objects. Weaviate is not a pure graph database, but we can still make cross-references between data objects, and I want to show you this as well. So if we build upon this search query, maybe we remove some things first, we can see, for example, in which publication each article is published. Now I'm adding this property, and this property is actually a reference to another data class with objects, namely Publication. So we're asking for articles that are published in a certain publication. Now we can see that this article is published in Financial Times, the second one in CNN, et cetera. And we can also make filters with that. So if we change, for example, our where filter to the publication, and I want the name equal to, let's say, The Economist; ah, I made a mistake here, because I need to match it as a string. And now I see only results that appeared in The Economist.
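The combination shown in this demo, a nearText vector search plus a scalar where filter plus a graph cross-reference, can be sketched roughly as follows; `inPublication` and the `Publication` class are from the news demo schema, and exact filter syntax may differ per Weaviate version:

```graphql
{
  Get {
    Article(
      nearText: { concepts: ["housing prices"] }
      # scalar filter: only articles longer than 1,000 words
      where: {
        path: ["wordCount"]
        operator: GreaterThan
        valueInt: 1000
      }
    ) {
      title
      # follow the cross-reference to the Publication object
      inPublication {
        ... on Publication {
          name
        }
      }
    }
  }
}
```

To filter on the referenced object instead, the `where` path can traverse the reference, for example `path: ["inPublication", "Publication", "name"]` with `operator: Equal` and `valueString: "The Economist"`.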
So this is how graph relations are used in Weaviate. Let's move on. The fourth feature is that Weaviate is very fast: with the RESTful and GraphQL interfaces, you can make very fast queries. We are currently working on horizontal scalability, which will actually be released really soon, so you can use Weaviate at a very big scale, with millions of objects. And then, Weaviate has a modular architecture. This means you can choose any machine learning model to, for example, vectorize your data. That's what I showed before: data objects are stored as vectors, and to make vectors out of a data object, out of text or an image, you use a machine learning model. For example, models trained with fastText, or models that you can find on Hugging Face, like transformers such as BERT, or spaCy models, or ResNet as an image vectorization model, et cetera. You can also extend Weaviate with extra capabilities through other machine learning models. Examples here are a Q&A model, a spell checking model, a named entity recognition model, et cetera. All the modules that I just mentioned come out of the box, but you can also add your own model and use Weaviate to scale it. And it's super nice that the setup is fully customizable, so you can mix and choose your own models. You can choose to have Weaviate running on a BERT model while also using a named entity recognition module and a Q&A module, for example; you can combine it all. And it doesn't require a lot of effort; it's just a few settings in the configuration, or you can generate a Docker Compose file to run it. So in this next demo, I want to show you these modules. Until now, we have only seen the vectorization module for text with its nearText feature, but now let's check some other modules, like the question answering module and the named entity recognition module.
OK, so I want to show you the different modules in another demo. The first one is the question answering module. In the documentation, you can also see how to use it; this is how you use the question answering module in GraphQL, and we can try it in the demo, again on the articles data set. You can use a new search filter, which is called ask, and here we ask the question: who is the king of the Netherlands? You can also specify in which property the model needs to look. If I then press Enter, all the results become visible on the right again, in the additional property answer. Now we see that the machine learning model, the Q&A module, found an answer in this article, and the result is King Willem-Alexander. And if we also ask for the whole summary, we can see that it found the answer in this piece of text over here. So it's really the machine learning model that finds a specific answer in a bigger piece of text, and that's really nice to connect to Weaviate. I also want to show the named entity recognition module. We can make a new query for this. Let's ask for the content property, and then we can ask the named entity recognition module in an additional filter, tokens. I make a filter here that specifies in which property to classify tokens; I want this in content, so I set properties to content, and let's do only one result for now. And let's see the entity, the word as it originally appeared, and we also have the start position and end position. If I run this query, let's first check the content: it's a piece of unstructured text, and in it the module finds a lot of named entities. For example, the word SCOT is recognized as an organization, and so on. And this is all done by a Hugging Face transformer model as well.
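The two module queries from this demo can be sketched roughly like this; the `ask` argument and the `answer` and `tokens` additional properties come from the Q&A and named entity recognition modules respectively, and the exact field names depend on the Weaviate version and the modules that are enabled:

```graphql
# Question answering: the ask filter extracts an answer from a property
{
  Get {
    Article(
      ask: {
        question: "Who is the king of the Netherlands?"
        properties: ["summary"]
      }
      limit: 1
    ) {
      title
      _additional {
        answer {
          result      # e.g. the extracted answer span
          certainty
        }
      }
    }
  }
}

# Named entity recognition: tokens classifies entities found in a property
{
  Get {
    Article(limit: 1) {
      content
      _additional {
        tokens(properties: ["content"], limit: 10) {
          entity         # recognized entity type, e.g. organization
          word           # the original word in the text
          startPosition
          endPosition
        }
      }
    }
  }
}
```

Both queries run against the regular Get API, so module output appears alongside the object's own properties in one response.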
So let's dive a bit deeper into Weaviate's module system. Weaviate has a modular architecture, and you can choose to run Weaviate without any modules, as a pure vector database, or you can enable one or more of the available modules or your own custom modules. A module in Weaviate consists of two parts. The first part is the module itself in Weaviate. This is written in Go, and it connects with Weaviate's internal lifecycle, so it can influence the business logic and the GraphQL API, for example. We saw this previously in the named entity recognition module over here: we have this new piece in GraphQL, and this is written in Weaviate itself. If we go there, we are in the GitHub repository of Weaviate, and we have the different modules in here. The second part is the inference service. This is usually a containerized application which runs the machine learning model, the inference, separate from Weaviate, and it can be in any language, as long as it serves a few API endpoints that Weaviate can consume. In this case, we have an example of the BERT transformer model in Python: you write an API wrapper around it and serve it as a service to be used by Weaviate. In the named entity recognition example, we have a Hugging Face transformer model underneath. I just took, for example, the first model I found to do named entity recognition and wrote a small API script around it to make it a service. It looks, for example, like this: a really small API with four API endpoints.
So there are endpoints to let Weaviate know whether the application, the container, is live and ready, a meta endpoint, and the actual inference endpoint, which uses the pre-trained model that is available on Hugging Face. This is the existing module landscape at the time of recording. As you can see, we have two different text vectorization modules available, and within these modules you can choose any type of transformer model, for example. We have one image vectorization module right now, which is ResNet-50, but you can again connect other models if you want. We have a question answering module and a named entity recognition module, as I showed. We also have a spell checking module, and then you can of course choose to make your own custom module and use Weaviate to scale it. And as I said before, you can also choose to not run any module and use Weaviate as a pure vector storage and search engine. It is also possible to make your own modules. Depending on what kind of module you want to make, you can write only an inference service, so only the second part, for example when you want to use your own vectorizer; or you can also write the Weaviate module itself. Then you are completely free to choose the GraphQL design and so on, but it means you have to write a bit of Go, and actually it's not that hard. I wrote this myself and made a pull request earlier this month with my first code in Go, and I've documented it all in the documentation on the website. So don't be discouraged by the language; this is how I wrote the named entity recognition module. And of course, you can get plenty of help from our community or from us through our Slack channel; I will give you the link later in the presentation again. Now, there are a few things to say about open source. Weaviate is completely open source and will always remain open source. We have a great community.
A lot of people are using Weaviate in production already, and we also see that people connect their own machine learning models, so that's really great. And we have a very active Slack channel, which you can join to learn more about Weaviate and to connect with people who already use it; you can join it by scanning this QR code. The Slack channel is very active, with people asking questions and helping each other, and we as makers of Weaviate are also always there to answer your questions and to receive feedback on the software. What is really cool is that we are still very busy developing Weaviate, and this means that you as a community actually have a great say in how it develops. So you can give feedback, you can open issues with new ideas, you can connect your own custom modules, which we will of course be glad to learn about, et cetera. And of course, you're always welcome to contribute to the source code on GitHub. Finally, I want to give you some ideas on how to get started with Weaviate. On the website, we have a great introduction page with a video which will help you get started, and you can use the quick start, a 10-minute tour of Weaviate in which you can actually run a demo yourself. We have an installation page with a Weaviate configurator, a small application to customize and generate the Docker Compose configuration file, so that's really great if you want to start using Weaviate. We have a lot of videos on our YouTube channel, which is SeMI Technologies, and a lot of tutorials on the website as well, and we also have Google Colabs; this is actually a community member who made a Google Colab with Weaviate. We have four client libraries, in Python, JavaScript, Go and Java, and they support all the existing API endpoints and even more. So it's really easy to get started if you're familiar with Python, for example, because you don't need anything else. And then lastly, the Slack channel.
So we are active on that; we announce new versions there, but you can also get help, of course. To recap, Weaviate is a vector database and vector search engine. With Weaviate, you can do the following tasks: you can search through your data, you can discover answers to specific questions, you can automatically classify new data coming from any other data source, and you can predict or recommend graph relations in your data. And of course, you can extend all these capabilities with more machine learning models. Weaviate can be applied in any industry, because it's industry agnostic; you can choose your own machine learning model to run it with. So we know people that use Weaviate in basically all of the use cases mentioned here, and of course it's not limited to those. Okay, this was my presentation. I hope you liked it, and I hope you learned a lot about the open source vector search engine Weaviate. If you have questions, I'll be happy to answer them in the Q&A after this session, or otherwise you can always send me an email or join our Slack channel for more communication. Okay, enjoy the rest of the Open Source Summit. Bye-bye.