Hello, my name is Prasanna. I am a research software engineer in the Visual Geometry Group at the University of Oxford. In this talk, I am going to present our ongoing project, the WISE Image Search Engine, along with my colleague Horace.

Metadata is not always enough when you want to search a collection of millions of images. Consider Wikimedia Commons, one of the largest openly available collections of media on the internet. Say you want to search for a penguin in this collection. You get around 20,000 results of cute penguins. But if you try to narrow it down to find a penguin with wings raised, you get only eight results, of which only one is relevant. Does that mean there are no penguins with wings raised in this collection? Here is an image of a penguin with its wings raised from Wikimedia Commons. Though this image is relevant to our search query, it was not returned in the results because the metadata, as you can see here, does not contain keywords that match the query: the words "wings raised" do not appear in it. So how do we find this image, and others of penguins with their wings raised? How do we find images that are invisible to the search engine because the metadata is missing, incorrect, non-exhaustive, or in a different language?

To help tackle this problem, we introduce WISE, an open-source, AI-based image search engine. Let me show you a quick demo of WISE running on a 35-million-image subset of the Wikimedia Commons collection. We can not only retrieve images for queries like "a car", but we can also be more specific, for example "a red car", or even "a sports car". We are able to find all these relevant images even if the metadata, as you can see here, doesn't contain the keywords. That's because in WISE, the search is based on the content of the image: the images whose content most closely matches the search query are ranked higher in the results. We can freely describe what we want using a complex search query, for example "person riding a bike or a horse", and we can even go further by making the horse jump. So when we search for the query "penguin with wings raised", we are able to find them.

Let me show you a couple of use cases for this content-based search engine. We can use the search engine with multiple text-based prompts to curate images for training a classifier, and then use that classifier to categorize new images as they are uploaded. Here I show an example of the query "pages with an illustration of a tree", used to train a tree-illustration classifier for book pages, but it could be anything. We could likewise build a content classifier to find explicit and not-safe-for-work images to help with content moderation. We don't always have to describe what we want in words: we can also upload an image whose content represents what we want to search for. Here I am showing a screenshot of the results the search engine returns when I upload this image of a cliff. We are also working on adding the ability to search by combining both text- and image-based queries. My colleague Horace will now dive into how the search engine works and share some of the details behind the implementation.

Hi, I'm Horace, and I will now explain how WISE works. WISE uses vision-language models, which map images and text into the same feature space, allowing users to search images using natural language.
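To make that concrete, here is a minimal sketch (not the actual WISE code) of embedding one image and one text query into that shared space using the open_clip package; the checkpoint tag and file name are illustrative assumptions:

    import torch
    import open_clip
    from PIL import Image

    # Load an OpenCLIP ViT-L/14 model; the checkpoint tag below is one of
    # the publicly available LAION-2B checkpoints (an assumption, not
    # necessarily the exact one used by WISE).
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-L-14", pretrained="laion2b_s32b_b82k")
    tokenizer = open_clip.get_tokenizer("ViT-L-14")

    image = preprocess(Image.open("penguin.jpg")).unsqueeze(0)  # hypothetical file
    text = tokenizer(["a penguin with wings raised"])

    with torch.no_grad():
        image_vec = model.encode_image(image)
        text_vec = model.encode_text(text)

    # L2-normalize so the dot product equals cosine similarity.
    image_vec = image_vec / image_vec.norm(dim=-1, keepdim=True)
    text_vec = text_vec / text_vec.norm(dim=-1, keepdim=True)
    print((image_vec @ text_vec.T).item())  # higher = better match

The higher this similarity score, the better the text describes the image; this single comparison is the basic building block of the whole search engine.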
One of the ways in which these vision-language models are trained is by taking pairs of images and their corresponding captions, as shown here, and training an image encoder and a text encoder to map the images and text to feature vectors, such that the vectors of matching image-text pairs are close together in feature space and the vectors of non-matching pairs are far apart. In WISE, we use the OpenCLIP ViT-L/14 model, a vision transformer trained on the LAION-2B dataset of 2 billion image-text pairs.

To use vision-language models to perform searches, we first take all the images in a given collection and use the image encoder to map them into a high-dimensional feature vector space, so that each image is represented by a vector. Then we take the user's text query, such as "person riding a horse", and encode it into the same vector space using the text encoder. Lastly, we find the images whose vectors are closest to the query vector under cosine distance, and these images are returned to the user as the search results.

In terms of implementation, for the demo we showed earlier with the Wikimedia Commons images, we have indexed around 35 million images in total. As this is an ongoing project, we expect to continue adding more images over time. The features were extracted using the OpenCLIP pre-trained model, though WISE also lets users plug in other models in their own projects. The index of feature vectors occupies around 100 gigabytes of storage. In addition, we use approximate nearest-neighbor search via the Faiss (Facebook AI Similarity Search) library, which uses techniques like clustering to improve search speed. As a result, we achieve search times of under one second on the server side.

To quantitatively evaluate the retrieval performance of WISE, we constructed a custom benchmark: we manually reviewed the search results returned by both WISE and the Wikimedia Commons search engine. For example, here we show the top eight search results returned by the Wikimedia Commons search engine for the query "pencil drawing of a lion". We then used an annotation tool to mark whether each image was correct or not given the search query. Using that, we can plot the precision curve, showing the value of precision for different values of top-k. Here we show the precision curves for nine different queries. As you can see, the WISE search engine has higher precision overall for these queries compared to the Wikimedia Commons search engine.

One of the limitations of WISE is that the underlying model is biased by its training data. In addition, there are safety issues, such as the model returning unsafe results. We are working on different ways to mitigate these issues, such as maintaining a list of unsafe queries that should be blocked, and hiding images classified as offensive or not safe for work behind a warning.

WISE will be published with a permissive open-source license, allowing it to be used in both academic research and commercial products. VGG will develop, maintain, and support WISE at least until 2025. Thank you for watching.
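For reference, the indexing-and-search pipeline described in the talk can be sketched roughly as follows with Faiss; the file names, cluster count, and nprobe value are illustrative assumptions rather than WISE's exact configuration:

    import faiss
    import numpy as np

    d = 768  # ViT-L/14 feature dimension

    # Suppose `features` holds the L2-normalized image embeddings for the
    # whole collection, one row per image (the file name is hypothetical).
    features = np.load("image_features.npy").astype("float32")

    # IVF index: vectors are grouped into clusters, and a query only scans
    # a few clusters -- the clustering speed-up mentioned in the talk.
    # With normalized vectors, inner product equals cosine similarity.
    nlist = 4096  # number of clusters (a tuning choice)
    quantizer = faiss.IndexFlatIP(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(features)
    index.add(features)

    # Encode and normalize the text query with the text encoder, as in the
    # earlier sketch, then retrieve the top-k closest images.
    query = np.load("query_feature.npy").astype("float32").reshape(1, d)
    index.nprobe = 64  # clusters visited per query (a tuning choice)
    scores, ids = index.search(query, 10)  # ids of the 10 best-matching images

Scanning only nprobe of the nlist clusters, instead of comparing against all 35 million vectors, is what makes sub-second server-side search feasible.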
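Likewise, the precision curves used in the evaluation come down to computing precision at different cut-offs k over the manually annotated results; a small sketch, with hypothetical annotations:

    import numpy as np

    def precision_at_k(relevance, ks):
        # relevance[i] is 1 if the i-th ranked result was marked correct
        # during manual annotation, else 0.
        relevance = np.asarray(relevance, dtype=float)
        return {k: relevance[:k].mean() for k in ks}

    # Hypothetical annotations for one query's top-8 results.
    print(precision_at_k([1, 1, 0, 1, 1, 1, 0, 1], ks=[1, 4, 8]))
    # {1: 1.0, 4: 0.75, 8: 0.75}

Plotting these values for a range of k, averaged or shown per query, gives the precision curves compared in the talk.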