Okay, hello, my name is Brett Halpern. I am the scientific director of the AI Horizons Network, and this is our seminar series. Our speaker today is Hui Hu, who has been researching interactive image search based on natural language feedback at Yorktown. Hui received her PhD in computer science from UNC Charlotte in 2015. She is a vision and image researcher, and without further ado, Hui, I'll hand it over to you.

Thank you, thank you for the introduction. Alright, so I'll go ahead with the talk. In this talk, we focus on the topic of image retrieval, and why are we interested in this? Image retrieval or image search systems are very pervasive in our daily lives, ranging from retail to specialized domains such as medical image analysis. Even though image retrieval has been intensely studied in information retrieval and computer vision, many commercial systems out there still rely heavily on keyword labels or image metadata. The reason for this is that image retrieval is actually a very challenging problem.

Alright, so I'm using this example in fashion search to highlight why image retrieval can be very challenging. Say a user finds an interesting look that inspires her, and she wants to find a similar look from an online store. Using text search, the user has to write down in detail the visual information of this image. Even though you might put down some high-level descriptions, such as "black lace dress," this information can miss important features in the image, such as the intricate lace pattern and the silhouette of the dress. Also, many search engines use keywords to help filter search results. However, these labels are highly generic and shared across a range of images, and therefore it is very hard to narrow down the search results. On top of this, a lot of the visual features that are meaningful to end users can be very ambiguous and subjective. For example, a previous study showed that when certain visual attributes are applied, there is a high level of disagreement among end users.

So with all these challenges in mind, we envision a new interactive image search tool that attempts to address them. Here is how we envision it works. At the starting point of the search, the user might choose to upload a photo or directly input a text query about the desired image. This initial input can help the user obtain a very coarse ranking of the initial results; the technology that can be used here already exists. Now we want to improve upon this. After we have the initial retrieved results, we would like to give the user the ability to provide natural language feedback. In this case, the user might choose to specify a distinctive global color. The results are updated, and the user can iteratively provide input. In this example, the user might say they like the right one, but with a different neckline, and the results can improve over the iterations.

So this is our task. We would like to design a novel interactive image search framework that allows the user to provide natural and flexible natural language feedback. This is a very challenging problem, so we choose a specific domain to test the idea: fashion search. We choose this domain because fashion is a very large business, and it has inspired a lot of interesting work in computer vision.
But the framework we propose in this talk is generic and can be applied to other image domains. So now that we have narrowed down our goal, we would like to design an interactive image search tool that allows for natural language feedback. How is it more beneficial than previous image search tools that allow for interactive feedback?

Interactive feedback for image search has been studied for many years. The most well-known technique utilizes relevance feedback, and here is roughly how it works. Given an initial search result, the user picks some images that are relevant or irrelevant to the desired search item. The system takes this input and re-ranks all the results, the user continues to provide feedback, and the process repeats. The limitation is quite obvious: the user can only give binary feedback, so the information the system can take in is very limited. To address some of these limits, there is more recent work that incorporates semantic information into the feedback. Specifically, it provides a list of attributes that the user can choose from. When users provide feedback, they can pick one of the attributes, such as "open" or "pointy" in this case, and provide a comparison between the desired image and the selected image along the dimension of the selected attribute. The results are then re-ranked, and the process can continue. By introducing these semantic dimensions, the user input is richer than relevance feedback. But this method still requires careful curation of a predefined set of attributes, which can be very challenging. In fact, in our experience, it is very hard to have a well-curated list with good user consensus beyond dozens of attributes.

Now, before we dive into the technical details of how our system is implemented, we can see a demo of our work. We recently showed this demo at the CVPR demo track. At the beginning, the user provides a natural language text query and the system returns the top matched images. Then the user can refer to a specific image and provide relative feedback, and the results are refined at each step.

Having an idea of how this system runs in action, let's look at the problem definition. In our paper, we narrow down the problem to make it simpler so that it can be addressed relatively easily; we make some simplifications. At each round, the agent presents one candidate image to the user. The user looks at the image, compares it to the desired search target, and provides natural language feedback that describes the differences between the two images. The agent takes the feedback and presents a re-ranked result, and this process continues until the maximum number of interaction rounds is reached or the user ends the query.

So here is the network architecture. To simplify, we just show how the agent returns the next round of candidates given the interaction at round t. We call our agent the dialogue manager, which also has access to a retrieval database of images. The input to the dialogue manager at dialogue turn t is the candidate image and the user's feedback on that candidate image. The first component is called the response encoder. Essentially, it combines the representations of the user input and the candidate image to obtain a unified representation. It contains an image encoder and a text encoder, and we simply have a few multi-layer perceptron layers on top of both modalities.
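To make the response encoder concrete, here is a minimal PyTorch-style sketch. The module structure, layer sizes, and the bag-of-words text encoding are illustrative assumptions for this talk write-up, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class ResponseEncoder(nn.Module):
    """Fuses the candidate image and the user's textual feedback into a single
    response representation (illustrative sketch, not the paper's exact model)."""

    def __init__(self, img_feat_dim, vocab_size, embed_dim=256, joint_dim=256):
        super().__init__()
        # Image branch: a small MLP on top of precomputed CNN image features.
        self.img_mlp = nn.Sequential(
            nn.Linear(img_feat_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )
        # Text branch: embed feedback tokens, average them, then a small MLP.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.txt_mlp = nn.Sequential(
            nn.Linear(embed_dim, joint_dim), nn.ReLU(),
            nn.Linear(joint_dim, joint_dim),
        )
        # Fusion: concatenate both modalities and project into a joint space.
        self.fuse = nn.Linear(2 * joint_dim, joint_dim)

    def forward(self, img_feat, feedback_tokens):
        img = self.img_mlp(img_feat)                                   # (B, joint_dim)
        txt = self.txt_mlp(self.embed(feedback_tokens).mean(dim=1))    # (B, joint_dim)
        return self.fuse(torch.cat([img, txt], dim=-1))                # response vector x_t
```

In the actual system, it is this fused vector that is passed on to the next component at each turn.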
The representation we get from this response encoder is called the response representation, which represents the new information the dialogue manager has received at the t-th interaction. Then we would like to update the dialogue history, and for this we have a state tracker, which is essentially based on a recurrent network. Its output is a history representation, which aggregates the information up to dialogue turn t. For all the database images, we also apply the image encoder, so each image is associated with an image representation.

Now, given this network, if it is properly trained, then at test or inference time we can simply take the history representation s_t, take its nearest neighbor from the retrieval database, and return that to the user. But how do we train the network? There are actually two critical issues here. One is that, because this is an iterative process, the dialogue manager needs to get iterative input from the user, and therefore data collection for this system is quite expensive. Second, we are actually maximizing the ranking objective of the retrieval system, which is a non-differentiable goal, and therefore it is hard to optimize with supervised learning. So let's look at how we address these issues.

First, how do we obtain the training data for our dialogue manager? To address this, we look at the user's role in detail. At each round of the interaction, what the user does is look at two images, the target image and the returned candidate (or reference) image, and describe the most obvious or prominent visual differences between the two. When you describe the task this way, it is actually very similar to captioning, but with two images as the input. Inspired by this, we apply an image captioner and use it as a surrogate for real users. This relative captioner can automatically generate sentences describing the differences between two images. This is a new task, so we designed the task and collected a new dataset for it.

We leveraged Amazon Mechanical Turk to collect the data. Here is the data annotation interface. Essentially, we want to situate the annotator in a retail image search setting. Some background history is already provided, showing the interaction between the user and the virtual shopping assistant. The target image is always given to the annotator, and the annotator's role is to look at the returned image and finish the sentence in the text box. Notice that we provide a prefix of the sentence, and the annotator just needs to finish it by writing a natural language phrase. By constraining the input this way, we can avoid casual input that is irrelevant to search.

We applied this annotation interface to a shoes dataset, and in the end we obtained roughly 10K training image pairs and 5K test pairs. For each pair of images we have one relative expression. Here are some examples of the relative expressions. Notice that these sentences can be of different lengths. For some easy pairs, the annotator might simply use one attribute to describe the difference, as in the first example. Interestingly, when two images are very similar, the annotator might choose to give very specific and detailed feedback; in the second example the user says this shoe has holes on the top.
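Before moving on to how this data is used, here is a minimal sketch that makes the state tracker and the nearest-neighbor retrieval step described a moment ago more concrete. The GRU cell, the Euclidean-distance lookup, and the helper name retrieve_next_candidates are illustrative assumptions, not the exact implementation from the paper.

```python
import torch
import torch.nn as nn

class StateTracker(nn.Module):
    """Aggregates per-turn response representations into a dialogue history
    representation s_t (sketch; a GRU cell is assumed here)."""

    def __init__(self, joint_dim=256):
        super().__init__()
        self.rnn = nn.GRUCell(joint_dim, joint_dim)

    def forward(self, response_rep, prev_state):
        # response_rep: fused image+feedback vector for turn t, shape (B, joint_dim)
        # prev_state:   previous history state s_{t-1},        shape (B, joint_dim)
        return self.rnn(response_rep, prev_state)  # new history state s_t

def retrieve_next_candidates(state, db_feats, k=1):
    """At inference time, return indices of the k database images whose encoded
    features are closest to the history representation s_t.
    state:    (joint_dim,) history vector for a single dialogue
    db_feats: (N, joint_dim) precomputed image representations of the database"""
    dists = torch.cdist(state.unsqueeze(0), db_feats).squeeze(0)  # (N,) distances
    return torch.topk(dists, k, largest=False).indices
```

With a properly trained encoder and state tracker, the top-ranked index from this lookup is what gets shown to the user as the next candidate.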
So, given the relative-expression training data we just collected, we trained a captioning model. The captioning model takes as input the concatenation of the features of both images. Here you can see some examples of captions generated by the user simulator. Sometimes the user simulator makes mistakes, just like regular image captioning models, but overall we find that it can serve as a good proxy for real users, and most importantly it allows us to train the network with little to no annotation cost.

Now we describe how we use the user simulator to stand in for the user and train the dialogue manager. How do we actually train the network? We follow two stages of training. First we use supervised pre-training, which provides a good initialization of the model, followed by reinforcement learning, which directly optimizes the ranking objective. For supervised pre-training, we have the history representation s_t; for training we also have the ground-truth image feature, which is the target image feature, and we randomly sample an image feature from the database. We train the network parameters so that the estimated history representation becomes closer and closer to the target feature.

For the second stage, we would like to directly optimize the final objective: the ranking of the target image, the ground-truth image. Here is what we do. After we estimate the history representation, we search for the top-K nearest neighbors in the retrieval database, stochastically sample from these candidates, and continue this stochastic sampling at each round. In the end we reward the system using the final ranking percentile. We could optimize the network in this basic way with policy gradient. However, because we actually have the user model, that is, the environment model given by the user simulator, we can leverage these known dynamics for a more effective reinforcement learning process. Here is what we do. Because we know the user simulator, after we select the top-K nearest neighbors, for each of these selected candidate images we can use look-ahead search and unroll the entire trajectory of the dialogue. For each trajectory we measure the final ranking of the target image, and we select the nearest neighbor that results in the best ranking. Taking this as our optimal action at this dialogue turn, we can update the network using a cross-entropy loss against that action. In our experiments we find that this method produces lower variance and converges faster than the conventional policy gradient method.

So here are some quantitative results. There are two sets of results, represented by the solid lines and the dashed lines. Let's look at the solid lines first. These are the comparisons between different optimization methods. The blue line is pre-training using supervised learning. It gives pretty good results, but it is still lower than the two reinforcement-learning-based methods. The red one is a recent reinforcement learning method originally proposed for image captioning; it achieved really good results when it was published. The green line is our final algorithm. As you can see, it produces the best ranking performance. The second set of results is a comparison to relative-attribute-based feedback. We use a rule-based feedback mechanism to provide the attribute feedback.
The subscripts you can see here, 1, 3, and 10, represent how many attributes are given feedback at each round of interaction, and "deep" represents relative attributes computed using deep features. As you can see, there is a pretty large gap between attribute-based feedback and natural-language-based feedback. In summary, reinforcement-learning-based optimization of our dialogue manager results in improved retrieval performance, and natural-language-based feedback is more effective than attribute-based feedback.

So far we have talked about the work we did roughly last year. We felt this is a very interesting problem, and this year we did more follow-up work on it. One of the directions we took is to study how we can improve performance when additional knowledge about the images is available to us. Remember that in our first work, all the supervision came from users' feedback. But in many situations, for example in the fashion domain, there is other side information that could be incorporated to help with learning. Inspired by this, we leverage visual attributes derived from text. Specifically, for Amazon product images, we scraped the website and obtained text metadata, and we used a list of attributes to filter it and obtain fashion-related labels. We augmented the relative image caption dataset with these attribute labels and presented a new dataset, Fashion IQ.

Here are some examples from this new dataset. On the left, we provide some linguistic analysis of the relative expressions. As you can see, the composition of the content and the syntax of the relative captions are very rich. About a third of them contain comparative references between the target and the candidate image, roughly 20% contain both direct and comparative references to visual attributes, and most of the captions contain more than one attribute in a single sentence. Compared to the shoes dataset we used in the first paper, the Fashion IQ dataset contains five times more relative captions, and it contains side information for real product images. Because attribute labels for fashion apparel are much easier to obtain, this new dataset also allows investigation of using natural language feedback in conjunction with attribute information for training. The table here shows some statistics about the new dataset, and here are a few keywords we discovered from it. As you can see, it is very rich and entirely open-ended.

In summary, in this work we investigated natural-language-based user feedback for interactive image retrieval, and we showed evidence that this is indeed a more natural and effective way to perform image search. But from this experience we also realized that this is a very challenging problem and we are just scratching the surface. There are quite a few directions we are interested in working on right now to improve this work. The first one is the data issue. As we said, the user simulator provides a very low-cost way to train the model. However, the user simulator itself is also a challenging problem: it makes mistakes, and it does not take into account users' personal preferences, vocabulary, or fashion expertise. So one direction is how to actually improve the performance of the user simulator.
Second, under the current problem definition, the user communicates by providing natural language feedback and the agent only interacts using images. It might be very beneficial if the agent could also choose to ask clarification questions and actively seek out useful attributes to obtain feedback on. Also, in our new work we experimented with incorporating side information, applying it to the Fashion IQ dataset, and we found that it is indeed beneficial to incorporate side information for retrieval. But we are just starting this work, and we think there are probably more interesting and more effective ways to incorporate the side information than using an attention mechanism. Finally, we think this framework can be effective beyond the fashion domain, and another direction we are looking at is extending it to the natural image domain. We think this is a very interesting direction and we really encourage people to join us in solving these challenges.

Here are some resources we are sharing. The first is the Fashion IQ dataset and challenge; we are hosting the challenge at an ICCV 2019 workshop and the deadline is in September. Also, if you are interested in running our code, here is the URL. And if you are interested in the general problem setting of applying natural language to image or visual information retrieval, we are also hosting a workshop at ICCV. Okay, thank you very much.

I found that very clear, and I'm not a vision researcher, so I thought it was very well presented. Thank you. So, time for questions. If you would like to ask a question, please unmute; it's the little red microphone button at the bottom of your screen when you mouse over it. IBMers, please remember this is an open talk, so avoid confidential questions. Any questions for Hui? While people are figuring that out, I'll ask one. I understand that the underlying mechanism takes a lot of work and clearly this is a multi-year effort. At some point you need to do user studies, though, as to whether people like interacting in this way, or whether there is an HCI aspect they don't feel comfortable with. So do you have plans for a user study as well?

Yeah, so we have some limited experience with user studies. We have not done very extensive work on that, such as writing a paper on it. But as I said, in our first paper we did do a small-scale internal user study comparing the attribute-based method and the proposed natural-language-based method, and the differences are quite obvious. Second, we also showed the demo at the CVPR demo track, so we had a constant stream of audience members who came and played with it. In general I feel that people like this new way of interacting with image search. But of course, to actually make it real-world applicable, we need work both on the model side and on the UI design.

Okay. Are there any other questions? Please unmute and feel free to ask. It looks like not. So, Hui, thank you again very, very much for the presentation, and thanks to the attendees. This will be the last AI Horizons seminar for the summer, so enjoy July and at least part of August. We'll start up again in the fall as the students get back to school. So thank you, Hui, and thank you to all the attendees. Thank you, Brett.