Hello everyone, thank you for joining me today. In this presentation, I will discuss why we are doing this work, the kind of authoring tools we are targeting, and our approach to the task. My name is Jheng-Hong Yang, and I am a PhD student at the University of Waterloo. This is joint work with Carlos, Rafael, and Stéphane from Naver Labs Europe during my internship, and Krishna and Miriam also joined the project during the development phase. Finally, Jimmy is my thesis advisor. It is our pleasure to share this work with you. Without further ado, let's begin.

First, some background for this work. As we all know, multimedia content has become increasingly important in our daily lives, and it is essential for creators to produce high-quality content that is relevant and engaging to their audience. One of the main reasons we initiated this project is the lack of satisfying image-text search systems and benchmarks. Specifically, existing benchmarks such as MS-COCO or Flickr30k have oversimplified settings that may not accurately reflect the complexity of the real-world task. For example, their captions are generally generic descriptions that use words such as "man", "woman", or "thing" instead of specific details such as names or locations in a city. This can introduce noise during evaluation, as in the figure on the right-hand side, which is from the MS-COCO dataset: multiple captions can match the same image, and multiple images can match the same caption, simply because of under-specified details. In addition, for multimedia content involving well-known public figures or landmark locations, as in media articles, we need a better benchmark.

The authoring tools we are targeting are image-text matching systems used to search for relevant images or text in a large-scale corpus. An IR system can take natural-language queries, keywords, or article contents, while a recommender system can use more diverse data sources such as geolocations, editing history, and so on. But we believe there is a shared common interest in the relevance annotations, and that is why we initiated this project.

And what is our approach? Our approach is based on the Cranfield-style task setting, which involves queries, a corpus, and relevance judgments. We are aware of the Wikimedia project that implements image-suggestion functions; we see that as more of a top-down approach, whereas ours is more of a bottom-up approach. So what are the pros and cons of a Cranfield-style task? The main advantage is that we can compare different systems and techniques on the same test collection, and we can also run automatic evaluation when a new technique or system arrives in the future, because the purpose of Cranfield is to provide a reusable benchmark. But there are also some known disadvantages: there can be a gap between the designed experimental setting and final user satisfaction, because it is very hard to design a metric that is perfectly correlated with user satisfaction, and it is also hard to model the interactive responses in a multimedia content-editing session with a Cranfield-style task.
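To make the Cranfield setting concrete, here is a minimal sketch of how fixed relevance judgments (qrels) let us score any system's ranked run automatically; the `recall_at_k` helper and the toy qrels and run are illustrative placeholders, not AToMiC data or our actual evaluation code.

```python
# A minimal sketch of Cranfield-style evaluation: fixed queries, a fixed
# corpus, and relevance judgments (qrels) let us score any system's run.

def recall_at_k(qrels, run, k=10):
    """Mean fraction of a query's relevant items found in its top-k results."""
    scores = []
    for qid, relevant in qrels.items():
        top_k = run.get(qid, [])[:k]
        hits = sum(1 for doc_id in top_k if doc_id in relevant)
        scores.append(hits / len(relevant))
    return sum(scores) / len(scores)

# Toy placeholders: query -> relevant image ids, query -> ranked image ids.
qrels = {"q1": {"img_3", "img_7"}}
run = {"q1": ["img_7", "img_1", "img_3"]}
print(recall_at_k(qrels, run, k=2))  # 0.5: one of two relevant items in top 2
```

Because the judgments are fixed, the same qrels can score a new system's run years later, which is exactly what makes the benchmark reusable.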
So after talking about the experimental setting, let's talk about the test collection, or the dataset. We have the AToMiC dataset with a paired resource paper, which was accepted to SIGIR 2023. The dataset itself is publicly available, and it is derived from the WIT dataset by Krishna and his colleagues. We use all the images from the WIT dataset, and we parse English Wikipedia for the text collections. For the relevance judgments, we release sparse judgments based on the existing image-text associations in Wikipedia. As for the dense relevance judgments, we expect to release them after the TREC conference.

Here are some current results based on the sparse judgments. First, there is still a gap between the systems that can access the captions, such as the BM25 systems, and those that rely only on the pixel values, such as the CLIP, BLIP, or FLAVA models. Scaling up the pre-training dataset or the model size helps, but it brings only marginal benefit compared to the caption-based systems. A model that performs well on the MS-COCO dataset is also not guaranteed to generalize to the AToMiC dataset. Another important observation is that when we scale up the number of images and documents to be searched to a more realistic setting, such as millions of candidates, model effectiveness degrades rapidly, so this is an important factor to consider when deploying these systems in the real world. Finally, hybrid systems can work better, and fine-tuning the models on the provided training resources helps, but there is still a large gap between these systems and the caption-based systems.

Finally, the main purpose of this presentation: a call for participation and collaboration. You can join the TREC 2023 workshop with us, and submission is free. We think the diversity of Cranfield-style tests is really important, so we are calling for participants from different communities. You can stay updated by following our website, and contributions are always welcome. Let's build community-driven tools like Wikipedia. Here are the references that I used today. If you have any further questions about the dataset, the evaluation, or other settings, feel free to send me an email. Thank you for your attention.
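For prospective participants, here is a minimal, hypothetical sketch of the kind of zero-shot image-text matching baseline discussed above, using the publicly available openai/clip-vit-base-patch32 checkpoint through Hugging Face Transformers; the random images stand in for a corpus, and none of this is the track's official baseline code.

```python
# Hypothetical zero-shot CLIP retrieval sketch; the random images stand in
# for a real corpus such as the Wikipedia images in AToMiC.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255))
          for _ in range(4)]
queries = ["a castle on a hill", "a portrait of a scientist"]

inputs = processor(text=queries, images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_queries, num_images); higher means a
# better text-image match, so a descending sort gives the ranking.
ranking = outputs.logits_per_text.argsort(dim=-1, descending=True)
for query, order in zip(queries, ranking):
    print(query, "->", order.tolist())
```

At AToMiC scale, with millions of candidates, one would precompute and index the image embeddings rather than scoring every query-image pair like this, which is part of why effectiveness at realistic corpus sizes matters.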