Hi everyone, my name is Andrea, and today I'll be presenting our work WikiWeb2M, a page-level multimodal Wikipedia dataset. This work was done with my collaborators and advisers at Boston University, Google, and FAIR.

We are interested in studying web page understanding, which involves many rich modalities: structure from HTML and the Document Object Model, text data from lists, section titles, and paragraphs, and images, which provide additional information and context for web content. In order to study multimodal web page understanding, we first need a dataset that provides all of these web page modalities in one place, and we also need to define downstream tasks that can serve as a web page benchmark suite.

There are multiple motivations for studying web page understanding. Content generation is gaining popularity; in particular, we may be interested in creating web stories that provide short educational snippets on topics, or interactive visual summaries. From an accessibility standpoint, there are many use cases where automating tasks related to the web could be helpful, for example captioning images that are missing alternative text descriptions, or summarizing long bodies of page content for smoother browsing. Lastly, we can generally improve a user's interaction with web content by better representing web pages and retrieving the most relevant pages for a particular search.

We curate a new dataset, the Wikipedia Webpage 2 Million, or WikiWeb2M, which includes over 2 million English Wikipedia articles. We keep all sections, text, and images, and retain section structure to relate content from a single web page. Compared to the most similar prior work, the Wikipedia Image Text (WIT) dataset, we retain more sections and images. For a direct comparison, we start with structural, heading, and text-only sections from these web pages; WIT did not retain text from these sections, as its samples were image-caption pairs. Additionally, we keep structural and heading sections, where structural sections have section titles and child sections, while heading sections do not have subsections; heading sections sometimes link to other articles, contain tables, or are empty. We also have more sections containing images only, as well as both images and text. In total, we have over 8.6 million more sections. For images, we have almost a million more total images, and we report both unique and total counts, as an image may appear across multiple web pages.

With our new WikiWeb2M dataset of multimodal web pages, we need to define a benchmark suite of downstream tasks. Specifically, we define multimodal web page understanding tasks at three levels of granularity. First, at the element or local level, we propose contextual image captioning, where we need to generate a caption given an image and the remaining web page context. Then, at an intermediate or section level, we can do section summarization, where we need to generate a single-sentence summary for a section, given its content and the remaining web page context. Lastly, at a global or whole-page level, we can perform page description generation, where, given the entire web page, we want to generate a relevant description.

We model all of our tasks with the same T5 encoder-decoder framework, and use ViT to embed our input images. Here we illustrate what our model pipeline looks like with data from our section summarization task. In green, we have the images and the target section highlighted, and in blue, we have inputs from the other sections of the web page. Images are encoded with ViT and fed into our T5 encoder along with the embedded text tokens. Then the decoder attempts to generate a meaningful summary.
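To make this pipeline concrete, here is a minimal sketch of how a page's sections and images could be fed to a single T5 encoder-decoder for section summarization. It assumes Hugging Face transformers with the public t5-base and ViT checkpoints; the Section container and the flattening scheme are illustrative stand-ins, not the released data schema or our actual training code.

```python
# Illustrative sketch only: a hypothetical Section container and a simple
# flattening of page text; not the released schema or training code.
from dataclasses import dataclass, field
from typing import List

import torch
from PIL import Image
from transformers import (T5Tokenizer, T5ForConditionalGeneration,
                          ViTImageProcessor, ViTModel)

@dataclass
class Section:
    title: str
    text: str
    images: List[Image.Image] = field(default_factory=list)

tokenizer = T5Tokenizer.from_pretrained("t5-base")
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
# Map ViT patch features into the T5 embedding width (same size for these
# base checkpoints, but the projection keeps the sketch general).
project = torch.nn.Linear(vit.config.hidden_size, t5.config.d_model)

def summarize_section(sections: List[Section], target_idx: int) -> str:
    # Flatten the whole page: every section's title and text, with the
    # target section marked so the model knows what to summarize.
    parts, images = [], []
    for i, sec in enumerate(sections):
        tag = "target section:" if i == target_idx else "context section:"
        parts.append(f"{tag} {sec.title}. {sec.text}")
        images.extend(sec.images)
    text_ids = tokenizer(" ".join(parts), return_tensors="pt",
                         truncation=True, max_length=512).input_ids
    text_embeds = t5.get_input_embeddings()(text_ids)

    # Encode the page images with ViT and concatenate their patch embeddings
    # with the text token embeddings as one encoder input sequence.
    if images:
        pixels = processor(images=images, return_tensors="pt").pixel_values
        image_embeds = project(vit(pixels).last_hidden_state)
        image_embeds = image_embeds.reshape(1, -1, image_embeds.size(-1))
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    else:
        inputs_embeds = text_embeds
    mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

    # The decoder generates the one-sentence section summary.
    out = t5.generate(inputs_embeds=inputs_embeds, attention_mask=mask,
                      max_length=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

The same interface extends to contextual image captioning and page description generation by changing what is marked as the target and what the decoder is asked to produce.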
We show some of our experimental results for each task, reporting BLEU-4, ROUGE-L, and CIDEr metrics. For section summarization and image captioning, we compare performance when inputting only the target section versus all sections from the web page. In other words, for section summarization, we try inputting only the target section to be summarized, and for image captioning, we try inputting only content from the section the image originated from. In both cases, it is clear that using all of the data available per page in WikiWeb2M improves performance. Page description generation inherently requires the entire page, and is only made possible with our new dataset.

Thank you for listening, and please check out our GitHub for access to the data and extra details on the dataset. We will also be releasing an extended version of our work soon on arXiv, and we'll link the paper on the GitHub page when it is available.