Hi, I am Atatiyari, and I am going to present our work on cross-lingual multi-sentence fact-to-text generation, that is, generating factually grounded Wikipedia articles from Wikidata.

First, let us look at the motivation behind this work. Our work builds on XAlign, which proposed the task of cross-lingual fact-to-text generation, in which English facts are used to generate content for low-resource Indian languages, and which also put forward a dataset for this task. XAlign focuses only on single-sentence generation: given a set of facts, the model generates one sentence. We explore the problem of paragraph-level generation, which is a naturally more challenging task, since the longer the generated text, the greater the model's propensity for hallucination. To address this, we put forward several explicit measures for reducing hallucination in this setting. All in all, our work represents a step towards automating the pipeline of generating natural-text Wikipedia articles for a given entity.

The contributions of our work are summarized here. First, we construct a dataset for paragraph-level cross-lingual fact-to-text generation with a clearly defined train-test split, based on two important attributes: coverage and coherence. Second, we investigate several methods for clustering facts, namely statistical and end-to-end. Third, we investigate the efficacy of several explicit hallucination-reduction measures for this task: coverage-based prompting, a grounded beam search, and RL rewards based on coverage with the source and with the reference. Finally, we propose the XPARENT metric, an extension of the PARENT evaluation metric that can handle cross-lingual data. Its merits are that it offers more interpretability in terms of precision and recall, and that it can handle divergent references.

This figure describes the overall pipeline used in our work. Starting with the baseline dataset, we construct a multi-sentence cross-lingual fact-to-text generation dataset by concatenating sentences. Coverage and coherence metrics are used to create the training and test splits. Next comes fact organization, which involves fact clustering, that is, grouping related facts together, and fact ordering, which decides the order of these fact clusters. Following fact organization, coverage prompts are added to the input to inform the model of the quality of the available reference. This is fed to a pre-trained mT5 model, and RL rewards are used to guide the training. Next, a grounded decoding strategy that attends to the source is used, after which we obtain a natural-language output, which is evaluated using XPARENT.

Next, we describe the dataset used. As previously said, we use the existing XAlign dataset to construct our dataset: we concatenate the single-sentence examples for every entity in the order in which they appear in the article to obtain a paragraph-level dataset. The resulting dataset contains over 100,000 data points across 12 different languages, with several examples of more than three sentences.
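To make this construction step concrete, here is a minimal Python sketch of the concatenation procedure; the field names (entity_id, position, facts, sentence) are illustrative placeholders, not the actual XAlign schema.

```python
# Minimal sketch of the paragraph-level dataset construction.
# Field names are illustrative, not the actual XAlign schema.
from collections import defaultdict

def build_paragraph_examples(single_sentence_examples):
    """Group single-sentence examples by entity and concatenate them
    in the order the sentences appear in the source article."""
    by_entity = defaultdict(list)
    for ex in single_sentence_examples:
        by_entity[ex["entity_id"]].append(ex)

    paragraph_examples = []
    for entity_id, examples in by_entity.items():
        # Assumed: 'position' records where the sentence occurs in the article.
        examples.sort(key=lambda ex: ex["position"])
        paragraph_examples.append({
            "entity_id": entity_id,
            "facts": [fact for ex in examples for fact in ex["facts"]],
            "text": " ".join(ex["sentence"] for ex in examples),
        })
    return paragraph_examples
```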
To create a high-quality test set, the instances were partitioned based on two metrics, coherence and coverage, which we describe in the next slides.

First, coherence is defined as the quality of being logical and consistent. This matters because the multi-sentence examples we construct are not always coherent. Since no dataset for coherence classification exists for Indian languages, a synthetic dataset was constructed from pairs of sentences in Wikipedia articles: positive examples were consecutive pairs of sentences, whereas negative examples were randomly permuted sentences. A classifier was trained on this dataset, and a coherence score was assigned to each paragraph as the average coherence of every consecutive pair of sentences.

Coverage is a measure of how well the facts are represented in the sentences. This is an important measure because XAlign is a partially aligned dataset: not all the facts in the reference are present in the source, as can be seen in the example here. For this, a dataset was constructed by removing facts from the WebNLG dataset and then manually annotating these samples with coverage scores. A coverage classifier was trained on this dataset, and each example in our dataset was scored with it.

Next, we move on to fact organization. The intuition behind this method is that facts often occur together; for instance, date of birth and date of death appear together in many Wikipedia articles. Making this organization explicit can help the model generate higher-quality text. In the baseline approach, a BERT-based classifier trained on ground-truth data predicts the number of sentences, and clusters of facts are then obtained using spectral clustering, with the number of clusters equal to the number of sentences predicted by the classifier. In the end-to-end approach, the problem is treated as a text-to-text task, and an mT5 model is used to obtain the clusters from a given set of facts, as can be seen in the example here. This step yields not only the grouping of the facts but also the order of these groups.

Multiple generation methods were explored as part of our work. As baselines, mT5 models trained on end-to-end generation were investigated: one trained on single-sentence generation and one trained on multi-sentence generation. In our fact-clustering approach, a sentence is generated for each fact cluster separately, and the sentences are then stitched together.

Next, coverage prompts were investigated. Here, a coverage prompt of low, medium, or high is provided at training time; the value is obtained using the coverage classifier and informs the model of the quality of the available reference. During inference, a prompt of high is always provided, so as to generate text that aligns closely with the reference.

Next, a grounded decoding strategy that attends to the source was used: the coverage score of each candidate token is added to its generation probability score (see the sketch below). To increase vocabulary overlap, selective script unification is used here, with Indo-Aryan languages represented in the Devanagari script and Dravidian languages in the Malayalam script.

Finally, to further improve the quality of generation, RL-based rewards were used. A reward based on coverage with the source was defined using the entailment probability of the generation with respect to the source; this measures how probable a token is given the source even when it is absent from the reference, and thus handles divergent references. A second reward, based on coverage with the reference, was defined using a BLEU-based score. These rewards are summed during sentence-level generation.
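As an illustration of how these two rewards might be combined, here is a minimal sketch; the entailment scorer is passed in as an assumed callable, and the weights and scaling are illustrative rather than the values used in our experiments.

```python
# Minimal sketch of the combined RL reward. The entailment scorer and
# the weights are assumptions for illustration.
import sacrebleu

def generation_reward(generated, reference, source_facts, entailment_prob,
                      w_source=0.5, w_reference=0.5):
    """entailment_prob(premise, hypothesis) -> float in [0, 1] is assumed
    to be an NLI model scoring whether the facts entail the output."""
    # Coverage with source: how strongly the linearized facts entail the
    # generated text. This rewards output supported by the source even
    # when it is absent from the reference (divergent references).
    source_reward = entailment_prob(" ".join(source_facts), generated)

    # Coverage with reference: a BLEU-based reward against the reference,
    # rescaled from sacrebleu's 0-100 range to [0, 1].
    reference_reward = sacrebleu.sentence_bleu(generated, [reference]).score / 100.0

    return w_source * source_reward + w_reference * reference_reward
```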
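And here is the sketch of the grounded decoding step referred to above: a simplified rescoring of candidate tokens in which a coverage bonus is added to the model's log-probability. The set-membership test and the bonus weight alpha are illustrative simplifications of the actual coverage score.

```python
# Sketch of grounded rescoring inside beam search: each candidate token's
# log-probability gets a bonus when the token is grounded in the
# (script-unified) source facts.
import math

def grounded_scores(token_logprobs, source_token_ids, alpha=1.0):
    """token_logprobs: dict mapping candidate token id -> log P(token | prefix).
    Returns rescored values used to rank beam candidates."""
    source = set(source_token_ids)
    rescored = {}
    for token_id, logprob in token_logprobs.items():
        # Coverage bonus: log(1 + alpha) if the token appears in the source.
        bonus = math.log1p(alpha) if token_id in source else 0.0
        rescored[token_id] = logprob + bonus
    return rescored
```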
Next, we describe the XPARENT metric that we propose. BLEU and ROUGE are metrics that rely only on the reference text, which is problematic when the reference and the source do not align entirely. PARENT addresses this problem by aligning the n-grams from the generated and reference texts to the semi-structured input. However, PARENT is defined for a monolingual setting in which all three texts, the source, the reference, and the generated text, are in the same language. Our metric uses cosine similarity instead of exact string matching. It shows a greater correlation with human evaluation and offers more explainability in terms of precision and recall.

The performance of the various methods investigated, based on BLEU and XPARENT-source, is summarized here. As can be seen, each method offers an incremental improvement over the previous one, with the most sophisticated method, which uses fact clustering, coverage prompting, grounded decoding, and coverage-with-source RL rewards, showing the strongest performance on both BLEU and XPARENT.

That brings us to the end of our presentation. In conclusion, we explore several methods for improving the quality of generation in cross-lingual fact-to-text generation at the paragraph level. We also propose the XPARENT metric, which is more nuanced than traditional NLG metrics. Our proposed method shows improvements on multiple metrics across many different languages.
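Finally, to make the soft matching at the core of XPARENT concrete, here is a minimal sketch of how cosine similarity can replace PARENT's exact n-gram match; embed() is an assumed cross-lingual embedding function, and the 0.7 threshold is illustrative.

```python
# Illustrative sketch of soft n-gram matching: an n-gram counts as
# entailed by the facts if every token has a close-enough match among
# fact tokens under a cross-lingual embedding.
import numpy as np

def soft_entailed(ngram_tokens, fact_tokens, embed, threshold=0.7):
    """True if every token of the n-gram has a close match among fact tokens."""
    if not fact_tokens:
        return False
    fact_vecs = np.stack([embed(t) for t in fact_tokens])
    fact_vecs /= np.linalg.norm(fact_vecs, axis=1, keepdims=True)
    for token in ngram_tokens:
        v = embed(token)
        v = v / np.linalg.norm(v)
        # Best cosine similarity of this token against all fact tokens.
        if float(np.max(fact_vecs @ v)) < threshold:
            return False
    return True

def soft_precision(candidate_ngrams, fact_tokens, embed):
    """Fraction of candidate n-grams softly entailed by the facts,
    i.e. the precision half of an XPARENT-style score."""
    matches = sum(soft_entailed(ng, fact_tokens, embed) for ng in candidate_ngrams)
    return matches / max(len(candidate_ngrams), 1)
```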