Okay, thank you for joining this talk about ODF normalization. Over the last few days I discovered that this is a very bad name, because nobody knows what it is supposed to mean. So I came up with a better name, but it is very long: "Everything you never wanted to know about ODT files written by LibreOffice." It's a little bit longer, but it's also not true. So I think this one is better: "One thing you never wanted to know about ODT files written by LibreOffice." Okay, it doesn't look absolutely correct, but I think you get the idea.

What is the motivation? We have a customer project with a document generation system. They use LibreOffice to create ODT files as templates, and they use a special language and engine to generate documents from those templates: FreeMarker and JODReports, a Java framework. Generally it's just a template system. There is a script language, and with it you can do ifs and loops, you can replace placeholders, you can display data and so on. So they create documents from templates. That is the basic part.

If you look inside the ODT file, into the XML, you will see something like this. You have a script element with an if and some condition, then some text, and then the end of the if. FreeMarker runs, interprets these expressions and replaces the script elements. At the end, if the condition is true, you get this result here.

Now the main problem, and the motivation for this talk. Let's say you have this text in LibreOffice; this is a screenshot from LibreOffice: "Word1 Word2". Just two words, no different formatting. But it turns out that if you look inside the XML and the ODF model behind it, you get different representations. All of these are valid representations according to the ODF specification.
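To make this concrete, here is a minimal Python sketch. It is not taken from the slides: the four XML fragments are hand-written illustrations of the kind of variation described in the talk, and the helper names (`paragraph`, `visible_text`, `span_count`) are made up for this example. It shows that the visible text is identical while the tree structure differs.

```python
import xml.etree.ElementTree as ET

# The ODF text namespace; the fragments below are simplified and
# leave out the style-name attributes a real file would carry.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SPAN = "{%s}span" % TEXT_NS


def paragraph(inner):
    """Wrap some inner XML in a text:p element (illustrative helper)."""
    return f'<text:p xmlns:text="{TEXT_NS}">{inner}</text:p>'


# Four hand-written ways to store the same two words.
VARIANTS = [
    paragraph("Word1 Word2"),
    paragraph("<text:span>Word1 Word2</text:span>"),
    paragraph("<text:span>Word1 </text:span><text:span>Word2</text:span>"),
    paragraph(
        "<text:span><text:span><text:span>Word1 Word2"
        "</text:span></text:span></text:span>"
    ),
]


def visible_text(xml):
    """Concatenate all text nodes, ignoring the element structure."""
    return "".join(ET.fromstring(xml).itertext())


def span_count(xml):
    """Count text:span elements anywhere below the paragraph."""
    return sum(1 for _ in ET.fromstring(xml).iter(SPAN))


if __name__ == "__main__":
    for xml in VARIANTS:
        print(repr(visible_text(xml)), "spans:", span_count(xml))
```

A tool that only looks at the text sees four identical paragraphs; a tool that walks the tree sees four different documents.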
For example, you get a paragraph with one text span inside, or a paragraph with two text spans, or a paragraph without any text spans. The last one is the most interesting: you get a paragraph with a text span inside a text span inside a text span, and so on. It's all valid, and each one is a possible result.

So it turns out that the tree structure of the model is not predictable, if you are interested in the structure of the document. That's okay if you use LibreOffice for the complete chain of your document generation; then it's not a problem. But if you use a different ODF composer or, for example, automatic processing tools that work only on the ODT files, it's not so easy. It's a challenge.

One example: you have a conditional part here, an if with the condition "bool value". This case is not a problem, because if the condition is false, the whole part is cut out and thrown away. But if something like this happens in LibreOffice, you get a problem, because the closing if sits at a different depth in the tree than the opening expression. If my bool value is false, you get something like this, and this is not a valid XML file.

Okay, so that's the motivation, that's the problem. And there are additional problems. If you want to diff ODT files, you look into the XML, and you make some simple changes to the document, you will probably get many different structures and a really large diff. It's not so easy to understand what the semantic change was; you only get a bunch of XML changes. The same goes for regression testing: if your automatic tests look into the XML, you sometimes get a different structure, and you get false positives. Interoperability is also a problem when you do document generation without LibreOffice.

So what can we do? There are different solutions.
For example, you can say: okay, don't do it like this, throw FreeMarker and so on away. But as you can imagine, there is a huge investment behind it. The second possible solution is to fix the most annoying issues, and this is what I call normalization. We have an ODT file with what I call an unpredictable tree structure, I do some transformations, and then I get a normalized structure that I can use for automatic processing and comparison.

Okay, let's have a look. This is one example. You have text spans inside text spans, and if you look at the attribute values, they essentially have the same values. So there is no difference in formatting. It's just an XML difference, and in the viewer, in LibreOffice, you cannot see a difference. So you can say: okay, I merge these two text spans.

If you do something like this, you have different options. You can use style names for the comparison. You can say: this text span has style A and the next text span has style A, so you can merge them. But it would be very nice to also compare the values of the attributes. So the idea is a style resolver: try to find out which attribute values are actually used by the ODF composer, in this case LibreOffice, and check whether all the values are equal.

Just as a simple explanation, what does a style resolver look like? You have a style here, and in ODF all styles have parents, and the parent can have a parent as well, and so on. Additionally there are default styles, and there are defaults built into the ODF consumer. Those are implementation details, but usually they are not needed. The style resolver has to merge all of these attributes together, and then we can compare the elements and their attributes. So with a style resolver we can say: okay, this element and this element have the same format.
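The resolver idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation from the project: the style table, the property names and the functions `resolve` and `same_format` are all hypothetical, and a real resolver would read `styles.xml` and `content.xml` and handle style families separately.

```python
# Hypothetical, hand-written style table: each style names its parent
# and lists only the properties it sets itself.
STYLES = {
    "Standard": {"parent": None,
                 "props": {"font-size": "12pt", "font-weight": "normal"}},
    "T1": {"parent": "Standard", "props": {"font-weight": "bold"}},
    "T2": {"parent": "Standard", "props": {"font-weight": "bold"}},
    "T3": {"parent": "T1", "props": {"font-style": "italic"}},
}

# Assumed stand-in for the document's default style.
DEFAULTS = {"font-size": "10pt", "color": "#000000"}


def resolve(name, styles=STYLES, defaults=DEFAULTS):
    """Return the effective properties of a style.

    Walks the parent chain, then applies defaults first, ancestors
    next and the style itself last, so the nearest definition wins.
    """
    chain = []
    seen = set()
    while name is not None:
        if name in seen:
            raise ValueError(f"cyclic parent chain at style {name!r}")
        seen.add(name)
        style = styles[name]
        chain.append(style["props"])
        name = style["parent"]

    resolved = dict(defaults)
    for props in reversed(chain):  # root ancestor first
        resolved.update(props)
    return resolved


def same_format(style_a, style_b):
    """Two styles are equivalent if their resolved properties match."""
    return resolve(style_a) == resolve(style_b)


if __name__ == "__main__":
    print(resolve("T1"))
    print(same_format("T1", "T2"))  # different names, same values
    print(same_format("T1", "T3"))  # T3 adds italic on top of T1
```

Comparing resolved values rather than names is what lets two spans with styles `T1` and `T2` be recognized as having the same format.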
Another example is merging identical siblings. They are neighbors, and you can say: okay, these two neighbors can be merged. A further transformation is this case: you have a text span inside a paragraph, and you can remove the text span because it carries no additional information. The formats of the paragraph and the text span are identical, so you can remove it.

Okay, so this is a very special solution for a very specific problem. It's not a generic solution. If you want to do this for all possible cases in ODF files, you have a lot more to do. But it's an evaluation, a first step.

You can do more transformations. You can support documents from different ODF composers, different products. You can support different languages: if you have different languages, you have to convert the values, centimeters and inches and so on, the names of default styles and so on. These are typical problems. You can try to merge styles if they are not necessary. You can create predictable style names, and so on.

Okay, so this was very short. Thank you.
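The two transformations mentioned in the talk, merging identical neighbors and removing a redundant span, can be sketched with Python's `xml.etree.ElementTree`. This is a simplified illustration, not the project's code. It makes two loud assumptions: neighbors count as identical when their `text:style-name` values are equal (no style resolver), and a span without any style name is treated as carrying no formatting. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SPAN = "{%s}span" % TEXT_NS
STYLE = "{%s}style-name" % TEXT_NS


def merge_adjacent_spans(parent):
    """Merge neighboring spans with equal style names (assumption:
    equal names mean equal format) when no text sits between them."""
    i = 0
    while i < len(parent) - 1:
        a, b = parent[i], parent[i + 1]
        if (a.tag == SPAN and b.tag == SPAN and not a.tail
                and a.get(STYLE) == b.get(STYLE)):
            # Append b's leading text to the end of a's content.
            if len(a):
                a[-1].tail = (a[-1].tail or "") + (b.text or "")
            else:
                a.text = (a.text or "") + (b.text or "")
            a.extend(list(b))   # move b's children into a
            a.tail = b.tail     # text after b now follows a
            parent.remove(b)
        else:
            i += 1


def _append_text(parent, index, text):
    """Attach text just before position `index` inside parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap_unstyled_spans(parent):
    """Replace spans without a style name by their content
    (assumption: such a span adds no formatting)."""
    i = 0
    while i < len(parent):
        span = parent[i]
        if span.tag == SPAN and span.get(STYLE) is None:
            children = list(span)
            _append_text(parent, i, span.text)
            if children:
                last = children[-1]
                last.tail = (last.tail or "") + (span.tail or "")
            else:
                _append_text(parent, i, span.tail)
            parent.remove(span)
            for offset, child in enumerate(children):
                parent.insert(i + offset, child)
            # Do not advance: re-check whatever now sits at index i.
        else:
            i += 1


def parse_paragraph(inner):
    """Build a text:p element from inner XML (illustrative helper)."""
    return ET.fromstring(f'<text:p xmlns:text="{TEXT_NS}">{inner}</text:p>')


if __name__ == "__main__":
    p = parse_paragraph(
        '<text:span text:style-name="T1">Word1 </text:span>'
        '<text:span text:style-name="T1">Word2</text:span>'
    )
    merge_adjacent_spans(p)
    print(ET.tostring(p, encoding="unicode"))
```

In a fuller version, the style-name comparison would be replaced by a comparison of resolved attribute values, and the "no style name" test by a check that the span's resolved format equals the paragraph's.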