Hello everyone, this is an introduction to Writer development. I'm Miklos, I've been working on LibreOffice for a while now. I initially worked on the RTF import/export as a student, then focused on Writer in general, and I now do this at Collabora. The focus of this talk is to give you a kind of guide or tutorial on Writer internals, because it seems that we have reasonably good reference documentation: in case you want to go to this or that C++ class and understand what it does, then it will often have good Doxygen documentation, or we have some nice wiki pages or READMEs that describe this and that — but how to get started, how to get a good overview, is more tricky. One of the existing resources is a similar Writer course from almost 10 years ago, so it seemed like a good idea to refresh that and build some material that is up to date at least as of this year.

First, when you do any kind of Writer development, there are a number of tools that can help you. It's good to be aware of them, because then you can search for them and see what they do; if you don't even know that such a tool exists, then it's very hard to search for it. So be sure that you are reasonably familiar with Git: doing git log on files, running git blame on chunks of code, and doing git bisect to see where some good or bad behavior started is useful. Make sure that you can navigate the codebase efficiently. That might be with some IDE integration, or there is ctags/etags, which can index the codebase for you offline so that you can quickly jump to the definition of some member function. Be aware of docs.libreoffice.org, which is the Doxygen output for the codebase; there you can nicely see, for example, which subclasses a certain C++ class has, and things like that. Be familiar with a debugger — I only know GDB, but it can be any of them.
Note that the UNO API has specific debugging tools, like the document model inspector in the Tools menu, or for example X-Ray, a Basic extension that can do this for you. Be familiar with the various units that we use: in the dev-tools Git repo there is a unit conversion tool which will quickly convert, say, centimeters to twips for you, and so on. Be sure that you are familiar with these units. Be sure that you have at least one editor that you are comfortable with, and focus on that one. For myself that's Vim, but it can be Emacs or VS Code or whatnot — just be sure that you are effective with that one. In many cases there will be some larger amount of data that you want to print and interpret. Know that we have the SAL_DEBUG macro for local debugging output, which uses the built-in pretty-printers of objects, so for example a Writer cursor position is pretty-printed nicely. We also use pretty-printing in other cases: ODT or DOCX files are really XML files in a ZIP file, so in case your editor can open ZIP files, you can edit the XML files in place and have pretty-printing for them. Also, for binary DOC files we have a dumper which will give you more or less readable output of what's in the binary. And there are the various specifications for these file formats — it's good to know where they are, and in case some element or some markup is unfamiliar to you, it makes sense to read them.

So where is the code for Writer? LibreOffice core has many modules; here is a subset that is potentially interesting for Writer hacking. There is the sw module, standing for StarWriter — that's where most of the code is. Some of the ODF import/export code is in the xmloff module. There is a dedicated writerfilter module for the DOCX and RTF import, next to the shared OOXML parts.
There is the oox module, and the OOXML import and export for the math equations happens in the starmath module. Let's try to go through the various layers of Writer. There is the document model, which is the set of C++ classes representing what will be edited by the UI, what's loaded from files, and what's serialized to files. So the document model is the model from the model-view-controller (MVC) paradigm; our view is called the layout, and in the codebase the controller is roughly the shell, in case you want a mental model of that. SwDoc is one open Writer document, and inside that the most important building block is the list of paragraphs — we call those text nodes — so the SwDoc has a list of nodes. Writer then has various other containers in the document model: it has a list of page styles, it has Writer sections, and inside paragraphs the individual pieces of characters having the same character formatting are text portions. This more or less maps to what Word has, which is used as a reference in many cases. Word has sections, and sometimes we map those to page styles and sometimes to Writer sections. Paragraphs are the same, and on Word's side the smaller things inside the paragraphs are called runs, which are really similar to text portions.

So one question is how properties are stored inside the text node. The text node has the paragraph text as a Unicode string, and then each and every property on that paragraph is stored as a pool item. A pool item is something that can store a string, a number, a boolean and so on, and it's stored in an item set, which is a container for these pool items. So it's kind of a map which has integer keys and can store basically any value. This key is called a which ID, and for Writer there is a specific C++ header which contains all of these IDs.
So what you see in the class diagram here is that at the layout level we will have a text frame for the paragraph, and at the document model level we will have a text node for the paragraph; that has a Unicode string, and then there will be the various pool items in the attribute set for the paragraph. Inside the frame, at the layout level, we will have the portions — we will get to that in a minute. A bit more on this container for the pool items: it's more or less a map, but a special one. It knows what ranges of keys are possible to store there; if you try to store a pool item which is outside those ranges, that will be ignored, and you can debug what is stored there, for example by looking at its count. For the character attributes there is a separate container — we call those hints — and a hint may have just a starting point (for example, a field is always a single dummy character), or it may have a start and an end (for example, if you mark three characters as bold, then you want a start and an end for them). Now, in case you want to debug the document model, you have a number of approaches, and it's good to be familiar with them, because in different situations different ones are more useful.
One approach is that in GDB you can call the GetNodes() function, and that will give you the list of paragraphs right inside GDB. Another one is the document model XML dump: you can press F12 or Shift+F12 to generate the layout dump or the document model dump, and in the current directory there will be an XML file for you, which will contain C++ pointers and all kinds of internal details. And the last thing is the X-Ray extension, which you can run for example inside the Basic IDE; it's kind of an improved print statement, and it gives you good access to the various UNO properties of an object, what functions are supported, what UNO interfaces, and so on. It's basically a debugger, but it has domain knowledge about our UNO concepts, and based on that, sometimes that's the better way. I see that I'm a little bit behind schedule, so I will skip the actual interactive demo; the point is that in the sw module's README you can find the instructions on how to get X-Ray and the XML dump working, and it's a good idea to try that out with some simple two-paragraph document.

Now to the UNO API. This is something public, so we try not to change it, but at the same time most of the ODF filter is using the UNO API, so whenever you add a new feature, you typically want to expose it in the UNO API as well. The hope is that this is a bit higher level than the raw C++ code, which means that it's possible to refactor or change the C++ core without breaking users of the UNO API, so there is a bit of separation between the two. When I implement a new feature, my preferred way is to do the UNO API right after having the document model, because then I can exercise the feature from a simple Basic macro.
Another approach is to implement the UI early, and then you can just use the UI — that also works. It has the downside that perhaps somebody will try the UI too early and conclude that the feature is broken (because of course it's not yet ready) and complain, so that's one problem there, but it's fair enough to use that approach as well. For the actual properties, the way we expose them in the UNO API is that most pool items have a QueryValue and a PutValue function, and based on that you can map UNO properties to one part of a pool item. So one pool item is mapped to one or more UNO properties.

Then we get to the layout. The layout is basically a giant cache. You can see that if our document model is just a list of paragraphs, then it's very expensive to actually paint pixels on the screen, because you don't know where the pages are, and so on. So what we try to do is create a representation of how things will appear on the screen: what the logical position of a paragraph will be, its left/top coordinate, the right one, the bottom one, where the pages will be, and so on. We initially build this, and later we incrementally update it; based on that, the actual rendering will be relatively cheap, because the layout already contains exactly what piece of text to paint where. So there is this mechanism that for each and every node there is a frame — for text nodes you have text frames, and so on. Here is a diagram of how things look in the layout.
There is one root frame, and in this case the document has two pages; inside those you can have headers, footers and body frames, and inside the body frames you can have the text frames, and these can be split between pages. There are various pointers to navigate this graph — up, down, to the next one, to the previous one — and even pointers logically connecting things: a paragraph split between the end of one page and the start of the next page has a follow and a precede. Inside the paragraph there are no more frames; there are portions. We have dedicated portion types, for example for the start of a line and for a normal run of text, and these are held in a kind of list. Then we have notifications from the nodes to the frames; this is called the SwModify/SwClient mechanism, and in general this means one model can notify one or more layouts. More recently we have added support at the layout level for redline (change tracking) hiding and showing, so there are cases where multiple paragraphs are merged into a single text frame, or a single text node is visible in multiple text frames; this paragraph merging is one way to realize that one model position corresponds to one or more layout objects. Then, as an aside: in the Word case, shapes with complex geometry — triangles, rounded corners and whatnot — can contain complex content. In Writer core these are represented as a combination of a draw shape with the complex geometry and a kind of invisible Writer text frame inside it, which can hold the complex content; but at the UNO API and UI level this separation is not visible, it's just a pair that works together.
Then, filters: a filter stores the document model to a file and loads it back. In practice we try to do a really good job with ODF, and for every other format we try to do a good job, but it can always happen that there is something in Writer which is not stored in one of those formats. For the ODF filter, most of the code uses the UNO API and lives in the xmloff module. Some Writer-specific bits are inside Writer core, in the filter/xml directory of the sw module, but that's really rare — in case it's not in xmloff and you wonder where the magic is, then that is the magic, but otherwise it's not that frequent that you need to dive into that code when chasing a bug. Also, when we extend the ODF filter, we always try to write up some schema bits about what the new markup is, and you can submit ODF proposals to have your new markup included in the next version of ODF.

DOCX is the other important filter, because the majority of new Word documents are created in that format, and people expect that we are able to read and write it. The import side is in writerfilter and uses the UNO API; there is some code generation involved there — I will get to that in a moment — and it has its own debugging output, the SW_DEBUG_WRITERFILTER output, which is an XML file that logs the traffic between the two steps of the DOCX import: one is tokenization, and the other is handling those tokens. The export side has a much simpler architecture: it's a subclass of the binary DOC export, and it's just C++ code — no UNO API, no code generation. Some parts of the DOCX filter, like the shape import/export, are shared with XLSX and PPTX. There is also VML, the older shape markup (not the newer drawingML); that's mostly interesting for the export side, because at import time we only look at drawingML. Also, Math is kind of living on its own island there: it has its own OOXML import. The fast parser is
an important concept nowadays; even the ODF filter is using it, although initially only the OOXML filter did. We deal with a lot of namespace strings, element names, attribute names and attribute values, but we know what the possible values are, because they're in the spec. So what we can do is generate a large map for them: first we map the strings to integers, and then we work with those IDs later, which is cheaper than copying strings around and comparing them. The fast parser does this. The other thing is that you can easily end up with a huge C++ class handling the entire import for an entire file format. What we try to do, especially in the ODF filter, is to have a dedicated C++ class for each and every XML element — we call these contexts — and when we see a sub-element, a child element, we create a child context for it. It's good to familiarize yourself with that concept before debugging.

Now, for the DOCX import, some code generation happens at build time. There is a model XML file, which is the RELAX NG schema from the OOXML spec, annotated with our own information: we take XML elements and attributes, and we map XML elements to property-modifier tokens, which can contain attribute tokens. The RTF import does the same, so later, when we handle the tokens, we can share code between the two formats. What we do there is, once the OOXML RELAX NG schema is in place, we add matching resource XML elements, and those define how to map these XML elements and attributes to tokens.

Now for the binary DOC filter: it's the oldest Writer importer and exporter. We used to have the binary StarOffice format, which was even older — it lived in the binfilter module, but that's now removed. The import and export are somewhat shared, because in many cases what the file format
has is more or less a memory dump of the in-memory Word structures, so it makes sense that we have one C++ class which knows how to map, let's say, one Word section to something in Writer, and the same class will take care of reading the Writer document, building the same structure and writing it to the stream. It uses the internal API, which means that sometimes you need to type a lot of code to do something fairly simple; on the other hand, it has random access to the document model, and that's sometimes useful. When we import, we can easily look back, look around, and make decisions based on that: let's say we are ending a paragraph and we want to know whether the last floating table we imported ends there or not — that's very easy to do with full random access to the internal document model, and it would not be that easy with the UNO API. Here again, some XLS and PPT parts are shared, like the binary shape format — that's in the msfilter directory of the filter module. When I started working on the binary DOC format it was very hard, because it's not that readable, and I did not really find a good dumper that would give you some more readable, XML-like output for a binary file. So I took the mso-dumper project, which already had nice XLS and PPT support, and added a DOC one — you might want to use that when debugging.

RTF: RTF is my favorite one, although it's kind of underused nowadays, so it's not that relevant, except perhaps for copy&paste. The export is shared with DOC and DOCX — it's the same subclass as DOC and DOCX on the export side. The import is shared with DOCX: basically we take the RTF control words and turn them into tokens, and then those tokens are handled by the same domain mapper — the mapping from the Word domain to the Writer domain — that will actually perform the UNO calls. On the Math side this is really similar to DOCX: the import will generate OOXML tokens
and the export is shared with DOCX as well.

Now let's say that you have some understanding of how the codebase looks, you make some change, and you want to test it — test it as in: you want to make sure that ideally no old behavior is broken, and you want to lock down the new behavior so that it remains fixed. The filter tests are kind of easy, especially in case you work with ODF or OOXML, because the export result is some XML, so you can use export asserts there, and for the import result you can assert what's in the document model. For the binary or RTF formats that's a little bit more complicated, and in general the rest is more tricky. The easy case is when you can find a similar test which already does something like what you want. Failing that, what we typically do is look at how the UI normally exercises the piece of functionality, look in the debugger at what C++ functions it calls, and try to mimic what the UI does in a C++ test — that's how we can exercise some functionality and then assert the result. For the UI, what we typically do is that the code is kind of parallel to the UNO API — the UNO API is kind of one UI, and the actual UI is another one: we fill in some item set with various properties, fire the dialog, and once the dialog is done we get back some item set and update the document model. Then it's good not to forget that we have online help that tries to describe the functionality of the suite, so in case you did some change, it's always a good idea to spend five minutes to see if the old behavior was described in the help and then update that; or in case you added something new, try to extend the relevant help page, so that users have an understanding of what you implemented. And I briefly mentioned this already: ODF is not just a specification similar to DOCX or DOC or RTF, but something maintained by OASIS, and
you can submit proposals there. So what we do is: we always update the ODF filter first, adding the new functionality there, but then we submit a proposal to OASIS, and our intention is that the next ODF version will have the feature in the standard, so that other ODF implementers can also implement it in their products. I guess that was more or less it. On the last slide I have some bookmarks for you: one is the Writer development wiki page, which has a checklist for new features and bookmarks for the ODF implementer notes; the other is that the sw module has a README with bookmarks for further wiki pages — do read that, we spent some time on filling it with useful content. And from the OpenOffice.org times there are also two other overviews of Writer core that are interesting to read. I guess that was it — thanks for your attention, bye!