Hello everyone, my name is Michael Stahl. I work for Allotropia, and I want to tell you a bit about the work I've been doing during the past year related to PDF and universal accessibility.

First, a definition of accessibility, which, as it happens, you already saw a couple of hours ago in a talk by another Michael, who also happened to talk about accessibility. For the purpose of this talk, we want to talk about document accessibility. The problem here is that LibreOffice produces PDF documents, and we want these documents to contain the necessary data so that a PDF reader, and a screen reader that talks to the PDF reader, can make the document accessible to visually impaired users.

So first, what relevant documents and standards are there that can help us in this task? Of course we start with the standard for PDF itself, which was initially developed by Adobe, but these days it is actually an international standard. The most recent version came out a couple of years ago, and I recently found out that there is an organization called the PDF Association that makes this ISO standard available as a so-called sponsored standard on their website. So you can give up a bit of your personal data and download the standard document free of charge. The second relevant standard is ISO 14289, which is the PDF/UA standard. That one, unfortunately, you have to buy from ISO for a lot of money. But it is really important, because it contains many requirements for what an accessible PDF document should look like and should contain.

Other interesting documents come from the PDF Association. There is the so-called Matterhorn Protocol, which is essentially a checklist of many criteria; you can evaluate your PDF document against them, and it is accessible if it meets them all. And then there is the Tagged PDF Best Practice Guide, which also contains many hints about what exactly the tag structure in the PDF should look like.
It fills in some of the gaps that the ISO standard leaves open, and so on. And then, from another corner, from the W3C, there is the Web Content Accessibility Guidelines document, which is mostly about HTML and such things, but it also has some hints regarding PDF.

Another thing that is very useful is that there are validators, where you can check whether your PDF document implements the recommendations and requirements successfully. Firstly, there is veraPDF, which is actually open source and conveniently available as a Flatpak from Flathub. Its user interface is a bit clunky, a Java Swing kind of thing, which you can see here on the slide. Essentially, you choose the PDF file, then in the flavour box you select PDF/UA, you click the Execute button, and it produces an HTML file with some warnings or errors.

But what's important to notice is that a validator is not a panacea. As a matter of principle, the validator can only detect whether your PDF document is inconsistent. The actual problem of document accessibility is that the source document has some sort of semantically meaningful structure, and you have to map that semantic structure into the PDF document. The validator, of course, doesn't know what the structure of the source document is. So there are many problems that can still be there even if the validator says you have no issues. But it is useful.

Then there is the other validator that I've used. Actually, there are more validators than these, and I've been told that some of them are even better, but I have used only these two. This one is called PDF Accessibility Checker, and it's only available for Windows. The main user interface looks like this: here on the left you have a basic overview of any warnings that have been found, and then there are some additional features. With this "Results in Detail" button, a window pops up and you can see every individual warning. Then you have a screen reader preview, which
essentially shows you the document structure without any graphical fluff around it. And the most useful feature is this "Logical Structure" button, which opens the dialog here on the right-hand side, and this essentially shows you the structure tree.

So 95% of PDF accessibility is that there is a tree of structure elements in the PDF document, that this tree maps to the structure of the source document, and that the elements in the tree are in the logical reading order of the document. What you can see in this example is that there is one element selected here, this span element, and in the top right corner the text on the page that corresponds to this span element is highlighted. So you can very easily see what your tree elements refer to.

You can also see from the tree that the parent node of the span element is something called Standard, and this is a paragraph element. We name the paragraph elements in PDF after the paragraph style, which in this case is Standard. If you look a bit further up, there are a couple of footnotes. For footnotes there is a special structure, where we first have the label of the footnote in a label element. The label element contains a link element, and the link element contains a link annotation. That means you can click on the footnote number in your PDF reader and it will jump to the anchor of the footnote. What follows is a paragraph with the paragraph style Footnote. So that's what the structure tree looks like.

Most of the work was actually that, while LibreOffice already produced the PDF structure tree, there were many details that were wrong, or things that were missing or incomplete. We already had this PDF accessibility meta bug, where many volunteers, and also Gabor, did a lot of work to find these issues over the years, so thank you all for that. And this is just a list of the commits that I did, which is, I don't know,
60 or something like that, to fix all of them. I won't, of course, go into detail here.

We have also added a new feature. Microsoft Office 2019 already had this, and you can now also mark your floating objects as decorative in LibreOffice. That essentially means that these floating objects will not appear in the PDF structure tree; they will instead be tagged as artifacts. You can set this with the dialog on the right, in Writer, on Writer frames, images, embedded objects, and also on the frame styles. If you enable this Decorative box, then you can see here that the description and text alternative are disabled, because if the element is just decorative, then those do not make sense. We have also added this feature for drawing objects and shapes, and with this dialog here you can set it also in Calc and in Impress and Draw.

OK, so some of these structure tree additions were rather complicated. One example is the Writer lists and the label element that should contain the number of a list item. It turns out that the text formatting code is very complicated, and the data structures it uses are rather non-obvious in some places, with lots of subclasses and so on. And the way the PDF is actually produced is by painting the text, so the PDF is produced from the document view and not directly from the document model, whereas the structure tree has to correspond more to the document model.

So it was quite clear when I added this label element export that I was probably going to get it wrong; there are a lot of special cases that I don't know about. So what I did is I added some simple asserts about rather obvious things: if there is a numbering portion inside of the text formatting data structure, then we have to open the label element once, and if we have opened it once, then we have to close it once, and so on. Simple things like that. And then we have this crash testing server, where every couple of days some tens of thousands
of documents are automatically loaded and exported, and, well, this hit a lot of these special cases. Eventually, over the course of a few months, I was supplied with all of the documents that actually contain the special cases, and then I could fix them. So this worked out rather nicely. I would have maybe thought of a third of these special cases ahead of time, with a lot of effort, but never all of them. It's just really non-obvious things: you would think that the first line of a paragraph would contain the numbering, but actually the first line can contain only some special portion, where the paragraph is blocked by some other floating object, and non-obvious things like that.

Another very difficult problem was the media shapes. We can have embedded videos, essentially, in the document, and there are some requirements for what this should look like if you do it well in PDF. So I made this diagram, where every one of these boxes is a PDF object. At the top left there is an object for the page in the PDF file, and this points to another object, which is the page content stream. This is essentially the PDF drawing commands that put everything on the page that you see when you look at it in a PDF reader.

Then in the middle there is an annotation object. To actually make this into a video, you have to use an annotation object; this is a screen annotation. It contains some metadata, like what MIME type this thing is and what its dimensions are, and a pointer to some other object which contains the actual binary data of the video. This enables the PDF reader to actually play the video.

And then on the right-hand side you can see there is a structure element for the paragraph where the shape is anchored. For floating objects, the way it works is that the floating object is a child, in the structure tree, of its anchor paragraph. And in the lower right
there is a structure element for the annotation itself. The problem here was that this structure element for the annotation was missing. In order to add it, we have to have pointers to both the page and the page content, although of course not to the whole of the page content: somewhere inside this list of drawing commands there is this thing here, a node with a property MCID; this is marked content. So somehow this structure element must reference this marked content ID 0, and what is here between BDC and EMC is what is actually drawing the thing on the screen. In addition to that, we need a pointer from the structure element to the annotation, and also a pointer back from the annotation to the structure element. And the structure element also needs to point to its parent in the structure tree, and of course the other way around as well.

OK, now why is this difficult? The problem, I think, is metafiles. How does Writer actually generate a PDF file? Firstly, Writer creates all of the annotation objects that are in the document, with the help of a special VCL class called PDFExtOutDevData. That is the first step. The second step is that Writer paints its own paragraphs and so on with the VCL API, and this is recorded into a VCL metafile, together with this additional PDF data class. The third step is that Writer paints the floating objects. The floating objects are handled via the draw page, which creates a lot of drawing layer primitives, and then the drawing layer paints these primitives via the VclMetafileProcessor2D, and this is all recorded into the same VCL metafile. And then the last step is that VCL replays the recorded metafile, which is now finished, together with the additional PDF data, to a PDF writer class.

Now, I looked at this and was wondering what value the metafile is actually adding to any of this. I just asked Armin, like an hour ago, and he said: nothing, it's complete nonsense.
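The cross-references described for the media shape can be sketched as a small object graph. This is a simplified Python model, not LibreOffice code: the class names, the `add_kid`, `find_problems` and `build_example` helpers, the `Annot` element name, the MIME type and the sample content stream are all assumptions made for illustration. Only the relationships come from the talk: the structure element references a marked content ID and the annotation, the annotation points back, and parent and child point at each other.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass(eq=False)  # compare by identity; the object graph is cyclic
class Page:
    content: str  # page content stream: the drawing commands
    annots: List["ScreenAnnot"] = field(default_factory=list)


@dataclass(eq=False)
class ScreenAnnot:
    mime_type: str
    # Back-pointer to the structure element. A real PDF stores an integer
    # key into the structure tree's parent tree here; a direct reference
    # stands in for that lookup in this sketch.
    struct_parent: Optional["StructElem"] = None


@dataclass(eq=False)
class StructElem:
    kind: str
    page: Optional[Page] = None
    parent: Optional["StructElem"] = None
    # Kids are child elements, marked-content IDs (int) or annotations.
    kids: List[Union["StructElem", int, ScreenAnnot]] = field(default_factory=list)


def add_kid(parent, kid):
    """Attach kid to parent and set the matching back-pointer."""
    parent.kids.append(kid)
    if isinstance(kid, StructElem):
        kid.parent = parent
    elif isinstance(kid, ScreenAnnot):
        kid.struct_parent = parent


def find_problems(elem):
    """Return a message for every broken cross-reference at or below elem."""
    problems = []
    for kid in elem.kids:
        if isinstance(kid, StructElem):
            if kid.parent is not elem:
                problems.append(f"{kid.kind}: wrong parent pointer")
            problems.extend(find_problems(kid))
        elif isinstance(kid, ScreenAnnot):
            if kid.struct_parent is not elem:
                problems.append(f"{elem.kind}: annotation does not point back")
            if elem.page is None or kid not in elem.page.annots:
                problems.append(f"{elem.kind}: annotation is not on the page")
        else:  # an integer marked-content ID
            if elem.page is None or f"/MCID {kid}>>" not in elem.page.content:
                problems.append(f"{elem.kind}: MCID {kid} not in page content")
    return problems


def build_example():
    """Build the objects from the diagram: page, annotation, two structure elements."""
    # Hypothetical content stream: marked content with ID 0 between BDC and EMC.
    page = Page(content="/Span <</MCID 0>> BDC q /Im1 Do Q EMC")
    video = ScreenAnnot(mime_type="video/mp4")
    page.annots.append(video)
    paragraph = StructElem(kind="Standard", page=page)
    annot_elem = StructElem(kind="Annot", page=page)
    add_kid(paragraph, annot_elem)  # floating object is a child of its anchor paragraph
    add_kid(annot_elem, 0)          # reference to marked content ID 0
    add_kid(annot_elem, video)      # reference to the annotation object
    return paragraph, video


paragraph, video = build_example()
print(find_problems(paragraph))
```

In this toy model every pointer is set in one place, so the check passes trivially. The difficulty described in the talk is that the real objects are created in different steps of the export, so the code that needs to set a pointer may not have the target object at hand.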
The reason why the metafile exists here is that the PDF export is older than the drawing layer. That's the whole reason. So if we go back here: this thing in the middle is created in step one, this part here is created in step three, this part is also created in step three, and this one is created in step two. So somehow you need to find these other objects that were created at a completely different time. So this took me some time to get working.

OK, so what other changes were there in LibreOffice 7.6? Everything related to the PDF export improvements should be in the 7.6 version, though not necessarily in 7.6.0. We needed to change the default version of PDF that is exported to version 1.7, because it turns out that some of the attributes that were needed in the structure tree are actually new in that version. And PDF 1.7 is like 15 years old today, so we didn't see any problem with that.

Then we changed the Tagged PDF option so that it is enabled by default. Tagged PDF enables the creation of the structure tree; if you turn that off, you don't get the structure tree in the PDF file. As I said, 95% of this is just to get the right structure tree there, which of course adds a bit to the file size, but we think the cost is worth it to enable it by default.

Then there is an additional option where you can turn on PDF Universal Accessibility. This actually changes very little in the PDF file. What it does is that it adds a PDF/UA tag to the PDF metadata, and in the dialog it disables various other options that are incompatible with accessibility. For example, at the bottom there is something about reference XObjects; I don't know what that actually does, but it's not allowed, so we turn it off. And then, when you click the Export button, it runs the accessibility check, which is another dialog that shows you warnings about the document content, because it turns out the PDF export is not solely responsible for
the accessibility of your PDF file. If the user just puts something in that is poor, then the PDF export cannot magically fix it. For example, there is a requirement that the first heading has to be at level one, and if you start your document with a heading at level five, then, well, it's going to be bad, so the dialog will warn about it.

If you want to know more about the accessibility check dialog, you can go to this talk by my colleague Balazs, which is in 90 minutes or so today, about accessibility improvements in Writer. And another one of my colleagues is also talking about the accessibility check, namely how he made it possible to have it not only as a dialog but also in the sidebar, which may be more convenient. That is also today. So I was not the only one who was working on this.

And of course none of this would have been possible without funding by our customer Dataport, so thank you for that. And that was all. Do we have time for questions? Oh, we must have time for questions.

Audience: ... using the tag tree to then reconstitute document structure when a PDF is imported. So the first question is whether you personally, or other people you know who have had experience with the tagging, have started or are planning to work on improving the import filter to use this information. And the second question is whether we can use such information even if the PDF has not been generated by LibreOffice, so if we can use a tag structure that maybe comes from Microsoft Word or other common PDF generators.

Speaker: OK, so the structure tree contains a lot of the structure, but it does not contain everything. In particular, with regard to formatting information, most of that is missing. For example, for spans inside of a paragraph we currently export the language, whether it's superscript or subscript, and whether it's underline, strikeout or overline; I think that is all. And if you look at the Format Character dialog, there are
a lot more things you can do there.

Audience: Without tagging, it doesn't even have the information of what's within the same paragraph; it has to guess, or partition into lines.

Speaker: Yes, yes, of course. So you would get a much better result than if you had no structure tree in the PDF file, but you should not expect this to be a perfect result.

Audience: Another question: where can we experience how this tagging information is used, other than programmatically? I mean how it would appear or be communicated to people with disabilities, seeing that most of us don't have those more serious disabilities.

Speaker: You can just use PAC, which can show you the structure tree. I guess that is the most obvious way. I think Adobe Acrobat also has similar features; I have never used it.

Audience: But suppose I'm blind; obviously this won't help me. As a non-blind person, I want to experience how a blind person would experience the tagging, in a reader or whatever is used to make PDFs accessible.

Speaker: Well, you could just set up a PDF reader and a screen reader, and then I think you just have to try it out. How do you say that: the proof of the pudding is in the eating, or something like that.

Audience: What I'd recommend is checking out the DAISY YouTube channel, that's D-A-I-S-Y, all in caps. The YouTube channel has a lot of videos on it.

Speaker: Thanks.

Audience: Is there also some kind of virtual team while we are editing the ...?

Speaker: At the moment, no; you just have to look at every single one.

Speaker: I believe they do, but to be sure, ask Balazs later.