Well, anyway, I will start then, because this part is not so important: just a couple of words about me. I participate as a volunteer in several open source projects in my spare time, because in my real life, my everyday life, I am a developer and a scientist, and I do not do anything for open source projects in my day job except use them, including the office suite. In my spare time I am a long-time volunteer in Apache OpenOffice, where I also served as project chair a few years ago. And I am still focused on data visualization, text mining and data journalism as part of my job, so I am trying to bring that into OpenOffice and see what we can do with it.

So, well, enough about myself, and let's really try to respect the schedule. Text mining in general covers several tasks. Text categorization: assigning a document to a certain class. Automated summarization of documents through machine learning: trying to find a shorter way to say the same things, or finding out which sentences are the important ones. And sentiment analysis, which is much more common on social media these days: understanding whether a sentence is positive, negative or neutral with respect to a certain topic. But there is a more basic form of text mining that we can play with, even without specialized tools: statistics on text, like counts of lines and sentences and the most common words. These are concrete indicators of writing style: people try to profile writers and define their writing style based on numbers like these. It is possible to build a fingerprint of an author by checking how frequently they use certain words or patterns, and then recognize that pattern in a text. And the most basic form of text mining is visualizing the text in the form of a word cloud.
In general we would use Python as the preferred language, because it is the one with the richest library set for data science as a whole, and for text mining specifically. For the advanced use cases you would need specialized libraries and high computational power, but for the basic use cases a local machine is more than enough. A typical example of a basic text mining exercise would be: read a text file, process the text appropriately, and then produce a word cloud, a picture that represents the most common words we have seen. This is an exercise I did with my students: we produced an animated word cloud, with the vocabulary selected around the topic of the week. In the animation, words become larger or smaller according to the popularity they had in recent weeks, so you can capture the news trends visually.

But here we are trying to see what we can do using OpenOffice as the tool. That example was done without any usage of OpenOffice; here we rely on built-in features only. Python is available within OpenOffice: we have the Python interpreter shipped inside the office suite, and we have the Python-UNO bridge, which basically allows the two parts to talk to each other, so you can use the text of the document from Python. I am not going back to the previous slides; let's just start from here.

We have Python in Apache OpenOffice, and we are trying to do some simple text mining within OpenOffice. Getting the text is extremely easy: it is just one line of code. You go to the user's Scripts folder, or wherever you want to put your Python script, and the first part just gets the text of the current document. It is trivial; I can show it in a proper text editor, but basically it is just a few lines of code, and then you have the text available. From then on we do all the processing using the internal Python, and we only go back to OpenOffice for the visualization, the very basic visualization we can have within Writer. Writer has a lot of limitations.

So what do we do in the Python environment? Tokenization: we strip out punctuation, the text is converted to lower case, and the text becomes a list of words. We remove stop words, because otherwise, for any text we analyze, we would see that "the", "and" and "or" are the most used words, and of course we do not want them to skew the statistics. Then we filter to remove noise: a word that appears only once can be removed. This allows us to take a very long text with a great variety of words and condense it into a couple of hundred relevant words. I do not want to waste time on this, but again it is a few lines of code. There are some tricks to properly extract the words, such as replacing everything that is not a letter with a space. Then with the Counter object in Python we can very simply summarize our list and extract the frequencies, so our list of words with repetitions becomes a native Python data structure mapping each word to its count: the word "computer" appears 34 times, and so on. From there it becomes trivial to assign a popularity score to each word, to single out the ones that are really popular and the ones that are less so. And this is what we do here.
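To make that concrete, here is a minimal sketch of the pipeline just described, assuming it runs as an OpenOffice Python macro so that XSCRIPTCONTEXT is available; the stop-word list is a tiny hand-made one, purely for illustration.

    import re
    from collections import Counter

    def word_frequencies():
        # Get the full text of the current Writer document via the Python-UNO bridge.
        doc = XSCRIPTCONTEXT.getDocument()
        raw = doc.getText().getString()

        # Tokenization: lower-case, then replace anything that is not a letter with a space.
        words = re.sub(r"[^a-z]+", " ", raw.lower()).split()

        # Remove stop words (hand-made list here; real code would use a longer, localized list).
        stop_words = {"the", "and", "or", "of", "to", "a", "in", "is", "that", "we"}
        words = [w for w in words if w not in stop_words]

        # Count occurrences and drop the noise: words that appear only once.
        counts = {w: n for w, n in Counter(words).items() if n > 1}

        # Normalize counts to a 0..1 popularity score (a min-max stretch, as discussed later).
        lo, hi = min(counts.values()), max(counts.values())
        scores = {w: (n - lo) / ((hi - lo) or 1) for w, n in counts.items()}
        return counts, scores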
The stop-word removal was already done as part of the previous step, and here we basically just generate the counters and compute the frequencies. Now we have to go back to OpenOffice. This is the slightly tricky part, but it is still of an absolutely manageable complexity. Let's say we start with an OpenOffice document and we want to append the visualization, a kind of word cloud showing the word frequencies, to the end of this document. Then we use a cursor. This is documented in the APIs, and it is the missing link that brings our Python objects back into something we can use in Writer. We create the cursor, we go to the end of the document, since we want to append the word cloud to the existing text, and with the cursor we also apply style information; we will see it in a minute.

Style information, why? Because Writer is very, very limited. For example, all text in Writer has to sit on lines, while a word cloud is in general a much freer visualization, with words that can be placed anywhere. Writer does support floating objects, but that would be total overkill for our example; if we wanted that, we would rather move to the Draw engine and generate an Impress slide. Here, if we want to stay within Writer and have everything done in 20 lines, we just set these properties on the text we are going to append at the end, and for each word we compute the right color and size, so that it is a little bit nicer. All of this literally takes 20 lines of code, which is what is shown here: if you remove the imports, it is basically a word cloud generator in 20 lines of code. The projector has the wrong resolution, so it cannot show it properly, but you go from here by executing the macro. This is the famous inaugural speech of President Obama, and if we want to extract the key topics from it, we just run the macro, and this is what we get. It is still extremely simple, but one gets the idea in Writer: you get "America", "American", "nation"; you can single out the most frequent words. And this is as far as you get in Writer with an absolutely simple approach like the one we used.

Possible improvements, and how to use other functionality: there is a lot of stuff we could improve, starting from this very basic use case. The first is of course appearance; right now it is too static, and one would basically start by creating something with the Draw or Impress engine rather than the Writer engine. Once you are there, you have native floating objects that you can place anywhere, but with that comes complexity. Right now this word cloud is just a stream of words, one after the other, where we apply the appropriate style to each word. In Draw we would have to compute a bounding box for each word and make sure they do not overlap too much. So it is a substantial amount of new work, but that would be the way to do it: basically we would need a layout algorithm, which currently we just don't have, since we simply dump everything out in one flow. That is for appearance. But we are inside a word processor, and this means we can improve the quality of our word cloud in ways that would be quite hard to do in other contexts. For example, OpenOffice knows about stemming, meaning recognizing, in our example, that "America" and "American" are basically the same word, that they share the same root.
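As a rough idea of what the Writer side of those 20 lines can look like, here is a sketch (not the exact macro shown in the talk) that assumes the 0-to-1 scores computed in the previous sketch:

    from com.sun.star.text.ControlCharacter import PARAGRAPH_BREAK

    def append_word_cloud(scores):
        # 'scores' maps each word to a 0..1 popularity value (see the previous sketch).
        doc = XSCRIPTCONTEXT.getDocument()
        text = doc.getText()

        # The cursor is the bridge back from Python data into the Writer document.
        cursor = text.createTextCursor()
        cursor.gotoEnd(False)
        text.insertControlCharacter(cursor, PARAGRAPH_BREAK, False)

        for word, score in sorted(scores.items(), key=lambda item: -item[1]):
            # Direct formatting on the cursor: font size from 10pt to 40pt,
            # color shifting from grey towards red as the score grows.
            cursor.CharHeight = 10 + 30 * score
            cursor.CharColor = (int(0x50 + 0xAF * score) << 16) | 0x5050
            text.insertString(cursor, word + " ", False)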
And we can do that, since the spellchecker within OpenOffice already does it. I haven't tested it, but if it were possible to interface with the Hunspell engine that we use for spell checking, this would automatically improve the quality by reducing each word to its most common form. Technically this is stemming and lemmatizing; lemmatizing means choosing the right representative between "America" and "American", for example.

Another thing that we could do to improve this further is using synonyms, because the word cloud inherently has a big limitation: it mixes up the signifier and the meaning. The signifier is the word you use to express a concept, and the meaning is of course what you want to convey. So if I say "brilliant" in one place and use another word to actually say the same thing, the two frequencies will be counted separately, and there is no way of knowing it. But we do have a synonyms engine, so we could resort to it, interface with the thesaurus, and try to cluster words with the same meaning. Again, this would be the same process we have seen before, except that we are now moving into the semantic world, trying to figure out what a word actually means and not just how it looks.

Another improvement that would be possible in theory is part-of-speech tagging. We do have limited support for grammar checkers. Part-of-speech tagging means finding out whether a word is a noun or an adjective, like in "a good boy" versus "for the common good". This would allow us to restrict our word cloud to nouns only, for example, making it much more homogeneous and much nicer to read. The appearance would be the same, but the meaning to a reader would be much richer. This one is quite hard; the rest is probably feasible, but here we are getting into something that becomes harder to do. Going back to stuff that we could do rather easily: sentiment analysis and similar things would probably be doable too, decoupled from the word cloud, based on word lists. One creates word lists with known bad or good connotations, and this could be implemented with a similar statistical approach outside of the word cloud. And okay, we rushed a bit, but we are within the time, so that's it. Thank you. If you have any questions or want to see the code or anything, please go ahead.

Question: do you have a range when you assign the ranking to the words in the actual implementation? You give a ranking from 0 to 1; is there a limit to the ranking, meaning that it is just one decimal, for example? At the moment I simply take the data and compute the maximum and the minimum, so I assign 0 and 1 to cover all the font sizes I want to cover. It depends on the number of words, exactly; I take the text and stretch it so that it is always between 0 and 1.

Another question: I have to say I love to use styles, I am obsessed with styles, so the first thing that struck me is that you have been talking about styles, but if I am not wrong, you were doing direct formatting. Yes, sure; one could prepare character styles, indeed. But I think that could only apply if you have a limited number of styles, which is why I asked the previous question; otherwise I think it would be completely unusable, and it is much better to use direct formatting. Yes, if you want to use preset styles, then you would
need, I mean, something to map categories, to map intervals to a certain style, yes. But it may apply, probably, to future implementations of sentiment analysis, for example, where you may use some kind of predefined style. Yeah, sure, we could try something like that. Any other questions? Okay, we have time for one or two more.

Question: could I have a...? Yes, there is more, but okay, this is the list of stop words, and as you can see the proof of concept is not really complete: here, for example, words like "with" and "you" would still have to be removed. And are there library functions to get this in a language-dependent way? Yes: with Python libraries, if you can leverage the full Python ecosystem, then you do not need to hardcode stop words as a list of words you make up yourself. There are existing stop-word lists in most text processing libraries for Python, and they are localized into many languages. I hardcoded them only because I was doing it in 20 lines, but with the same approach that is perfectly feasible.

Question: in your code, what is this XSCRIPTCONTEXT? Is it the interface to OpenOffice? That is the bridge. You are in Python, and you run this within Writer, but this is the bridge that gets the text from the Writer document into Python, and the cursor is the one doing the opposite thing, taking what you have computed and inserting it into the document. Those are the two directions, and the API is documented; it is all there for you.

Question about the grammar checker: no, the grammar checker is not native in OpenOffice; it is available through extensions, and the extensions use either Java or Python, but it is not a native engine, and I do not know what they use internally.

Okay, it is time to leave the floor to the next speaker. The nice thing is that everything gets recorded, since it is a Wi-Fi microphone. Thank you very much.
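As a follow-up to the character-styles exchange above, here is one possible sketch of mapping score intervals to a small set of predefined character styles instead of direct formatting; the style names are made up and the styles would have to be created in the document beforehand.

    def style_for(score):
        # Map a 0..1 popularity score to one of a few predefined character styles.
        # These names are hypothetical; the styles must already exist in the document.
        if score > 0.66:
            return "CloudLarge"
        if score > 0.33:
            return "CloudMedium"
        return "CloudSmall"

    def append_styled_cloud(scores):
        doc = XSCRIPTCONTEXT.getDocument()
        text = doc.getText()
        cursor = text.createTextCursor()
        cursor.gotoEnd(False)
        for word, score in sorted(scores.items(), key=lambda item: -item[1]):
            # CharStyleName applies a named character style instead of direct formatting.
            cursor.CharStyleName = style_for(score)
            text.insertString(cursor, word + " ", False)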