Hi everyone, how is the volume? Is it okay? Yeah. Welcome to the LibreOffice conference, and welcome to our presentation about the integration of large language models, in our case into LibreOffice Online. Following Sarah's presentation, we will show you a different kind of integration. My name is Irina, and this is the team that worked on the proof of concept we are going to show you today: Andre, who is right here next to me and who will be presenting most of the technical part, and Stefan, who is somewhere in the back and who helped us with the frontend integration into LibreOffice Online.

Some of you may be wondering what we have to do with LibreOffice and how we are using it. We have actually been using Collabora Online for three years now. It is integrated into our Online Office product in the WEB.DE and GMX portals, and this is just a screenshot of our Online Office tool inside WEB.DE. Very briefly, this is the architecture we have it installed on at Mail & Media: our own Kubernetes clusters in our own data centers. If you are curious about more details on the technical part of that integration, you can watch the presentation my colleagues from Mail & Media gave; it is on YouTube, there is a link here, and if the slides are shared you will find the link and all the details of our integration. I think we were the first to integrate LibreOffice Online into a Kubernetes cluster, so it should be interesting for anyone who wants to know more about this.

Our presentation today, however, is focused on integrating large language models for different use cases in LibreOffice Online. Imagine you have a contract or some other legal document you want to go through, or one you receive without really knowing the legal terms. Or maybe you have an instruction manual and you want to quickly find something inside it to unblock yourself in a certain situation. Or maybe you have technical documentation, say a very big 500-page document about a new language you have to learn because you need to do a proof of concept or build an application very quickly. In any of these cases you can just open the document inside LibreOffice, inside Collabora Online, ask all kinds of questions, and let the language model you have integrated help you. This is what we tried to do, and now Andre will carry on with all the technical details of our proof of concept and also show you the demo.

Okay, thank you, Irina. With the latest advancements in machine learning, which the previous speaker also presented, we can actually do something for the use cases Irina mentioned. By latest advancements I mean large language models. As you already saw in earlier presentations, large language models are language models on steroids. They are very, very big; they have billions of parameters. A parameter is something the model needs to learn, for example a weight that gets tuned during training. GPT-3.5, one of the models from OpenAI, has 175 billion parameters. Some open source models like BLOOM come in different flavors: 176, 7 and 1 billion parameters. Llama 2 comes in 7, 13 and 70 billion parameters, and there are also Falcon and MPT.
For the demo you will see today, we played with all of these models and picked the one that performed best for our use case, which was actually Llama 2 with 13 billion parameters.

An interesting thing about large language models is that they are very, very generic. They do only one thing: they receive an input and give you back an output. So how can we build different use cases on top of something so generic? We do that using prompt engineering. Prompt engineering is a technique to instruct the large language model about what we expect from it, what it needs to do. For example, we have a prompt for summarization: instead of just sending the document to the model, we instruct it, "hey model, write a summary of the following text and give me the summary". And this is a prompt for the chatbot: again, I give the instruction, "hey model, here is a piece of context; use only this piece of context to answer my question, do not invent anything; this is the question, give me the answer". A small sketch of such prompt templates follows a bit further down.

And how does the model know how to answer this? It is a very interesting thing: when you grow the model in size, you get some emergent capabilities, the ones the previous speaker also explained. As the model grows, you get question answering capabilities, summarization capabilities. What is interesting about these capabilities is that they do not rely on the knowledge base of the model; they are just capabilities. This is why I love this "chat with your document" use case: you do not rely on the knowledge base of the model. No matter how good your model is, even if it is GPT-4 with, we believe, trillions of parameters, it is limited and it hallucinates. You do not want to rely on the model's knowledge base; you want to inject your own knowledge base into the model and rely on its capabilities to understand it.

Of course, none of this would be possible without a very big community. You cannot imagine how big the Hugging Face community is. If you go to the platform, there are hundreds of thousands of open source models, not only language models, and a lot of very generic, very easy-to-use libraries: you just switch the model and you have the code up and running. Of course, OpenAI and closed source models sometimes have advantages. One advantage of closed source models is accuracy: GPT-4 or GPT-3.5 is very accurate, for a simple reason, it is very big and it costs money, and it is useful for some use cases, as we also saw earlier. However, there are open source models that are very close to the OpenAI models in terms of capabilities, like Llama 2 with 70 billion parameters, Falcon with 180 billion, and MPT with 30 billion. None of them is as good as GPT-4, of course, but some of them are even better than GPT-3.5, the free version of ChatGPT. Open source models, on the other hand, have a huge advantage, which was a must in our company: privacy. An open source model runs as a Docker image on your infrastructure, in isolation; you simply do not share anything with the outside. You deploy the model in your Kubernetes clusters, scale it up, send the document to the model, and that is it. When privacy is very important, open source models win.
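Coming back to prompt engineering for a moment, here is a minimal sketch of the kind of prompt templates meant above. The exact wording is hypothetical, not the prompts used in the demo.

```python
# Illustrative prompt templates for the two use cases discussed above.
# The wording is a sketch, not the production prompts.

SUMMARY_PROMPT = """Write a concise summary of the following text:

{document}

SUMMARY:"""

QA_PROMPT = """Use only the following piece of context to answer the question.
If the answer is not in the context, say you do not know. Do not invent anything.

Context:
{context}

Question: {question}
Answer:"""

def build_summary_prompt(document: str) -> str:
    return SUMMARY_PROMPT.format(document=document)

def build_qa_prompt(context: str, question: str) -> str:
    return QA_PROMPT.format(context=context, question=question)
```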
This is our integration with the chatbot, and I will show you a demo. For now this integration is accessible only via VPN, so I cannot show you anything live. What you see here is a legal document, and there is this new button, "Chat with document". When you open it, the entire text of the document is sent to an open source model deployed in our infrastructure, you get a summary, and then you can ask questions about your document. For example, here, because this is a legal document, I am asking what parties are involved in it. It responds very well, everything works fine. Then I ask a few more questions, for example which country the document is for: it is for Romania, it is an NDA specifically for Romanian legislation. What is the duration of the contract, or the agreement, I ask here, and of course it extracts it properly from the document. What you see here is all of these questions, plus the document, being sent to a Docker image running the Llama 2 13 billion parameter version. It is not the biggest Llama and it is not the smallest Llama, but we did some tests, some benchmarks and also some evaluation of the quality, and Llama 2 13 billion was the right option for our use case.

Now another document. We tried three documents; the first was the legal document. This one is a story, "The Last Question" by Asimov, a very nice story. Again, you get a summary of the story and then you can ask questions about it. Our main use case here is not code; this model is not supposed to write code, so if you put some source code here and ask the model about it, it will not answer properly. This is for legal documents, for text documents, for chatting in general. Okay, so this was "The Last Question" by Asimov. You will see a little bit later how we do this; I will show you a bit of the architecture, and there are some tricks needed to make summarization and question answering work with a small open source model. Keep in mind that this model has 13 billion parameters, while OpenAI's GPT-3.5 has 175 billion, and for GPT-4 we do not know exactly because it is closed even from the architecture point of view. But this model is quite good for our use case.

And the last document that we tested is a technical document about the Go language. I asked just a few questions related to the Go language, not to write code. We open the icon to ask the model, the text is sent to the model, we get a summary of the document, and then we can ask a few questions about the Go language: what the advantages are, the disadvantages, and so on. And the nice thing about this, again, is that it works with the knowledge base of the user, not the knowledge base of the model. So when I ask here "when was Go released?", the answer does not come from the knowledge of the model, it comes from this document. If I put something else in the document, it will answer based on the document. This is very, very useful, and you will see a little bit later in the slides why.

Okay, so you saw the demo; now the architecture. First we have the generic part. The generic part is a set of Docker images with Hugging Face code that run the models. This is very generic and use-case agnostic: it can be used to implement any of the use cases you saw previously that language models are capable of, like translating, creating content, improving grammar, and of course summarizing and asking your document.
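As a rough illustration of what such a generic model runner can look like, here is a sketch assuming the standard Hugging Face transformers API; the model id and generation settings are examples, not necessarily what runs in our images.

```python
# Sketch of a generic, use-case-agnostic model runner, as packaged in a Docker image.
# The model id and generation parameters are illustrative examples.
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # any causal LM from the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Generic entry point: receives a prompt, returns only the generated text."""
    output = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_full_text=False,  # drop the echoed prompt from the output
    )
    return output[0]["generated_text"]
```

The use case (summary, chat, translation, grammar) is decided entirely by the prompt that gets passed in, which is why the same runner can serve all of them.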
Then we have the use-case-specific part. The use-case-specific part is a set of pipelines written with LangChain that allow us to inject our custom knowledge base into the model. And this part is not tied to our LibreOffice integration. Today you saw the demo with the user's document, but tomorrow, let's say our customer care team has a knowledge base with guidelines about how to answer user tickets: you can inject that knowledge base into the model and create a chatbot for customer support. So it is quite flexible: any use case that needs your knowledge base instead of the model's knowledge base can be implemented with these pipelines. And of course, we have a LibreOffice-specific part, with the UI that you saw, which Stefan worked on.

I just want to detail the pipeline a little bit, and why we built it this way. All large language models are limited in the size of the input: you have a limited context size. For example, the model that we use can receive a maximum of 4,000 tokens; tokens and words are not exactly the same, but let's say they are similar. A user document can be very big, hundreds of pages, so how do you send that document to the model? You don't. What you actually do is index the document into a vector database, and when the user asks a question, you send the question to the vector database and extract from the entire document only the parts relevant to that question. Then you send the model the prompt you saw previously, which says: here is the context, and the context is the chunk extracted from the entire document, and here is the question. So this is the mitigation of the first limitation, the model input size; you will see a small sketch of it a bit further down. As I mentioned, the Llama 2 13 billion model that we use for this demo has a 4k token input. MPT, the MosaicML Pretrained Transformer, another open source large language model, has an 8k token input.

As you saw, the mitigation depends on the use case. For question answering, the mitigation is the vector database. For summarization it is a little bit trickier, because the document can be very big, so you need to do something like map-reduce: you chunk the entire document, you summarize each chunk, and you repeat this process until you get the final summary. This is one of the techniques to mitigate the input limit.

So that was limitation one of four. Limitation two: hardware. Models eat a lot of GPU memory. For example, a 7 billion parameter model is around 30 gigabytes, and a 70 billion parameter model around 160 gigabytes. We also deployed Llama 2 70 billion, which takes well over a hundred gigabytes, and that is a lot. How can you mitigate this? Of course, you can buy better or more hardware, but that is not always an answer, right? It costs money. There is a very interesting mitigation called optimizing the models for low memory. There are optimization techniques, not detailed here, that allow you to run big models on modest infrastructure, for example ggml and quantization. Quantization means loading the weights of the model as 8-bit integers instead of 32-bit floating point numbers, and suddenly the model becomes much, much smaller. Of course, depending on the quantization method, you get some penalty on accuracy. There is no free lunch.
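To make the memory mitigation concrete, here is a rough sketch of loading a model with 8-bit weights through the transformers and bitsandbytes integration; the model id is an example, and ggml/llama.cpp is another route to the same effect.

```python
# Sketch: load a model with int8 weights to roughly quarter the memory footprint
# compared to 32-bit floats. Requires the bitsandbytes package; model id is an example.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 instead of fp32
    device_map="auto",
)
```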
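Going back to the input-size limitation, here is a minimal sketch of the vector-database approach described above. The library choices (sentence-transformers plus FAISS) and the chunk size are illustrative, not necessarily what is in our pipeline.

```python
# Sketch: index the document into a vector store, then retrieve only the chunks
# relevant to the question. Embedding model and chunk size are example choices.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text: str, size: int = 500) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(document: str):
    chunks = chunk(document)
    vectors = embedder.encode(chunks)
    index = faiss.IndexFlatL2(int(vectors.shape[1]))
    index.add(vectors)
    return index, chunks

def retrieve(index, chunks, question: str, k: int = 3) -> list[str]:
    """Return only the parts of the document relevant to the question."""
    q = embedder.encode([question])
    _, ids = index.search(q, k)
    return [chunks[i] for i in ids[0]]

# The retrieved chunks then go into the question-answering prompt template
# shown earlier, together with the user's question, and are sent to the model.
```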
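And here is a sketch of the map-reduce style summarization for documents that do not fit in the context window; `ask_model` stands for any call to the model, and the chunk size is an example value.

```python
# Sketch of map-reduce summarization for documents larger than the context window.
# `ask_model` is any function that sends a prompt to the model and returns text.
def summarize(text: str, ask_model, max_chunk_words: int = 2000) -> str:
    words = text.split()
    if len(words) <= max_chunk_words:
        # Fits in one call: summarize directly (the final "reduce" step).
        return ask_model(
            f"Write a concise summary of the following text:\n\n{text}\n\nSUMMARY:"
        )

    # "Map" step: split into chunks and summarize each chunk on its own.
    chunks = [" ".join(words[i:i + max_chunk_words])
              for i in range(0, len(words), max_chunk_words)]
    partial_summaries = [summarize(c, ask_model, max_chunk_words) for c in chunks]

    # Repeat on the concatenated partial summaries until everything fits.
    return summarize("\n".join(partial_summaries), ask_model, max_chunk_words)
```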
Okay, limitation three of four: speed and latency, and here there is something really unbelievable: vLLM. vLLM is a technique to optimize the serving of large language models that gives a big improvement in latency and roughly 25x the throughput. Those are the numbers from public benchmarks, we also tested it internally, and it really is like that: you get a large latency improvement on the model. What you saw earlier in the demo was using the vLLM version of the model, and it is really fast. Of course, it takes a lot more memory to run the model this way, because of the optimizations vLLM does around memory, paged attention and batched processing of requests. Just as a comparison, this is the plain Hugging Face framework versus vLLM in terms of throughput, in requests per minute.

And the last limitation is hallucination and wrong answers. This is the second reason why I really love this use case: when you rely on the knowledge base of the model, the model shows a high level of hallucination. Hallucination means the model invents answers when it does not know, and what happens then is that you get a wrong answer, right? We have mitigations for that. The first mitigation was that we did not use the knowledge base of the model; we used our own knowledge base, and the hallucination level dropped. However, you can still get hallucinations, and there is an unbelievably simple, very useful method to deal with them: when you return an answer from the model, you also return the source. It sounds trivial, but imagine you have a big document with a lot of chapters, or a big PDF user guide: you get an answer from the model and you also get the chapter, or the URL of the source, and the user can fact-check the answer. They can say "okay, this is bullshit, it is a hallucination", or "okay, yes, this chapter actually answers my question". So it is simple, and it is a way to move the detection of hallucinations to the user instead of the model, which is very nice.

There are many other options to reduce hallucination: better prompts, better pipelines, less creativity. If you reduce the creativity of the model and do more benchmarks, you get less hallucination. And of course, keep a consistent knowledge base: if you have a user guide for your customer support and the information is spaghetti, with facts that are not consistent, you will get inconsistent answers. The last mitigation for hallucination is to use something that, we suspect, GPT-4 also uses: a mixture of experts. It means using different models for different use cases. For example, we can use MPT for summaries, because MPT has an 8k token input length, and Llama 2 for question answering, because it is faster, it is slim, and it is smart.
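Returning the source together with the answer is easy to sketch; the chunk structure and helper names below are illustrative, not our exact code.

```python
# Sketch: attach the source to every answer so the user can fact-check it.
# `ask_model` is any function that sends a prompt to the model and returns text.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str  # e.g. chapter title, page number, or URL

def answer_with_source(question: str, relevant: list[Chunk], ask_model) -> dict:
    context = "\n\n".join(c.text for c in relevant)
    prompt = (
        "Use only the following context to answer the question. "
        "Do not invent anything.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return {
        "answer": ask_model(prompt),
        "sources": [c.source for c in relevant],  # shown to the user next to the answer
    }
```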
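And, coming back to serving speed for a moment, this is roughly what running the same model through vLLM looks like, assuming the standard vLLM Python API; the model id and sampling settings are examples.

```python
# Sketch: serve the model through vLLM instead of plain Hugging Face generation.
# vLLM batches and schedules requests itself (paged attention, continuous batching),
# which is where the throughput gain comes from. Values below are examples.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")
params = SamplingParams(temperature=0.1, max_tokens=256)  # low creativity

prompts = [
    "Write a concise summary of the following text:\n\n<document text here>\n\nSUMMARY:",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```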
Very quickly, the technologies we used: open source models, BLOOM, MPT and Llama 2; LangChain; Hugging Face; Docker; and Python, for the reasons you saw earlier, and also because Python, together with R and C++, is the language used for large language models. Next steps: optimization. We need to do some more optimization on the use cases that you saw, and some benchmarks, benchmarks that will then be used to actually buy the infrastructure or scale up the models. And yeah, everything sounds nice in our heads, but we need feedback from real users, so we need to expose these features to some beta users and see how they judge the security and the usefulness of all this. And of course, none of this would be possible without the community we build on. Thank you. So, do we have any questions?