Oh, that's awesome. OK, so I'm going to let you continue with your presentation. Awesome. Yeah, all right. So let's start: 15 things you should know about spaCy. I'm missing the crowd interaction here a little bit, since usually I would ask about your experience and what you know about spaCy, just to get a grasp of the experience in the room. So I'm just assuming there are at least some people here who are not super experts in natural language processing.

So let's start with a little bit about what natural language processing is. You probably read a lot about AI, natural language processing, speech assistants, Alexa, and all these things in the news and the media, and you likely use one of these services on a daily basis. Natural language processing, abbreviated as NLP: basically, there's a lot of data out there, written text, spoken text, whatever. It's a real avalanche of data, especially since we have the internet; there's more and more of it. It's not just books and letters and postcards you write to each other anymore.

In general, there are two types of data: structured and unstructured. Only 20% of the data we have is assumed to be structured. The vast majority, the 80% you see on the lower side, is unstructured: text, any verbal or sign communication, pictures, movies, x-rays, pink noise, singularities. All of this is unstructured data, and it's the vast majority. In fact, we don't really know; it's just an estimate, the old 80/20 estimate, which usually works really well.

Some applications of natural language processing you are familiar with: chatbots, which are really hyped, also called dialogue systems; machine translation, which is pretty good for many languages nowadays; sentiment analysis, so how do people feel when they write, including trying to catch sarcasm; and speech to text and vice versa, so dictation, or having the computer read a text to you. I think one of the classics is actually spelling and grammar checking; you have probably had that forever in your text editor. And there are other things like text completion: you start a sentence and the computer suggests how to continue.

There is a lot in text data: knowledge, thoughts, philosophy, guidelines, laws, regulations, poetry, love letters. Everything is text data. In the past, linguists tried a more rule-based approach: languages have a grammar, and building on grammar you can construct rules and a lot on top of them. So analyzing or working with text was a lot of handwork, because you have to write all these rules. Or rather, not make them up: you have to research how the language is actually being used. And another big challenge is that language is changing all the time. There are always young people inventing new words or developing their own language. In Germany, for example, we adopt many words from English into our own language. So language is always changing, and it's really hard to keep up with making all the rules, doing all the research, and putting it into a formal shape. And then there's the statistical approach; you see this picture of a field of data with many flowers and blossoms, and guess who won? Of course, the dog won. I didn't find a better picture for this, so I used Ion Oswald's dog. Thanks, Ion.
So, natural language processing use cases. You probably hear a lot about them: you can auto-write articles, that's robo-journalism, and is robo-journalism a threat? Are robo-generated texts and fake news a threat or an issue? And of course there are all these voice assistants, which you mostly just use to start a timer, which I did, or to do other simple stuff. Most of the time, our expectations of natural language systems are a bit exaggerated; we expect more than we get, and of course there's a lot of hype going on. But most NLP use cases are not trying to find a global solution. Think of something a little more like: OK, you are an enterprise, you have a knowledge base, you want to see what you have researched in the past and organize that knowledge.

So I would like to briefly show you a small use case we did in the NLP space at Königsweg. The task: our client had many, many text documents from decades of research, and they wanted to know what they actually had at hand. It was too many documents to read, and a search engine is not good enough, because you have to look for specific keywords, and the language used was probably different over the years. It wasn't really working for them; they couldn't really search their own knowledge. So we used NLP to cluster the data and reorganize it, to make the data accessible to them again.

We had automatic keyword generation, as you see here on the lower left side, where we have two clusters. The corpus in this screenshot is actually about biodiversity and recycling. We also used automatically generated summaries, so it's really easy to go there and see what's in, for example, the biodiversity cluster. If you look at this next screenshot, there are more clusters. Before, we had these two clusters, one and two, which looked like they could be a roughly equal split. But I always think you have to consider that this is just a two-dimensional representation of a multi-dimensional space.

So let's look a little further. If we cluster everything we have into five clusters, we see this cluster number three in light green (I hope you can see that), and another big cluster, five, in red, but then there are tiny clusters in between. They all have a major topic, and you probably want to dive into a topic and see which other documents are connected, to really understand what you have in your knowledge base and build on that.

And you can do fun things too. Here everything, or many things, are quite close together, so let me show you an example where you really see a huge split; a few Germans will probably guess what this corpus actually is. You see a very clear split between two things: the cluster on the left side is actually the contents, and the other cluster, number two, is just editorial and extra material. It was really great to see how we could automatically separate the information from the other stuff, because we would only be interested in looking into cluster number one. And here's a little example of how the summaries and keywords work: on the left-hand side you see the top keywords. This is the biodiversity cluster: degradation, soil, natural, forest conservation; you see how it makes sense. And we did not really have to do a lot of tuning for that.
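The talk doesn't name the exact tooling behind these screenshots, so as a rough illustration only: here is a minimal sketch of that kind of clustering-plus-keywords workflow, assuming scikit-learn's TfidfVectorizer and KMeans and a toy corpus; the real project would certainly differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus standing in for a real knowledge base of documents.
docs = [
    "Soil degradation threatens biodiversity in natural forests.",
    "Forest conservation protects biodiversity and soil quality.",
    "Recycling plastics reduces landfill waste.",
    "Municipal recycling programs cut waste collection costs.",
]

# Turn each document into a TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Group the documents into two clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# "Automatic keyword generation": highest-weighted terms per cluster centroid.
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:5]
    print(f"cluster {i}:", ", ".join(terms[t] for t in top))
```

The two-dimensional cluster plots shown in the talk would come from projecting these high-dimensional vectors down, which is why the apparent distances on screen should be taken with a grain of salt.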
So it was pretty amazing to see how natural language processing and the techniques we used could help us get at all the knowledge here.

OK, another thing: in case you have natural language processing tasks, as I do, or you enjoy working with natural language — hello, I'm Alexander Hendorf. I'm a partner and consultant at Königsweg. My domain is AI and data science, and I'm a Python Software Foundation fellow. I like organizing conferences; maybe we met at some conference, and if not, I hope we will meet at one of the future conferences once we can meet in person again. Königsweg is my company, I'm one of the partners, and we do data science and AI projects, and we love to work with people, including freelancers, on projects.

Now let me talk a little bit about the natural language processing toolset. As I said, there are a lot of blog posts, and natural language processing has been moving really fast. A few years ago, this talk would probably have been about NLTK, the Natural Language Toolkit. And you have probably read about things like ELMo, BERT, or GPT-2 or -3 somewhere on Twitter or LinkedIn. But I would like to tell you a little more about spaCy here, because spaCy is a great tool for NLP.

So what actually is spaCy? spaCy was started in 2014 by Ines and Matthew. Let me bring up the screen here: this is Ines and Matthew. In 2014, they made a really bold move when they decided to start spaCy and open-source it. Because at that time, the whole hype said: everything in natural language will go to the cloud. People were really pushing: yes, we have these general services, natural language will be solved, you can throw any data at our cloud systems and they will give you the right answers. It turns out natural language needs a lot of fine-tuning. It's really necessary to work within the domain, to stay within a specific corpus, because most NLP tasks are not about answering any question in the world. That's something Google wants, but if you are a company, you really only want to see what's in your knowledge base, and very likely you don't even want to contaminate it with outside knowledge, because you want to see what your knowledge is. You can always Google; this kind of NLP is not the tool for that.

So they started spaCy. Thanks, Ines and Matthew. Matthew is actually from Australia; he was a researcher and moved to Berlin, they met, and they started something great. Their company is called Explosion AI, and spaCy is a stable open-source library. It nowadays supports more than 55 languages and comes with many pre-trained language models, so you really have a solid starting ground for natural language tasks. It's designed for production usage. The key advantages: it's fast, it's intuitive, and it has great documentation; there are examples and very good explanations for everything on the spaCy website. It also supports simple deep learning integration. So there's a lot out of the box: named entity recognition (is this a person or a company?), part-of-speech tagging, labeled dependency parsing, and so on.
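To make that concrete, here is a minimal quickstart sketch, assuming spaCy v3 and the small English model en_core_web_sm (the exact labels you get will vary by model and version):

```python
# Quickstart, assuming spaCy v3 and the small English model:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. David Bowman joined Apple in London on Tuesday.")

# Named entity recognition works out of the box, no training needed.
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typically something like: David Bowman PERSON, Apple ORG,
# London GPE, Tuesday DATE
```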
I'll actually say a little more about that; if you don't know the terms yet, don't worry. The building blocks of spaCy: tokenization, so you basically take each word; each word here is a token, and punctuation is a token too. Part-of-speech tagging: "cat" is a noun, "scratch" is a verb. Lemmatization: "cats" to "cat", so from plural to singular, or "scratched" to "scratch", from the past tense to the base form of the verb. There's also sentence boundary detection: something like "Hello, Mrs. Poppins" is probably not obvious, because the period after "Mrs." is not the end of the sentence, it's just an abbreviation.

spaCy can also recognize named entities: Mary Poppins is a person, Apple is a company. This of course depends on the language model you use and on what you can get from it. Distinguishing between Apple the fruit and Apple the company actually depends on the context. You wouldn't say, "today I ate Apple", and probably nobody can do that anyway, because the company is so expensive that nobody can eat it. So likely it's the fruit. Maybe Apple can eat a startup, but anyway.

OK, serialization is basically saving your NLP documents; a little more about documents later. We have dependency parsing: how do the tokens, the words, everything in the sentence, depend on each other? How are they linked? You can train and update models here, you can classify text, and you can also add rules. For example, many companies have products, or what we call company speech: they use abbreviations for their products, and everybody in the company knows what people are referring to, but outsiders won't understand. This is something you have to add when you work with NLP, and it's really easy with spaCy. That's a very simple rule; you can add more complex rules to help spaCy understand the text and language you're working on better. (There's a sketch of this after this section.)

We also have some built-in rules. Although spaCy uses a lot of statistics, for many languages things are still partly rule-based. For example, the English plural is very simple, it's just an "s" at the end, and stuff like that is built into spaCy. The rules are more general; depending on the language, they don't cover every exception, because, well, it's open source and somebody has to maintain it. So if you find something new, remember you might also want to contribute back. And of course, not all features are supported for all languages, so check before you jump in, before you get disappointed, because you'd probably expect it will do everything. Always keep in mind: when you talk NLP and see great results, they are in the English language. Most of the research is done on English and Chinese, so you won't get the same quality in other languages, because they might be more complex or simply not as well researched.

There are also built-in models: language models for many languages like German, English, and French, trained on multiple corpora. Just go to the spaCy website and check them out.
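Tying those building blocks together, here is a small sketch, again assuming spaCy v3; the product name KW3000 is a made-up example of the kind of company speech you might teach spaCy with a rule:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cats scratched the sofa. Hello, Mrs. Poppins!")

# Tokenization, part-of-speech tags, and lemmas ("cats" -> "cat").
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# Sentence boundary detection: the period after "Mrs." does not split.
for sent in doc.sents:
    print(sent.text)

# A simple rule: teach spaCy a made-up in-house product abbreviation.
matcher = Matcher(nlp.vocab)
matcher.add("PRODUCT", [[{"TEXT": "KW3000"}]])  # hypothetical company speech
memo = nlp("Ship the KW3000 to the customer by Friday.")
for match_id, start, end in matcher(memo):
    print(nlp.vocab.strings[match_id], memo[start:end].text)
```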
So you have language models with word vectors, for example, and this is the classic example: when you train on a large corpus and put it all into a vector space, one great thing is that you find parallels, like "king" to "queen" being parallel to "man" to "woman". The model is able to pick up these connections, and this can help when working with NLP. You can also start with one of the built-in models; many of them are trained on Wikipedia. You can start from there and continue training, updating your models with nlp.update if you have a specific corpus you want to train on. And always remember: to train these models you likely need a lot of data. Not just a few documents; you really need a lot. "The more the better" is unfortunately the rule of thumb here. The latest language models use billions of parameters, which is beyond most people's reach, but it's also beyond what you actually require for a successful NLP project in your company.

Another great thing, being at EuroPython: spaCy is very Pythonic. If you have a good understanding of Python, getting into spaCy is quite easy. And if you don't have a good understanding of Python, I would really suggest brushing up your Python skills, because you should know about the mechanics of objects, iteration, comprehensions, classes, and methods: spaCy uses a lot of them. Very often there are data structures and many different methods to access these structures, as we need. It has a very extensive API for many things, which might be a bit overwhelming at first. It's probably not as overwhelming as the pandas API, but we're getting close, maybe, sorry. So don't let that scare you away. There's an extensive API, so you have many access points to your NLP documents. There is basically not much you have to program yourself if you are looking for something specific. My first suggestion, my advice here: first check whether there is already a method in the API before you start programming things yourself. Likely you will find it somewhere.

A great thing about spaCy is also pipelines. spaCy processes text in a pipeline, and there is a default pipeline. First the text is tokenized: take the words, exclamation marks and everything, and tokenize them. Then it's tagged, parsed, and run through entity recognition; that's the standard process we see here. And the result is what we call the NLP document, the Doc you see here on the right-hand side. That's not the end of it: you can customize your pipelines, turn components on and off, and add custom elements. It's super flexible, and spaCy is also nice for scaling things, but you probably need a bit more spaCy experience to scale when you have a lot to do.

Also helpful is the visualization in spaCy. It used to be a separate package, also by Explosion, called displaCy; it now comes with spaCy, and you can visualize dependencies and entities. For example, like this. Here we see "The astronaut walked through the spaceship's corridor to shut off HAL". We see "astronaut" is a noun, "walked" is a verb, so the astronaut walked through, and you see the spans here: the sentence was analyzed and we see how the pieces belong to each other. spaCy has that built in, and you can visualize it.
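A minimal sketch of inspecting the default pipeline and rendering with displaCy, assuming spaCy v3; the exact component names you see depend on the model:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

# The default pipeline: the tokenizer runs first, then these components
# in order, producing the Doc.
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', ..., 'ner']

doc = nlp("The astronaut walked through the spaceship's corridor to shut off HAL.")

# Visualize the dependency parse; style="ent" highlights named entities.
displacy.render(doc, style="dep")  # in a plain script, displacy.serve(...)

# Word vectors need a model shipped with vectors, e.g. en_core_web_md:
#   nlp_md = spacy.load("en_core_web_md")
#   print(nlp_md("king").similarity(nlp_md("queen")))
```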
It's also good for checking whether spaCy is actually interpreting everything correctly, because sometimes this is not how a human would solve it, and sometimes computers simply make mistakes; don't forget that. Another nice thing: you can of course also do things like this. Here we see Dr. David Bowman looking for his Apple iPhone on Tuesday; he's walking the corridor and looking for his iPhone. Everything you see here is spaCy out of the box, without any extra training or customization, and we already see David Bowman identified as a person, Apple as an organization, and Tuesday as a date. Of course this is helpful if you want to extract knowledge from documents. If, for example, you want to analyze court rulings, you could even go from "this is a person" to "this person is also a judge", something like that. And so on and so on.

A look at the time here, yeah. So, serialization is built into spaCy. You can save the documents, the vocabulary, and the model. These are separate things you need to save, so it's not just one document; you always have to consider which model you worked with and the vocabulary it's built on. But remember, larger projects might require different strategies, like including a database, because you cannot always process everything from start to end. (There's a small sketch of the built-in serialization below.)

There are some danger zones, of course, which you always have to consider, not only with spaCy but with everything you do in data science and NLP: privacy; bias, and always keep in mind that data is always biased, so just try to minimize it and learn how to work with the bias, and do not just say, hey, data is the truth; there might be legal things you have to consider; and always remember that language is never perfect, it's always dynamic. These are the danger zones.

So is spaCy the only thing you need? Does it solve everything? Can NLP rule all languages? No. As I pointed out earlier, NLP is really strong in English and Chinese. Of course it's not bad in German, and translation especially is solved for many languages nowadays, or at least at a very high quality level, but other tasks might be different.

spaCy has a whole universe of extensions I would recommend looking into. For example, there's an extension from Hugging Face; Hugging Face is also a company that open-sources a lot of handy tools. This is an example of neuralcoref. So what is neural coreference resolution? If you say "Angela Merkel just announced a big tax relief package. The Chancellor...", coreference resolution would resolve "the Chancellor" back to Angela Merkel, effectively replacing "the Chancellor" with "Angela Merkel" again. If we want to analyze text, we have to know who the referent of "the Chancellor" actually is, and Angela Merkel should of course be the same reference here, because it refers to the same person.

So what about bugs? spaCy is well maintained; there's a good response time on bug reports, and there are constant, well-documented updates. When you work with extensions, it might be harder to integrate: you may want to stick to a specific spaCy version when you work with specific extensions, because they are not all moving at the same speed with their updates. So you probably need a slightly different strategy here; it's probably not only one environment you need for all the tasks, you may have to split them up and build a slightly more advanced architecture around it.
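Coming back to the serialization point above (the sketch promised there): a minimal example assuming spaCy v3, with hypothetical paths; DocBin is spaCy's container for serializing many Docs together:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")

# Save the whole pipeline (model weights and vocabulary) to disk.
nlp.to_disk("my_pipeline")            # hypothetical path
nlp_reloaded = spacy.load("my_pipeline")

# Save many processed documents efficiently with DocBin.
doc_bin = DocBin()
for text in ["First document.", "Second document."]:
    doc_bin.add(nlp(text))
doc_bin.to_disk("docs.spacy")         # hypothetical path

# Reload later: note you need the vocab from the same model to restore docs,
# which is why model, vocabulary, and documents are separate things to save.
docs = list(DocBin().from_disk("docs.spacy").get_docs(nlp.vocab))
```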
And a comparison to other tools I mentioned before: besides NLTK, there are also Gensim, TextBlob, and Pattern. NLTK is, I think, the classic for NLP, but you have likely also heard about Gensim. All of these have their strong suits and maybe a focus on certain things; Pattern, for example, also offers a simple web scraper. One of spaCy's strong suits is that it's usually very close to the state of the art. For example, when a new kind of language model comes along, like transformers: before, everything was about RNNs, LSTMs, and those deep learning architectures, and then, two years ago, it was: no, let's use transformers, they do way better. And spaCy is really good and fast at getting the state of the art implemented so that you can actually use it. It's flexible, it's extendable, and it's fast, because there's a lot of Cython under the hood as well. Sometimes, if you have a very special case, for example if you want to build a contextual assistant or a chatbot, you're probably happy just using Rasa. But Rasa also uses some things from spaCy, so everything is connected in the open-source space.

But last but not least: I just love spaCy. It's great. So thank you, and stay healthy. And yeah, I'm not sure if we have time for questions. Yeah, I think we don't have time for questions right now, I'm sorry. Thank you for that amazing talk; I think we can take the questions to the breakout room. Yeah, sure. Sure, I'll be in the breakout room for a bit, so feel free to ping me with questions. So thank you very much, enjoy EuroPython, and thanks to everybody organizing the conference.