that this talk is going to be about NLP and it's called NLPeasy, and we can see this already. You're going to present to us an easy workflow. I'm really curious to see what that is. Yeah, sure, everything is ready for this. So please start your session. Cool. Yeah, so thanks for having me here at EuroPython 2020. As you already said, I will talk about a workflow to analyze, enrich and explore textual data. You said NLPeasy; I like to call it NLP easy, language squeezy. Maybe first quickly about me. My background is in mathematics. I did a PhD in probability theory, then spent a couple of years as a postdoc in machine learning at the University of Stuttgart. Then I came back to Zurich, where I'm now a managing consultant with D ONE Solutions. My projects here are mostly data science, machine learning, AI, some infrastructure, visualizations, and coaching of data science teams, stuff like that. And in my free time I do a couple of open source projects. The last one, plotVR, I actually presented last year at EuroPython in Basel, and today I'm proud to present NLPeasy. A little bit about D ONE, maybe for a minute: we are a consultancy with over 50 data professionals based in Zurich. Most of our clients are in Switzerland, a couple of them also abroad. We cover many parts of the data pipeline, the data journey of a company. That can be business consulting at the top level, then also data architecture, so how data should be set up in a company. We do lots with data experience, so we can help you with Power BI and Tableau dashboards, and we also have some award-winning visualizations. We help with data management, so the pipeline of how data flows through a company, and with machine learning and AI, which is what I'm here for today. We do smaller and bigger projects with that. 
And the NLP projects that we have done, just to give you a little bit of an idea of how NLPeasy came about, so what its precursors were. One project that we are quite proud of is the product solution advisor for Bossard, a company that sells screws, nuts and bolts, where we combined Elasticsearch and Neo4j. Then for a health insurance, we actually abused word2vec for non-textual data on the claims. We also have a POC where we ingest documents into Azure Cognitive Services and set up a platform for that called Hawkeye. We've done some customer feedback analysis with basic syntax analysis, dependency parsing and other things. And finally, NLPeasy now. And yeah, we are proud that we could show a couple of those projects at national and also some international conferences. So what is the background of NLPeasy? In my experience, NLP obviously is a big thing; it might be the next big thing. There has been big progress in the last years with respect to methods. It started, say, 10 years ago with word2vec, which was really a game changer for NLP, and in the last couple of years more deep models have come out. These are really important. And one extremely nice thing about the developments of the last years is that there are many pre-trained models. So you don't need to spend what they say amounts to a transatlantic flight to train your big BERT model; you can just download it and start using it. That's the methods on one hand. On the other hand, there is abundant data. There is lots of textual data in corporations for sure: customer relationship management entries, mails, documents, maybe customer reviews. There's text everywhere. And until now, for the standard data scientist, it kind of wasn't that accessible. 
But everybody knows these data are really important, because there are many use cases: classification, sentiment analysis, named entity recognition and so on. So why aren't data scientists using them as a standard tool in their toolbox? I think there are a couple of things behind that. One is that NLP obviously is harder than, say, standard machine learning. It's much higher-dimensional than what your usual machine learning methods as a data scientist are capable of handling. You also need some specialized pre-processing to convert those words into something you can do machine learning on. Another thing is that NLP experts usually assume the text is the only thing you are looking at, that you want to extract everything from the text. And I think in most corporate and exploratory situations, that's not the case: the texts are just one part of the data, and you might have other things that NLP experts then call metadata. You might have a long list of columns and only one or two of them are texts. That's one reason why these NLP methods and packages usually don't play that nicely with the standard data scientist's workflow. Also, the methods and models have a reputation of being really hard to use; that might be true or not. And some other standard tools are cumbersome for textual data. If you want to plot something, okay, you would go for ggplot or seaborn, but how do you use text with them? So it's a little bit difficult to grasp results with texts there. Power BI and Tableau have some interfaces to text, but they don't show it too nicely. And your SQL servers can obviously handle text, but are they really equipped to search in it and such? So NLPeasy is something like a vision, and it's actually a package that you can download, that tries to help you out with those things if you are not that big into NLP yourself. So what is NLPeasy? 
Let's see, so NLPeasy in the end is basically a package. You have your data in some pandas DataFrame where each record corresponds to a document, maybe with other information. Then you funnel it through NLPeasy, and NLPeasy can help you with regexes, with spaCy, with Vader, things like that. That's one part: it will enrich your documents by using other really cool giants on whose shoulders we stand. For spaCy, maybe you listened yesterday to the talk on 15 things about spaCy. I love spaCy as well, it's a really cool thing, and if you're interested, please look at that talk. So that's one thing: enrich your documents with NLP methods. The other thing is: how do you get access to the results? There we asked ourselves: how do you work with textual data in your everyday work? You go to Google, you search for things. So our idea was that it needs to be something like Google, something that can search. And that's how we ended up with Elasticsearch. So one possibility with NLPeasy is to ingest everything into an Elasticsearch database. That might sound a little bit too big for you if you're not accustomed to Elasticsearch, but actually we help you a lot with that, because we can start it for you on a Docker daemon that you have running. We will see that. NLPeasy is Apache License 2.0, you can install it and you can add pull requests. There's a demo Python notebook and so on. So how does it work? I'll go through a demonstration in just a bit, but first let me show you the basic ingest. You connect to an Elasticsearch server using just one line, and it might start one on your Docker daemon if you don't have something running already. Then you need to get your data; NLPeasy cannot help you with that. And obviously, all the tools that you use for pre-processing your data, please use them on this data as well. 
So for instance, here I scraped abstracts of the Neural Information Processing Systems conference. Then you start with NLPeasy: you set up a pipeline. First you say a couple of things about the columns you already have, the message and the title, and maybe you have a date column, the year. Then you add some enrichment steps. For instance regexes: this one here parses LaTeX math expressions out of the message column and puts them into a math column. Vader sentiment calculates a sentiment on the message, and spaCy enrichment does a lot of things: we use a spaCy model to extract entities and part of speech, and you can also go into dependency parsing and such. And then you just ingest, and it writes everything to Elasticsearch. That's really nice: you have your data in a database that's really well equipped for textual data. But for exploratory analysis, that's not the best thing; it would be much better if you could just look at the data. And we actually do that for you as well. With one command you can create, in Kibana, which is the graphing interface to Elasticsearch, lots of visualizations, and put them into a final dashboard. Usually that takes lots of clicks in Kibana; here it happens automatically for you. Maybe one more thing about how these different visualizations come about. In the beginning, you say: I have a couple of text columns and one date column. So after that the pipeline knows message and title are text and year is a date. Then you add the regex extraction, which takes the message column and outputs a math column. The math column now for each record is a list of extracts, so a list of tags. That's not the same thing as text; it's more like a factor, a category, something like that. Then when you add the Vader sentiment, the pipeline knows: oh, there's a numeric column, sentiment. 
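The regex enrichment step described above can be sketched with plain `re`; the pattern and the function name here are my assumptions for illustration, not NLPeasy's actual API, and only cover simple inline `$...$` math.

```python
import re

# Hypothetical stand-in for a regex enrichment stage: extract LaTeX
# math expressions (here: inline $...$ spans) from a message column
# into a list of tags for a separate "math" column.
MATH_RE = re.compile(r"\$([^$]+)\$")

def extract_math(message):
    """Return the list of inline LaTeX math snippets found in `message`."""
    return MATH_RE.findall(message)

abstract = "We bound the error by $O(n \\log n)$ and assume $x \\in \\mathbb{R}^d$."
print(extract_math(abstract))  # → ['O(n \\log n)', 'x \\in \\mathbb{R}^d']
```

Each record then carries a (possibly empty) list of extracted snippets, which is exactly the "list of tags" shape the pipeline treats as a categorical column rather than free text.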
And when you do spaCy enrichment, it adds lots of columns; a couple of those are numeric and a couple are tags. We will see that. Now when you generate the dashboard, the text columns go on the one side into such an overview, and on the other side into a nice word cloud. Your numeric columns get histograms and your tag columns get bar charts. Good. So let's see whether the demo gods are willing today. I set up a small data set here: basically I scraped the list of sessions at EuroPython 2020. That's a standard Beautiful Soup thing; I don't want to get too much into that. I now have all of the talks with their title, the URL, the list of authors and the authors' profiles. Usually in pandas, if you had something like that, you would search for it using such a cumbersome expression. You see I searched for NLP, but it might not be so easy to find where it actually occurs. That's one thing. Another source I took: during the voting phase of EuroPython, I also scraped all of the proposed talks. About half of them actually did win. Here it was much easier to get everything, because all of the abstracts were on that page. So again some Beautiful Soup scraping, and you see we now have interesting data: a title, a subtitle, an author, a list of keywords, the type of the talk, the Python level, the domain level, the abstract and so on. For instance, you see the proposal types are different things. Actually, there were a couple of duplicates, or twins, so we dropped them. And we see that all of the proposal titles appear among the talk titles, so they didn't change the titles, and we can just do a join here. Now we have the information whether a title did win or not. Good, that's the preparation part. Now what happens with NLPeasy? As I told you, we import nlpeasy as ne. 
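The "cumbersome" pandas search and the title join described above might look roughly like this; the column names and toy rows are made up for illustration, since the actual scraped tables are not shown here.

```python
import pandas as pd

# Toy stand-ins for the scraped tables; column names are assumptions.
talks = pd.DataFrame({"title": ["NLP easy", "Making pandas fly"]})
proposals = pd.DataFrame({
    "title": ["NLP easy", "Making pandas fly", "A talk that lost"],
    "abstract": ["Analyze text with NLP ...", "Fast pandas ...", "..."],
})

# The cumbersome substring search you would do in plain pandas:
hits = proposals[proposals["abstract"].str.contains("NLP", case=False)]
print(hits["title"].tolist())  # → ['NLP easy']

# Since every accepted talk's title appears among the proposal titles,
# a left join flags which proposals won:
merged = proposals.merge(talks.assign(won=True), on="title", how="left")
merged["won"] = merged["won"].fillna(False)
print(merged[["title", "won"]])
```

The substring search works, but it gives you no highlighting or relevance ranking, which is exactly the gap the Elasticsearch ingest fills later.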
Now we start a new Elasticsearch. You see there was no Elasticsearch found here: it tried to connect to something, and there was no container running on my machine with that prefix. So it started an Elasticsearch and a Kibana here, and I can actually click on this thing and go to it. And you see, okay, it's here. Cool. You do need to have Docker installed, but the chances that you have Docker installed are better than the chances that you have Elasticsearch installed, so I think that's quite nice. It also helps that you can have separate Elasticsearch servers for separate projects. So now let's look again at the, whoops, at the columns here. That's not bad: we have title, subtitle and so on. Title, subtitle, abstract, these are all texts. We have some tag columns like the author, the keywords, the type, and whether it did win. We also pass the link to our Elastic stack right there. And now we add, for instance, a regex that should find all of the HTTP links in the abstracts. Actually that doesn't work really nicely yet, but what the heck. We also add a spaCy enrichment; it takes a little bit of time to load this model because it's something like, I don't know, 500 megabytes big. But the pipeline is not run yet, just so you know. One important thing here is that we also want to extract all the spaCy vectors. These are maybe not as good as fastText vectors or maybe BERT tensors or something like that, but they're good enough for now. That's why we also need to go for at least the mid-sized English model here. And then we also add a Vader sentiment, okay? So let's hit it, and you see it takes a little while to process all of these documents, but they are now already ingested into Elasticsearch. That's nice. So let's also create the dashboard. 
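The link-extraction regex mentioned above can be illustrated in plain Python; the pattern below is my own rough assumption, and, as noted in the talk, a simple pattern like this does not handle all edge cases nicely.

```python
import re

# Hypothetical stand-in for the regex stage that pulls HTTP(S) links
# out of an abstract. The character class stops at whitespace and a few
# common closing delimiters; real URLs can be messier than this.
URL_RE = re.compile(r"https?://[^\s)>\]]+")

def extract_links(abstract):
    return URL_RE.findall(abstract)

text = "See https://example.com/docs and (http://example.org) for details."
print(extract_links(text))  # → ['https://example.com/docs', 'http://example.org']
```

Like the math extraction, the result is a list of tags per document, so the dashboard can render it as a bar chart rather than free text.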
And here you see there are a couple of things inside Kibana that you need to understand if you start using Elasticsearch or doing something with it. But in the end it's okay, because here we just set it up for you, so you don't really need to fully understand it. You see, it takes a little bit of time, but then it's there, okay? And for the analysis later, we take only the proposals that won. So now let's go to the Elastic stack here, to the Kibana interface. And you see we have a dashboard here. Okay, let's dismiss this guy. You have a nice interface with the results here, 152. You have all of the authors, all of the keywords, the type. You have, for instance, the word cloud for the titles. You can see the named entities that spaCy extracts in the abstracts: for instance, Python is by far the biggest entity it finds, then API, Django and so on. You have the sentiment, okay, that's cool: most of our abstracts are really nicely written. So let's see, we now want to search for NLP. NLP, here we see there are four results. And what is really nice in this interface: it already highlights for you what these things are about. It now also gives you the information here, and you can then maybe say: I only want to see the ones that did win. Now there are only three, yeah. So it's really fun to dig into it. You might also check here, this is my abstract, in the table that's now ingested into Elasticsearch with all the additional variables that are not even visualized here in the Kibana dashboard. Okay, so you see, in a couple of minutes you have it set up. But then you can also do more things. For instance, if we now look at these results, there are lots of directions you can go. Let's do a hierarchical clustering on the spaCy word vectors of all of these documents. 
So actually here we pull out a variable. You see, for all of the documents you have the word vectors here, 300 dimensions long. We use NumPy to stack them. We have only, sorry, there are only 77 here, because I'm only looking at the ones that did win. And we do a clustering and, ta-da, we can quickly see a first grouping of all of the talks based on the vocabulary they use in their abstracts. Let's see, for instance: apparently "Making pandas fly" is somewhat next to my talk here, and there's also a Pythonic full-text search, that's nice. And then I also tried, now again on all of the proposals, not only the ones that did win, to do something like t-SNE on that and visualize which of the talks did win, the green ones, and which didn't win, the red ones here. And you see they're kind of intermixed in this t-SNE visualization. So I also tried to train a random forest on it first; it didn't work nicely. Then I even tried to use BERT; it didn't go very nicely either. So probably it's not so easy to predict which abstract will win and which won't. You also see that they're often in pairs next to each other, so probably they're not independent: if one gets chosen, maybe the other one will fail, or this one was chosen over the other one, something like that. Okay, that's the end of my demonstration. The demo gods were very helpful. You already saw the similarity that we had; here we did it on customer reviews for restaurants in Zurich on TripAdvisor. If you are from the Zurich area, you probably know most of them. You will see that Hiltl and Tibits are basically the same, really similar in their way, so they're nicely in a cluster here. You have here more the beer places, here the very expensive ones and so on. 
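The stacking and clustering step above can be sketched as follows; random vectors stand in here for the 300-dimensional spaCy document vectors of the 77 winning talks, and the choice of Ward linkage and five flat clusters is my own assumption for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Random stand-ins for the 77 spaCy document vectors (300-dim each).
rng = np.random.default_rng(0)
vectors = [rng.normal(size=300) for _ in range(77)]

# Stack the per-document vectors into one (77, 300) matrix, as in the talk.
X = np.stack(vectors)

# Hierarchical (Ward) clustering on the vectors, then cut the dendrogram
# into at most 5 flat groups.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=5, criterion="maxclust")
print(X.shape, len(set(labels)))
```

With real document vectors, talks sharing vocabulary end up in the same flat cluster, which is the "first grouping" shown in the demo; a dendrogram plot of `Z` would show the neighbor pairs directly.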
So it's really nice that you can do stuff like that with NLPeasy, and you can also use the sentiment score on those restaurant reviews in Kibana. Really just by clicking, if you have the geo coordinates in the Elastic documents as well, you can set up such a geo view and overlay the sentiment here directly. It's really nice to work with those things. You can also do network visualizations. For instance, this is a kind of insider or whistleblower platform in Switzerland for financial news, and we used just the entity recognition to link people and organizations, and you really see how this unfolds. One more thing: you can actually go to MyBinder and just start it, because we set it up so that you will have... so MyBinder actually just starts up Docker containers for you with two gigs of RAM, and we start up a Kibana and an Elasticsearch server and forward the ports over the URL. This works really nicely. Here, please wait maybe a couple of hours: I didn't have it in the master branch yet, but it will be there in just a moment. And there are also the other setups: what I showed you was Jupyter running locally, opening two Docker containers, but you obviously can also use Kibana and Elasticsearch just running on their own. So yeah, thanks, my time is basically up. NLPeasy is open source, so please go ahead and look it up, pip install it. If you have PRs, they're welcome. The package is still under development; I do it mostly in my own time, so I don't have too much time to invest into it, but there are some upcoming features: adding more pipeline stages, for BERT or for cleaning; better support for incremental work, where you run your pipeline for the first time, ingest into Elasticsearch, and then maybe add some more steps and so on, the usual data science workflow; that would be really important to have there. 
Also more stable APIs and documentation: there is some documentation already, but it could be better. And support for integrating the pipeline into a real ETL setup, that would be something cool. Yeah, so if you're interested in NLPeasy or other projects, or, yeah, we are hiring, please contact me. Here's the mail address, and I'll be available in this talk's NLPeasy Discord channel now. So thanks a lot. Yeah, thanks a lot for showing all of this to us. Thank you.