Thank you very much. So, I'm Stefan, and this is Hendrik. Welcome to our talk. It's our first time at EuroPython, we really like the conference, and we are also quite new to Python, so we would like to share some of our experiences and projects. We will talk about an implementation project and how we achieve our visions. One thing is very important: we're talking here about old-school data management. We don't think old-school is bad, because if you remember the party yesterday and think about when it really started, it started when the DJ began playing old-school hip-hop songs. So we really like old school, but we would also like to share some of our new ideas with you.

Some words about us: we both wanted to join the community, and that's why we are both here. I'm more or less responsible for products; Hendrik is our Python guy and data analyst. He is doing his master's at the KIT in Karlsruhe, he has a lot of experience in machine learning, neural nets, and cognitive systems, and he does a lot of research on event detection in big data streams.

Just some small words about our company: we are a small company from Germany. Our core business at the moment is web applications, collaboration systems, and mobile solutions. Within the next 20 or 30 minutes we would like to share our journey: how we discovered Python and why we are using it. I will start with our vision, because we will begin with one of our products that has existed for 20 years, and with Python we are able to implement these visions. Then Hendrik will show you how we do this and how we use Python to cluster communication and detect events. Another reason why we are here is that we are the new guys.
So everything is new, and we like it. If you have any feedback for us, please talk to us later, because we can learn a lot.

I would like to start with the only, let's say, sales slide, to tell you a little bit about our product. We take care of collaboration and communication in large projects. The idea is that in this solution the industry manages correspondence, documents, everything you need in these big projects, and shares this information. This product has existed for 20 years now, and it all started with Lotus Notes Domino. Does anyone know it? Yeah, okay. Unfortunately, not a lot of companies use Lotus Notes anymore, but 20 years ago we started with it. Let's say eight years ago we switched to Java web applications, and I've learned a lot this week: we all love Python, but Java is a little bit strange. In 2009 we decided we had to move to a new technology, because all of our customers had left Lotus Notes. That was also the time when we decided to introduce agile development methodologies and so on. And two years ago we finally discovered Python, because Python really helped us implement our visions in our products.

Before we get into this vision, a little bit more about the challenges we have in the industry. Our customers are building these kinds of big plants, and these projects take two years, five years, ten years, fifteen years. We have a lot of communication and a lot of complexity. That means a lot of information and a lot of data we have to take care of, and it gets even worse when we look at how many people are working in these big projects. We have different disciplines.
We have external people working in these projects: engineers, commercial people, sales people, project management teams from different suppliers, consultants, and so on. A lot of people are working together, and when we look at the communication in such a project, it's a mess. We have thousands of mails, we have messengers today, we have a lot of different systems where this information is stored, and it's a really big challenge for the industry to take care of it. And it gets even worse when we talk about portfolios: the big players in the industry handle thousands of projects at the same time, and we have to find a way to manage this kind of communication and information. There are a lot of solutions out there trying to solve this problem.

And what are we doing, not just us but everybody in the industry? We are trying to manage our project communication. In IT departments we use Slack or Jira today, but in these kinds of projects people are still used to managing communication and data in folders. We have a folder structure to organize this information, we have tools like correspondence metadata, we have controlling possibilities, all with manual work: you can define favorites, tags, reports, filters, full-text search, and so on. I think you know that from your own solutions, or from different kinds of solutions. The challenge we have here is that everything is manual. You have to classify everything manually, you have to organize your data manually in these kinds of systems. This is something we call "searched content": users of these systems want to search for specific content or information.
It's always like: I know the topic, I know that there's something in there, and I go into these systems to find it. We do it the same way in the solutions we provide for the industry: everything is this kind of searched content, where you have to look for the information yourself.

But we had a vision. We all know Facebook and these cool technologies, so is this kind of manual classification of information still state of the art today? Do we have to classify information manually? Do we have to classify correspondence manually? That was one question we asked ourselves. The other question was: we can manage this data, but we always manage it by, let's say, tags: who sent it, who created a document. We never used the core information that is inside the document or the correspondence itself. Our vision over the last years was always: can we change the way these people work, and support them by providing content and information to our users? How can we make our application provide content? Can't we present this kind of content to the specific audience? We always asked ourselves these questions, but we never found an answer, and we never had the technologies to do these kinds of things.

So what did we do? We talked to our customers, like all the other companies, and we got a lot of information back and summarized it. That's how we developed our vision. The challenge was that we could not implement that vision with our existing tools like Java and all these things. Just to summarize: we wanted to implement these cool features like recommendation engines. Can't we drill the information down to what the user needs at a specific time? Can't we use this project correspondence and communication data to identify
domain experts in a project? If you're working in a big company and all this information is available, can't we profile users, can't we tell the community in a company "we have these experts", can't we detect this automatically? Can't we identify trends and risks in projects, so that when a project manager opens the application in the morning, it tells him: welcome back to your project, something important happened, please have a look at it? And we also asked: can't we implement things like clustering and event detection? When we have a lot of correspondence and a lot of information, can't we implement an automated process that brings this information together?

That's what we are going to show you now, because we will show you how we implemented what we call machine learning as a service, which allows us to identify topics and clusters in correspondence. And of course we did that with Python, and that's why we're here. Hendrik will now show you in detail how we did it.

Thank you. A warm welcome from me, too. I will show you how we solve some of the problems Stefan just identified, and I will talk about the task of identifying topics, hot topics, and events in social stream data, because as you can see, communication within projects, emails and correspondence, is just a social stream. So what are topics, after all? Topics are basically labeled clusters, and clusters are points in a space which belong together due to their similarity. In the picture on the right side you can identify three distinct clusters, the red one, the green one, and the blue one, plus some outliers outside the clusters. You can think of the points as communications, emails, tweets, anything which is a communication, basically. And if you manage to put a label on a cluster,
you have basically identified the topic. So maybe the green cluster is concerned with Order 66, the red one with project management, and the blue one with invoices.

What are hot topics, after all? Hot topics are basically communications that belong together and grow exceptionally within a distinct period of time. It's similar to a trend, but a trend evolves over time. You could see EuroPython as a trend if you monitor the Twitter stream: it begins slightly before the conference and holds until slightly after it, as messages with the hashtag #EuroPython are more frequent in this time period. In contrast, we identify events in streaming data also as exceptional cluster growth, but in a shorter time period, or as the creation of a new cluster, when there is a communication which is not similar to any other communication and has to be put into a new cluster. I won't speak about noise, because this is another hot topic and I could fill much more time with it.

So what information do we have available to identify these? We know our participants, we know our content, and we have the metadata, which is manually put into our communication and ordered by our customers. So we can build a communication model. This is a social stream graph: people talk to other people, send messages to each other, and those messages are maybe tagged with the aforementioned metadata. We compare messages and texts to each other, but we also identify groups of people who belong together because they communicate a lot within the group, and of course outliers who talk only rarely with other people.

So what is this graph, what is each communication built of? This is our atomic model.
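As a short aside before the communication model itself: the event detection just described, exceptional cluster growth within a short time window, can be sketched in a few lines of plain Python. Everything below (the function name, the growth factor, the toy cluster ids) is illustrative only, not the speakers' actual implementation:

```python
from collections import Counter

def hot_clusters(windows, factor=3.0, min_count=5):
    """Flag clusters whose message count in the newest time window
    exceeds `factor` times their average count in earlier windows,
    or which did not exist before (a brand-new cluster = an event).

    `windows` is a list of lists of cluster ids, one inner list per
    time window, oldest first. All names here are illustrative.
    """
    *history, latest = windows
    latest_counts = Counter(latest)
    hot = []
    for cluster, count in latest_counts.items():
        if count < min_count:
            continue  # too small to call it a hot topic
        past = [Counter(w)[cluster] for w in history]
        baseline = sum(past) / len(past) if past else 0.0
        # new cluster (baseline 0) or exceptional growth over baseline
        if baseline == 0 or count >= factor * baseline:
            hot.append(cluster)
    return hot

windows = [
    ["pm"] * 4 + ["inv"] * 3,       # window 1
    ["pm"] * 5 + ["inv"] * 2,       # window 2
    ["pm"] * 4 + ["order66"] * 6,   # newest window: a new cluster appears
]
print(hot_clusters(windows))  # → ['order66']
```

A real streaming implementation would of course use sliding windows and a statistically sounder baseline than a plain mean, but the shape of the computation is the same.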
We call it the social stream object. Each communication consists of a sender, a content (depicted by the edges here in the graph), and a set of one or more receivers. Basically it's a hypergraph, as t1 depicts a single message which is sent to three people. If you compare it with big streams like Twitter, there you would have a big audience: every Twitter user who is able to see your message.

So what do we do? The hardest part of it all is cleaning and normalization of the data. Of course you have a lot of noise in any communication data. This is a machine learning problem, for example removing footers or reply lines from emails and other communication, and removing stop words. We utilize neural networks, specifically a multi-layer perceptron trained on our company data, to remove those lines from emails automatically. Then we compare the textual similarity; we compare the structural similarity, who sends correspondence to whom; and further similarities come from the tags, the metadata we have within our communication.

The similarities in detail are basically relatively simple. They are the term frequency / inverse document frequency (TF-IDF) based cosine similarity between the correspondences or the clusters; we also have bit vectors which depict the sender/receiver sets within a cluster and a correspondence; and we normalize tag mutualities between the different correspondences. Most algorithms suited for streaming data (and email and company correspondence can be seen as very slow stream data) expect one value, so we built a linear combination of the different similarities, generalized over the many similarity measures we can gather. The lambda here is harder to infer from our system domain, because it seems that the structure, who sends whom an email, is much more beneficial for clustering than the actual content.

So how do we do this?
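The similarity machinery just described (TF-IDF cosine similarity for content, sender/receiver overlap for structure, a lambda-weighted linear combination) can be sketched without any external libraries. This is a minimal illustration under assumed names, not the production code:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF weight vectors for a list of tokenised documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    """Overlap of sender/receiver sets, standing in for the bit vectors."""
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_similarity(text_sim, structure_sim, lam=0.7):
    """Linear combination; a large lam weights structure over content,
    as the talk notes who-writes-to-whom was the stronger signal."""
    return lam * structure_sim + (1 - lam) * text_sim

docs = [
    ["invoice", "payment", "due"],
    ["invoice", "overdue", "payment"],
    ["order", "sixty", "six"],
]
vecs = tfidf_vectors(docs)
# the two invoice mails are more alike than an invoice and Order 66
assert cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2])
```

In practice one would use scikit-learn's `TfidfVectorizer` and `cosine_similarity` instead of hand-rolled versions; the point here is only the shape of the combined measure.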
We evolved from a Java company, and Java is not so good for data science: you need just too much time to write boilerplate code, and you can't experiment fast on some new data or algorithm. Python, on the other hand, delivers awesome libraries, which we utilize for just this cause. We have Jupyter and pandas for quick experiments on data or for trying to implement some machine learning algorithm. We have spaCy, an awesome, fast natural language processing library, which we utilize for lemmatization. Does everybody know what lemmatization is? Okay: basically, we try to get the word stems, the base forms, from each word, to get a normalized representation of the word, and spaCy is really fast at that. We use Flask to expose our services to our other solutions, and we use scikit-learn to implement, for example, multi-layer perceptrons and support vector machines to identify noise in correspondences. The results of our research are stored in MongoDB, a very good and very fast NoSQL database, but I guess everybody knows it.

So what's our workflow? It's slightly different now. We came from normal iterative, incremental work, and now we have to do research, as some of the solutions just don't exist yet; you have to experiment to get them right. So basically we begin with a Jupyter notebook, do some work, try something out, and from there we go to design, implementation, testing of course, and a partial deploy of a piece of business functionality, you could say. But due to our inexperience with Python, between Jupyter and design and between design and implementation there are often some hiccups, because you have to adapt to the Python rules; you can't just build Java-style solutions in Python, or it would be really ugly. So the cycle is a little bit shorter, and you jump back to the Jupyter notebook more often than you would wish.

So how do we act?
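Since spaCy's lemmatization needs a downloaded language model, here is a deliberately tiny stand-in, a lookup table plus suffix stripping, only to illustrate what lemmatization and stop-word removal buy you before computing text similarity. A real pipeline would use spaCy's `token.lemma_`; every name below is illustrative:

```python
# Toy lemmatiser: spaCy does this properly with trained language models.
# This lookup-plus-suffix fallback only illustrates the idea of mapping
# inflected forms to a normalised base form before comparing texts.
IRREGULAR = {"sent": "send", "was": "be", "were": "be", "invoices": "invoice"}

def lemmatize(token):
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    for suffix in ("ing", "ed", "es", "s"):
        # only strip when a plausible base form remains
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize(text, stop_words=frozenset({"the", "a", "to", "we"})):
    """Lower-case, lemmatise, and drop stop words from a message."""
    return [lemmatize(t) for t in text.split() if t.lower() not in stop_words]

print(normalize("We sent the invoices"))  # → ['send', 'invoice']
```

With spaCy the same normalisation is roughly `[t.lemma_ for t in nlp(text) if not t.is_stop]`, and unlike this toy it handles irregular forms for a whole language, fast.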
How do we interface with our existing solutions? We build Java web applications with their own quite sophisticated security measures, authorization features, authentication, their own object-relational mapper (a quite modified Hibernate, for example, which makes things harder), and a multitude of databases which our clients want supported. So we have to expose our solution in other ways. This is basically PIRS-A: PIRS stands for our existing system, and the A for analytics. It consists of an expose-and-control API to hand control back to our other solutions, a resource API which serves the findings our algorithms produce, and of course the processing and the state management. These are basically the simplest parts of the application, and we call the whole thing simply MLaaS: machine learning as a service.

So what were the challenges we had? Basically, security concerns: we handle highly confidential material, in the case of plant building or chemical sites. How do we implement security, how do we guarantee that security holds between our former systems and the new machine learning systems? The services must be designed specifically for this, and we have to adhere to certain security standards within the industry. Interfacing was a problem, because if we want to analyze data within a former database, we would have to access the database directly, which would break the security rules. So we decided on a loosely coupled system, for example read-only access, no writing. Then, we are relatively inexperienced with the scientific cycle, so process iterations within iterations may not be the best thing we can do, and we have to gather more experience.

The last concern is an ethical one: privacy. The information we build can of course be misused to spy on co-workers: how do they work, and so on. So it's a consumer problem: what do we really want to expose from the things
we find in the customer's data? Both of these processes are not finished, and advice would be welcome. With that, I want to thank you for listening, and if you have any questions, we would be happy to answer them.

Okay, we have almost eight minutes for questions before the coffee break, so great on time, excellent. Any question? Yes?

Thank you very much for the talk and for this tremendous transition from Java to Python. Specifically, I'm interested in the exact tool set, from the Python standard library or maybe something else.

I can't hear you well, could you repeat the last part? I've understood "Python standard library", but...

Basically, what were the key things in your decision to switch from Java to Python?

The main point was the search for natural language processing libraries. We compared them, we have speed constraints, and the fastest natural language processing library out there at the moment is spaCy; every millisecond added here or there would greatly slow down our solution space, of course. After a little research and experimentation with spaCy, the choice wasn't hard, because we also need a multifunctional language. Pure data science tools like R, even if it sounds like a pirate language, which is cool, wouldn't come to mind, and Python has great capabilities for interfacing with other technologies, of course. Those were the basic reasons why we chose Python.

Any other question? Yes?

Do you share some of your code open source in some way, on GitHub or wherever?
Basically, we share some of the code, the similarity measures and the tools, but unfortunately not the streaming code. The reason behind this is not that we don't want to share, but there is customer-specific metadata text included which would give hints on processes within customers' projects, and we just can't do that.

Any other question? Yes?

There were some hiccups, of course, and a little bit of a bumpy road. The first steps were not so hard, the experimentation with Jupyter, and Python is really a language which you can learn quite easily but master quite hard. So we try to gather more experience with it. Right now my solution space seems a little "Javanic", I guess, but I have to forget some of the Java things and throw them overboard in my mind to build better systems. If I can, I actually won't go back to Java. Okay, but I have to admit we are still using Java for the product, so we are not Python-only now; we're using both worlds.

Any other question? We have three more minutes for some questions. If not, I can ask a question myself. Very fascinating project, and a great success story for Python, obviously. Can you give us an idea of the type of scale you're working at? I imagine you have tons of email messages or chat messages coming in; can you give a rough idea?

You want to know how many messages go through our system?
It's project specific, but our bigger projects have 200,000 to 500,000 communications. And that is not all: there are also documents involved, which are quite large, and we also search the document space, which isn't mentioned here, because for the clustering process you don't need to inspect the documents at first. About the number of projects, Stefan may be better able to answer.

I would say in an average project you will have about six to ten thousand emails a month. So it's not really big data, but you have a lot of additional information that we evaluate, like actions or other messages. An average customer of ours has, let's say, 400 to 500 active projects. That's the average amount of data.

Any other question? We have time for one more. Well, I can also ask the last question myself. You mentioned privacy issues, and I could imagine some people, especially workers in those companies, could be a little bit nervous. Do you get any feedback or resistance from anybody, or are people pretty happy and see the advantages?

In the end it's the decision of the customer company whether they introduce this kind of system, but we got some feedback. If you remember the slide where I said we talked to our customers: during those workshops we had a lot of controversial conversations. You found every opinion. People who like Facebook also like this kind of system, for example, but you also found engineers or project managers who said: no, I don't want such a system, because I can still think for myself. But this tool is just a tool that should support your daily work; it doesn't force you to do anything differently than before. So you find every kind of opinion regarding these tools.

Excellent. Well, thank you very much. Let's thank the speakers again.