My name is Emin, so let me give you a meta talk first. Two years ago, when I was at FOSSASIA, I saw a presentation on Apache Zeppelin by Alex. He's also around here; he's doing something else now. When I saw it, it clicked with what I wanted to do, because it was filling the gaps I had felt doing data science in my company. So I want to talk about how it fits into our model, and how we leverage JVM-based languages and Spark to do data science.

To give you more context, let me start with a little about me. This is my contact information; you can reach me later. I'm a computer engineer and a data miner (that's still a thing; I got degrees in both of them). So you could say I'm a very hardcore engineer, but now I'm doing data science. This is my title, Full Stack Data Wrangler, which means I get to pick my own title in my company, which is a nice thing. What I do is work on every aspect of the company: DevOps, engineering, data science, modeling. Yeah, I always preferred tabs, but now I'm comfortable with two spaces. And I have a love-hate relationship with Python. There are probably other Python developers here, but it's too easy for me. It's kind of annoying how easy it is to use Python.

To give you more context, I'll also talk a little about the company. I'll keep it short, no worries; I don't have ten-minute videos to show. We have two main products. One is network planning; that's what I'm currently working on. The other is mobility intelligence: if you have x-y-t data, that is, location and time, we can give you insights out of it. We also have APIs. The company is called DataSpark; you can find it online if you want to use the APIs. Our entire stack is built on open-source software, mostly Apache-based cluster installations, and we try to keep everything close to that stack. And we deploy our product in the weirdest environments. I'll go into more detail later, but most of our customers are enterprise customers, and sometimes we have to work in weird rooms with weird access rules. I'll get to that; this is just for context. And since these are products, what we develop has to be maintained for a few years. I still have to maintain code from two or three years ago, which is a big pain if the developer has already left the company. That's why we try to keep everything compact, and why we want to streamline the whole process from data science to the product. And just a fact about the data volume: the pipeline can see up to one terabyte a day. We don't always use all of it, but we have to be scalable.

I want to learn a little about you so I can tune the presentation. Who does data science here? Show of hands, please. OK, good. Who develops products, as in develops and then deploys them somewhere at the end? OK, good. Who's familiar with Python? Almost everybody. Who's familiar with JVM languages? Good, two. OK, they're still doing fine. Who's familiar with R? Ah, OK, good. I don't like those two, but don't take it personally. At first I was thinking of making this presentation more about how I don't like Python, but then I realized you deserve better than that, so I tried to stay away from it.

So, the challenges we face every day as a company and a product. We need our algorithms to be accurate, because it's already a very competitive domain.
And accuracy is one of our selling points. But this high accuracy has to hold on big data: if you go for very complex models to do very complex modeling, they will probably not scale. We have customers in Singapore, which is a city of about three million people, almost. But when we go to Australia, nothing scales. So we always have to keep that in mind. We try to use modern architecture; we have very hipster engineers and developers. We use containers, we use microservices, we try to combine them behind APIs. But then, it's the enterprise world. For example, we try to containerize everything, because the environment always changes. And then I went for a deployment at a telco operator, which is almost as strict as a bank, if not more, and they generally say: give us a WAR package. Do you all remember what a WAR is? OK, you don't have to; that time has passed. In one deployment, for example, we packaged everything: we had zip files, we had Docker images with the algorithms running in them. We went there with a USB stick, and they said, yeah, you can work here. OK, we opened the computer and plugged in the USB. They said, no, you cannot copy anything to the servers. OK, but how are we going to deploy then? You have to email us, and then we will put it there. OK, we gave them the USB; it's around 2, maybe 3 gigs, because of the Docker containers. They said, OK, we can load it in one week. What's happening? It turned out the IT department itself has a bandwidth limitation to the production servers, so copying a 3-gigabyte file takes one week. They had to find a workaround, and so on. So the enterprise environment is a mess; it's hell. I hope you don't have to work in those environments.

And as a data scientist, my boss sometimes comes to me and says: we have this data, let's explore it, let's science the heck out of it. OK, sure: I have pandas, I have NumPy and SciPy, I can run algorithms on that. Then they say: but you cannot download the data; it has to stay on the server. OK, I can work on the server. Yeah, but you cannot pull it out of HDFS either; it has to stay on HDFS. This doesn't happen every day, thankfully, but it can happen, and we have to be ready for it. Then there is reusing product algorithms, and it's the same story. I have the perfect algorithm for the job, I can develop it on sample data, and then maybe we have to deploy it. So they come to me: we have this data, the customer wants these results, can you do it? And it has to be quick; we deliver one week later. OK, I develop something and ship it. Then they tell me there's an SLA. Do you know what an SLA is? A service level agreement, which means you have to maintain what you shipped for a year or some other period. And that dirty code goes into a multimillion-dollar project as a small feature, because our sales team is very aggressive. Thankfully this doesn't happen much anymore, but it's something we have to deal with.

So our entire stack has to be scalable. We have the JVM, we have Spark and the Apache Hadoop framework. And when we work with data, we sometimes need an algorithm from the product. It can be a data cleaning algorithm, a data aggregation algorithm, or something more complex.
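To make that concrete, here is a minimal sketch of what reusing a product algorithm looks like from the JVM side. The RecordParser object, its logic, and the HDFS path are invented for illustration; they are not the actual library from the talk.

    // Hypothetical product utility that would normally live in a product jar.
    object RecordParser {
      def parse(line: String): Array[String] = line.trim.split(",")
    }

    // From Scala Spark, reusing it over an RDD is a one-liner.
    val raw    = sc.textFile("hdfs:///data/records.csv")  // illustrative path
    val parsed = raw.map(RecordParser.parse)
    println(parsed.count())

From Scala this is trivial, which is exactly the point of the next complaint.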
But when we work with Python, we realize that the perfect function we want to use is in Java. So it's very difficult to combine all of them. Sometimes we are lucky enough that we can download the data, take a sample, and work in Python as a team, in Python notebooks. And since it's a team, we use Git for collaboration, and then there's a conflict. Have you ever tried to resolve a git conflict in a notebook? It's hell, because the notebook is in JSON format and everything gets flattened into text. So working with other people on Python notebooks is something I find very hard; I ended up copy-pasting everything, and it becomes unmaintainable. These are some of the challenges we face in building a scalable data science product. I'm sure you'll face some of them some day; I hope you don't. But if you are facing these kinds of things, it's very hard to swing between JVM languages, Hadoop, and Python, at least for us. And I also don't like Python that much; maybe I'm exaggerating a little.

So when we were facing all this, as I said, two years ago I was introduced to Zeppelin. It's an Apache project. Has anybody heard of it already? OK. As I said, it ticked all the right boxes for the problem set I had. I tried it two years ago, at version 0.6, which was pretty unstable and not very capable. But the latest version is 0.7.3, which we use almost every day in, well, not production, but almost production. It's still a notebook, like IPython or Jupyter notebooks, so it's a familiar environment. It works on the cluster, so it can work with our production and staging data, the users don't have to maintain it, and it's always in the same place. And it's multilingual: you're not stuck with one language. That was actually the premise of Jupyter, three languages, but I don't think anybody uses anything other than Python there. Being multilingual means more people can work on it.

Now I can show you some code. This is the welcome screen. I create a new note and give it a name; you can already organize notes into a directory structure. This is the screen that welcomes you. It's similar to Jupyter notebooks, as you can see; a little fancier, maybe. We have the standard buttons for running paragraphs, and we have access to all the notebooks. It's still an engineering product, so it keeps everything in one place, and there is some kind of version control. It's not very mature, but it's there.

OK, without further ado, let me start with something very basic: 3 plus 5, and run the paragraph. This code has just run on our servers, on our cluster, and I get the result: 3 plus 5 is 8. Does anybody recognize this language? It's Scala output; the default interpreter here is Scala. I can go for Python: 3 plus 5, run it again, and this is the same output from Python. As you can see, Scala is already faster here. No, I'm just making things up. I can go for SQL: select 3 plus 5. We get some results; it's still 8, no changes here. I can go for R: 3 plus 5, run it, and this is our output. Up to here it's all Spark, basically, because all of these languages are supported by Spark, so we are still good. I can also go for sh and do the same 3 plus 5; this is a shell command, which I just ran on our cluster. And I can go for md, so it also supports Markdown. Let me put it like this: maybe 3 plus 5 is 8.
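For the record, that whole demo fits in a handful of Zeppelin paragraphs, each starting with an interpreter directive. A sketch of roughly what was typed (the %sh arithmetic is just one way to do it, not necessarily what was on stage):

    %spark
    3 + 5            // Scala is the default interpreter: res0: Int = 8

    %pyspark
    3 + 5            # the same expression, evaluated by Python

    %sql
    select 3 + 5     -- goes through Spark SQL, still 8

    %r
    3 + 5            # the R interpreter

    %sh
    echo $((3 + 5))  # a plain shell command, executed on the cluster node

    %md
    Maybe 3 plus 5 is **8**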
So it supports multiple languages, and that gives you great flexibility when you have a heterogeneous team: everybody can work in the same notebook, with the same data. These are all called interpreters, and the supported interpreters are listed here: Spark, Angular (I can even write JavaScript here), Python, PySpark, PostgreSQL, JDBC, and so on. It's a very long list.

Another thing I was very happy with in Zeppelin is the built-in support for interactive graphics. Let me go to the demo very quickly and hide this. Here I create a data frame in Spark; it has three columns, a category and then the years. Then I just say z.show. When I run it and enable the output again, there's this very nice plugin (I think they use a JavaScript library, but I don't know which one), and I can look at the data as a table, a bar chart, a pie chart, or a line chart; and with Zeppelin 0.8 there are even plugins for more chart types. This is very important, because when I start the data science process I always look at histograms first, to see the distribution. Before this, I would go to the Spark shell, which is easy enough to use, then create the data, save it to HDFS, download it from the cluster to my machine, put it into Excel, and open it there. It's a whole manual process. Here, I just say z.show, and it shows me the data. If your file is in CSV, the command is something like this: you can go directly through SQL, since Spark SQL can read CSV files, although a little slowly. And these rows are coming from the YARN cluster (I've hidden the path). So this is some kind of data, limited to 10,000 lines in the display. The bar charts are JavaScript-generated, so they live in your browser.

If your data is not structured yet, you can go for a more customized flow. Let me go through this one: I read the file in Spark, split the lines by commas, and count the first column. It's a word-counting example, basically. Then I collect it into a DataFrame and show it. I don't want to run it now; time is important. If you want your standard tools, you can also use matplotlib and draw your ordinary matplotlib plots.

And if your data is not structured in any way, the display system supports custom output. Here, this is in R: if the first line of your output is %table and the rest is tab-separated (it's a little hard to read here, but it's name, tab, size, then a new line, small, tab, 100), Zeppelin can plot anything in that tab-separated format. And if you really want to go for it, it also supports external libraries. For example googleVis, an R library: it can render those plots too. They all run on the server, and the output shows up here.

So this multilinguality is a big advantage when you work in a team, because everything is in one place, everybody can work on the same data, and you prepare one big notebook together. And as I said, our product is on the JVM, and we sometimes have to use our own libraries. Since Spark is also JVM, and Zeppelin is a citizen of that JVM world, we can use our libraries directly. And there's a seamless transition between languages. I haven't shown you that yet: I can run multiple languages, but I can also pass data between languages. And it supports multiple users. OK, a sketch of that handoff first, and then let me walk through the real code.
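Here is a minimal, hedged sketch of the cross-language handoff, assuming Spark 2.x under Zeppelin; the DataFrame, its columns, and the table name are invented for illustration. z is the ZeppelinContext the talk keeps referring to.

    %spark
    import spark.implicits._
    // Build a small DataFrame in Scala and hand it around.
    val hist = Seq(("small", 100L), ("large", 250L)).toDF("size", "cnt")
    z.show(hist)                          // the built-in interactive table/chart
    hist.createOrReplaceTempView("hist")  // now visible to the %sql interpreter
    z.put("hist", hist)                   // now visible to other languages via z.get

    %sql
    select * from hist where cnt > 100

    %pyspark
    # The temp view registered in Scala is visible here through the same Spark session.
    df = spark.table("hist")
    print(df.count())
    # z.get("hist") would instead hand over the raw JVM object via py4j.

The put/get pair and the shared SQL catalog are the two channels everything in the following demos builds on.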
So here I import a function from our own library. It's a string parser, a very basic one. I read the data file (the path is hidden here; it's confidential), take a sample, and run my parser function on it. I don't want to run these right now. So it imports, and it creates my RDD, which is a Spark data structure, and I get these visits. This is still Scala Spark. Then I map over it; this is a kind of word count, where I'm just building a histogram of sizes. And this is the hist variable, which is in DataFrame format; you can see the variables here. Then I say z.show, which I showed you before. I computed these in Scala, and I put the two variables into z, which is the Zeppelin context, the mediator, you could say. And I register this table as a SQL table. Then I can go to SQL, just select * from the table I registered (this one here), and do the filtering on top of the data set I prepared in Scala. I can go to R; I can run a SQL script there, but you can also go for the native language and work on this same data in R. I can go to Python, that is, PySpark, read the DataFrame into Python, and run my Python code. It's a very simple example, but you can do more complex things, of course. Another thing: up to here it's all DataFrames, integers and such. But I can have a custom class in Scala, in JVM languages, write a converter (a kind of serializer for that format), put it into z, and then in PySpark I can read the Java classes and work on them. So these arrays are now Python arrays. It's all Spark, basically; this is all thanks to Spark. But giving people this kind of interface really improves collaboration, because anybody can work with the same data, copy-paste it, or work in the same notebook, even though they come from different backgrounds and different languages.

Security is a concern, of course. Security in Zeppelin is handled by Shiro, which is also an Apache project. Currently we use it on a per-user basis; it supports LDAP and Active Directory. And since the data never leaves the cluster, the security is, well, not always good, but good enough.

The part I love most about Zeppelin is that it's the brainchild of engineering thinking. Jupyter is mostly used by scientists: data scientists, physicists, chemists. So it's very easy to use, very stable, very convenient. Zeppelin is none of those things; it's kind of complex, but it's very customizable. I'll show you one example. As I said, we also work on mobility intelligence; I've ticked most of the boxes here by now, but let me show you this. I collected some data from the MRT lines, using our custom trajectory code, and I put the results on top of a map. These lines are drawn by our system. How this works (let me make it a bit smaller so you can see) is that I prepare the data in Scala Spark here in this box. I can even select which date or what kind of data I'm working on. Then I just run the thing, and it renders this in JavaScript. The point you don't see here is that Zeppelin has no built-in support for maps; this is all custom. Not by me alone, but it's something I put together. What happens is that I had to write this paragraph, which basically generates JavaScript code, roughly like the sketch below.
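A heavily hedged sketch of the trick, not the actual product code: a Scala paragraph prints a string that starts with Zeppelin's %angular display hint, and the front-end renders it as HTML and JavaScript. Everything here (Leaflet, the coordinates, the element id) is invented for illustration, and it assumes the front-end will load and execute the embedded script:

    %spark
    // Hypothetical points; in the real notebook these come out of Spark trajectory data.
    val points = Seq((1.3521, 103.8198), (1.3000, 103.8550))
    val js = points
      .map { case (lat, lon) => s"L.circleMarker([$lat, $lon]).addTo(map);" }
      .mkString("\n")

    // Output beginning with %angular is handed to the browser for rendering.
    print(s"""%angular
    <link rel="stylesheet" href="https://unpkg.com/leaflet/dist/leaflet.css"/>
    <script src="https://unpkg.com/leaflet/dist/leaflet.js"></script>
    <div id="map" style="height: 300px"></div>
    <script>
      var map = L.map('map').setView([1.35, 103.82], 12);
      L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);
      $js
    </script>""")

A real version would also have to wait for Leaflet to finish loading before running the script; this is only meant to show the Scala-to-JavaScript direction of the bridge.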
And because the front-end can interpret Angular and JavaScript (I think I'm out of time; maybe two minutes more to wrap up), I can write my own glue that turns Scala, Spark, or Python output into JavaScript, and extend the functionality from there. This was groundbreaking for me, both in terms of Zeppelin and of doing my data analysis. We always work with geodata; that's our job. But getting data out of HDFS, or visualizing data that lives in HDFS, is very difficult. Zeppelin creates that bridge for me, and it makes our process much faster. And as you can see, it also supports input fields, so you can... oh, OK, that one doesn't work. Let me change the subject. Anyway, you can go crazy with it.

OK, I showed you the examples already. Configuration is a little hard. Because it works on the cluster side, sits on HDFS, has to be secure, and tries to combine many things, the configuration is not straightforward, but it's doable. We had to write some scripts to give to the users, so they just log in to the system they have access to, run two scripts, and we create the hashed passwords for them. That part is still a little manual. And you have to add the jars you use; it also supports Maven, so if it's a public Java artifact and you have an internet connection, you can always download it.

Some challenges: it's very complex, because what it tries to do is complex. I think that's OK, but it does affect stability. It supports a report mode of sorts, as you saw; the paragraphs are resizable. But we cannot use that for production yet, or show it to customers yet. And it requires a connection to the cluster, which is a downside: before coming here, I had to set up a VPN and make sure there was no load on the cluster. So it's a trade-off; yes, cluster load can be a problem. But thanks to this, we have 15 people in five different countries working on data science problems in various combinations, and we can still develop data science products as a team, from exploration all the way to deployment. And everything we do in this framework is testable, because it's still Java and Scala code: we can just write the tests, package it, and deploy.

Yeah, so thanks. This is my contact information. That's all; thank you very much.

So I think we can maybe do one question before we wrap up. "Yes, sorry: after accessing the, what is it called, the cluster, how do you exchange data between all the different languages?" Because it's using Spark, and Spark has an internal representation of the data, everything converts back to a JVM format anyway, so we leverage Spark for that part. But Zeppelin itself also gives us a context called z, and we can just put and get the data, like a frame. Here, for example, I put it into z. The different languages don't always support the same data types, but if you have a serializer that converts them to primitive types (integers, doubles, strings), all of the languages can understand them. It may require a bit of extra work, but it's worth the effort. "And it works locally, within the browser?" No, no: everything you see has been run on the cluster. "Oh, so that would be quite... that's good." I ran most of the examples on the cluster already, so if it's small data, the latency is not that much.
But if it's really big data, you cannot run it on your own computer anyway. OK, thank you very much.