Good afternoon, and thank you for attending this talk. Attendance is about medium; I was actually expecting fewer people, since it's a very specific topic, so for me this is a very good turnout. Let me introduce myself. My name is Jakardo. I'm a senior software engineer working for the Office of the CTO, in a group called the Artificial Intelligence Center of Excellence, and I work on the Open Data Hub project, where I'm mainly responsible for research on data engineering tools. That calls for a disclaimer: this is more a data engineering and analytics talk than a data science talk, so you have been warned. Now for the agenda. I'm going to talk about, and demo, Jupyter and Apache Zeppelin, and after you see how both tools work, I'll go through a feature matrix comparing them, and then we'll reach the conclusion. All right. You might wonder about the talk title; I was inspired mainly by this picture, which I think sums it up well. Maybe you're expecting me to make these tools fight and see who wins. Well, let's see who the winner is. All right, no more preamble; on to the tools. First, Jupyter. I think everyone knows Jupyter, because it's widespread across the community and practically all data scientists use it. A few bullet points about it: Jupyter is a web-based notebook that you can use as a computational platform, mixing live code, equations, text, visualizations, dashboards, and other media. What you can run depends on the language engine, called a kernel, behind the notebook, and there are many kernels available for Jupyter; you just need to configure the one you want. It's a very useful tool.
The community is huge and there are tons of examples out there. So let's see how it works. Jupyter is a Python package, so all you need to do is install it with pip and run `jupyter notebook`. Running that command gives you the Jupyter Notebook server. There's an alternative command, `jupyter lab`, and I'm going to run it so you can see both interfaces. Just give me a moment to resize. Can you all see the font now? Better? Okay. So this is the Jupyter Notebook interface, which is slightly different from JupyterLab. Jupyter Notebook is a very simple server for running your notebooks; if you want something aimed at multiple users and multiple kernels side by side, JupyterLab is another option, but underneath they are much the same. Okay. So I created a very simple notebook integrated with Spark, PySpark actually, plus a very simple data set. Let me run some of the cells. One thing to note is that I'm setting some environment variables in this cell so that PySpark downloads extra packages, in this case the S3 and Amazon AWS dependencies needed to fetch a data set from an S3 bucket. Okay. So: create the Spark session, configure S3 access, then go to the S3 bucket and get the file. It will take some time because it needs to install the packages... oh, it's already there. Then there are a few tweaks I needed for the data set, and then comes the part where I actually work with the data. So all I did was create a Spark connection, load a data set from an S3 bucket, and convert it to a pandas DataFrame in this cell here. Now I'll show some of the visualizations I can do with that. First, a pie chart.
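The setup cells just described can be sketched roughly as follows. This is a minimal, hypothetical version of that boilerplate, not the demo's actual code: the package coordinates, bucket, key, and credential variable names are all placeholders, and the Spark part is kept inside a function so the sketch can be read without PySpark installed.

```python
import os

def submit_args(packages):
    """Build the PYSPARK_SUBMIT_ARGS value that tells PySpark to fetch
    extra jars (here: the S3/AWS connectors) before starting."""
    return "--packages " + ",".join(packages) + " pyspark-shell"

# Hypothetical package coordinates; versions must match your Hadoop build.
AWS_PACKAGES = ["org.apache.hadoop:hadoop-aws:2.7.3"]
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args(AWS_PACKAGES)

def load_from_s3(bucket, key):
    """Create the Spark session, point it at S3, read the CSV, and hand
    the result to pandas for plotting. Placeholder names throughout."""
    from pyspark.sql import SparkSession  # local import: sketch only
    spark = SparkSession.builder.appName("notebook-demo").getOrCreate()
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    hconf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    df = spark.read.csv(f"s3a://{bucket}/{key}", header=True)
    return df.toPandas()
```

This is exactly the kind of infrastructure code the talk argues an analyst should not have to write by hand.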
And this is going to return an error, and I'll explain why in a moment. And this one is a bar chart using two levels of grouping. So this is very simple code, but there are some things I'd like to point out in this notebook. First, I had to write code that I shouldn't need to know about: if all I want is to run some data analysis, the code from cell one to cell three shouldn't be necessary for me. My job is to analyze data; I don't need to know the details of where Spark is, where the data set lives, or even how to create the Spark connection. That's one of the things that annoys me a bit about using Jupyter. Then there's the other thing, about the second visualization. What I was trying to do here was create a bar plot of the estimated cost feature in my data set, grouped by trip region. This data set is basically about sales opportunities I had: how many days I need to spend, what the estimated cost is, and in which region I'm going to deliver the service. The problem is that when I created this pandas DataFrame from the data set in the S3 bucket, it was created with generic types, which made every column a string. This is easy to check with dtypes: as you can see, everything has the same type. So what happened here is that pandas didn't infer the data types in my DataFrame, and therefore couldn't treat special fields like numbers for summarizations such as the sum I was using in this visualization, this sum here. Anyway, I could fix it easily; the notebook is basically a Google-Doc-like page that can run Python code, so with a quick cast I could generate the visualizations from the DataFrame anyway. All right, cool. So that's Jupyter.
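The dtype problem just described, every column arriving as a string so an aggregation can't work on numbers, can be reproduced and fixed in a few lines of pandas. The column names here (`region`, `estimated_cost`) are stand-ins for the demo's actual schema:

```python
import pandas as pd

# Simulate a DataFrame that arrived with generic (string) types,
# as happens when type inference is skipped on load.
df = pd.DataFrame({
    "region": ["EMEA", "EMEA", "APAC"],
    "estimated_cost": ["100", "250", "300"],  # strings, not numbers
})
assert df["estimated_cost"].dtype == object  # everything is the same generic type

# Cast the numeric column explicitly; then the groupby-sum works.
df["estimated_cost"] = pd.to_numeric(df["estimated_cost"])
totals = df.groupby("region")["estimated_cost"].sum()
# totals now holds EMEA -> 350, APAC -> 300
```

Without the `pd.to_numeric` cast, the same `sum` either fails or concatenates strings, which is the error shown in the demo.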
There are other features in JupyterHub and also JupyterLab that can make this experience better, like supporting multiple users starting their own notebooks, adding a logging mechanism, authentication, authorization, and so on. But let's focus on the notebook itself. Okay, so moving on to Apache Zeppelin. What is Apache Zeppelin? The project is fairly new compared to Jupyter; it belongs to the Apache Software Foundation and is still an incubating project. So what are Zeppelin's main features? It has the same intention as Jupyter, to create notebooks where you can mix text, interactive data visualization, equations, and everything else. However, its data visualization tools are better and fairly native, and there are multiple language back ends involved, so you can use multiple languages, just like Jupyter. But there's a key difference in Zeppelin compared to Jupyter: in the same notebook, you can use as many languages as you want. In Jupyter, a notebook is bound to a single kernel for a single language, so if you create an R notebook, R is the only language you can use inside it. Zeppelin is different: each paragraph can target a specific back end. For this, Zeppelin uses the concept of interpreters. All right. You can also share your notebooks and paragraphs and deploy them for a single or multiple users, things Jupyter can do as well. All right, so let's go to Jupyter... well, sorry, Zeppelin. Zeppelin is a bit different: it ships as a whole distribution that you download, uncompress, and run the binaries from. It's based on Java, so you need a JVM; I think for many of you that's already an advantage of Zeppelin over Jupyter. Well, anyway, let's see what Zeppelin can do. It's starting... now let's go to the dashboard. This is the Zeppelin dashboard, and there are some different things here.
So this is the part where you manage your notebooks. And as you can see, there's this anonymous login: in the default Zeppelin distribution you log in as an anonymous user, but you can plug in authentication back ends, like LDAP and others. You can also define a remote location to store your notebooks, like an S3 bucket, a Git back end, Azure, or even a MongoDB back end. Just as with Jupyter, I created the same notebook with the same visualizations, but can you see the difference in this notebook? How many lines do I need to load my CSV file and create the visualizations? It's much leaner, right? Maybe not necessarily better, but leaner. And maybe you noticed the percent sign specifying the kind of interpreter I'm using, paragraph by paragraph. Right below this paragraph, I'm using SQL. So what's the difference? I'm using different back ends in the same notebook, right? In this case the SQL paragraph depends on the one above it, so let me run that one: it reads the CSV file and, using Spark SQL features, creates a table called sample_data. Okay, it's already created. Now, in every SQL paragraph I have, I can just run a query and it will generate this interactive data visualization. And I can make changes, like choosing a simple table, a pie chart, an area chart, a line chart, a scatter plot, whatever I want, with the legend already there. So far so good. Okay. You can also download this data as CSV or TSV if you want. Then, like in Jupyter, I have the same SQL statement to generate the other visualization, the bar plot. Let's see what happens here. Oh, so Zeppelin definitely could create it, because the PySpark paragraph that generated this table let Spark infer the metadata: it detected all the columns that should be treated as numbers, and because of that, I can create this visualization. So it's a bit different in this case. And the other one?
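The two paragraphs just described look roughly like this in a Zeppelin note. This is a sketch, with a placeholder file path and an assumed table name: the `%pyspark` paragraph registers the table, and the `%sql` paragraph renders its result as the interactive chart.

```
%pyspark
# read the CSV and expose it to Spark SQL as a table
df = spark.read.csv("/path/to/sample.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("sample_data")

%sql
-- every %sql paragraph renders its result as an interactive chart
SELECT region, SUM(estimated_cost) AS total_cost
FROM sample_data
GROUP BY region
```

Because `inferSchema` lets Spark type the columns, the SQL aggregation works without the manual casting the pandas version needed.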
It's the same bar plot I had in the other notebook, but without doing anything extra I could get two levels of grouping, and a better visualization, in this Zeppelin notebook than in the Jupyter one. Right? Okay, what else? For each paragraph, you can get a link to just that paragraph. So if you want to share that paragraph in some kind of report, you can take that link, import it into your HTML code, and have the same visualization there, still interactive. Right? Okay, I think that's all, at least for Zeppelin itself. All right. Okay, so just to keep things straight, Jupyter and Zeppelin name some concepts differently. What Jupyter calls a kernel, Zeppelin calls an interpreter. What Jupyter calls a cell, the piece where you put code and run it independently of the others, Zeppelin calls a paragraph. And Jupyter's notebook maps to what you could call a Zeppelin note or dashboard. Oh, there's one thing I was forgetting about Zeppelin, and it's important. I'm running PySpark code, right? So where's the code that creates this connection? Good question: where is it? That's one of the things that, in my opinion, makes Zeppelin a little better than Jupyter in terms of user experience. All I need to do is go to the interpreter configuration: just click on the username and go to Interpreter, and there it is, the last one, Spark. The interpreter is just like a kernel in Jupyter; I'll stick to each tool's own terms to avoid confusion, kernels in Jupyter and interpreters in Zeppelin. The difference is that when you create an interpreter in Zeppelin, it creates all the objects you need to use that technology, whether it's Spark, Cassandra, or Elasticsearch.
So all you need to do is put the configuration in this interpreter settings page, and then specify what kind of paragraph you're using with one of these headers: %spark, %spark.sql, %spark.dep, %spark.pyspark, or %spark.r. When you use one of these headers in your paragraph and run it, Zeppelin instantiates the interpreter and creates those objects for you, the SparkSession and SparkContext objects. All right. You can run the interpreter globally in a shared process, you can set permissions on it, and in this case I'm using a local master, but a remote master can be used as well. All right. Okay. And with that, there's another cool thing you can do: once you've run a paragraph, you can go to this interpreter page, open the local Spark UI, and inspect all the code I ran in this notebook through the Spark dashboard. These, for example, are the SQL queries I ran. There it is. Okay. For me, this is one of the features that makes Zeppelin the best in this case: I don't need to care about the location of the back end I'm going to use, whether it's Cassandra or, in this case, Spark. I just need to know that there's an object with a running connection ready for me, and I can use it throughout my notebook. Right. One more quick thing: you can also monitor all the Spark jobs in here. All right. Okay. So, given all this comparison, maybe you're already asking yourself, or expecting, that I'm going to hand the trophy to the winner. Well, I'm going to say who the winner is: the winner is you. All of you. The users are the winners, because we have two very good tools for the same job. But let me give my view of where each one fits. As I said at the beginning of this talk, Jupyter is perfect for data scientists because it's simple, and it has a strong community.
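The interpreter settings mentioned above are plain key-value properties on Zeppelin's Interpreter page. A minimal sketch might look like this; the values are illustrative, and a real deployment would point `master` at a cluster rather than a local process:

```
# Zeppelin Spark interpreter properties (illustrative values)
master                    local[*]     # or spark://host:7077 for a remote master
spark.app.name            zeppelin-demo
spark.executor.memory     1g
zeppelin.spark.maxResult  1000         # max rows returned to the notebook UI
```

This is the whole point the talk makes: an admin sets these once, and every notebook simply uses the `spark`/`sc` objects the interpreter provides.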
There are lots of examples of anything you can do in Jupyter; Google it and you will find a great example of how to use the technology. Right. As for Zeppelin, well, as I said, it's an incubating project, so it's still in the phase of maturing the code, but I look at Zeppelin favorably: there are many cool things you can do with it. Right. So my expectation is that Zeppelin can be very good for data analytics. And even the few examples Zeppelin ships with are useful: when you open the Zeppelin page, you already have a folder called Zeppelin Tutorial with good examples, including other technologies. I mean, Spark probably covers the majority of the examples most people need, so you're already covered there. Right. So that's what I think: both tools can live together in a business environment. Why? Because one is good for quick data exploration and the other can create powerful, even interactive, data analytics. I mean, there's something I didn't show you in Zeppelin that can be useful. Like Jupyter, there's a markdown interpreter, and with it you can write something like "hello" and add a special placeholder saying this value should be parameterized. When I run this paragraph, look what happens: it creates a form for me. So I can type anything here, like "hello devconf", hit enter, and the value is there. And it's not only markdown that can do that: SQL paragraphs can also be made interactive using these forms. Well, there are many more things to say about Zeppelin and Jupyter, but this is basically all my knowledge about these tools. I think they are both great, each for specific scenarios, and they can live together in the same team.
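The parameterized paragraph demoed above uses Zeppelin's dynamic forms: a `${name=default}` placeholder in the paragraph body becomes an input form when the paragraph runs. A sketch, where the form names, table, and column are placeholders:

```
%md
Hello ${name=devconf}

%sql
-- the same form syntax makes SQL paragraphs interactive
SELECT * FROM sample_data WHERE region = '${region=EMEA}'
```

Editing the form value and re-running the paragraph re-renders the text or the chart with the new value substituted in.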
Jupyter for data scientists and Zeppelin for data analytics. Right. So these are the references I used: the two project websites, of course, and this particular one, stackshare.io. It's a community website where people create tool comparisons; the community goes there and adds bullet points about which features one tool has, which the other has, and so on, and it ends up as a good dashboard for comparing tools. And it's not just Jupyter versus Zeppelin; there are many other comparisons, so I think this might be a useful link for you. I'll wait a moment for people to settle. Okay. So, as I was saying, I'm part of the Open Data Hub project team, which builds AI workloads for OpenShift and Kubernetes environments. So what's the status of these tools with respect to Open Data Hub integration? First, Jupyter, since it's so widely used in the community, is already there: just go to opendatahub.io and you'll find the JupyterHub deployment there as part of many other deployments. We have components for data engineering, for deploying models, monitoring, and so on. As for Zeppelin, there's a work-in-progress repository to create an image for running Zeppelin on containers, and soon we're going to create an operator for it in order to deploy on Kubernetes and OpenShift instances. All right. Well, I know the whole talk may seem plain; all I'm trying to do with it is make this comparison and show one good, powerful alternative for data analytics, or even just for running your own notebooks. I mean, I like Jupyter as much as I like Zeppelin, and that's why I think both should have the same visibility in the community, right? So I think that's it. If you have any questions... I think there's one. Right, I'll repeat the question and you tell me if I got it right. Okay, so the question is: if I use a Python paragraph and an R paragraph, can I share the same variables between them? Yes, you can.
There's a special mechanism where you can inject an object into Zeppelin's internal context and then share those variables across any other paragraphs, no matter whether they're Python or R. It's a bit cumbersome to do today, but I don't think they will keep this complexity for long; I think they will find an easier way to do it. In our example, using SQL and PySpark, I just created a table and the table was available to the other SQL paragraphs, so in that case it was simply native; for the other languages, like Python and R, you need to inject these variables manually. Okay, next question. If I understand it, is there a situation where Jupyter is not a good fit but Zeppelin is? Okay, so it's more about notebook migration between Jupyter and Zeppelin. Yeah, I don't think they have any kind of compatibility, and I don't think they are working on that, because they have different formats. Although, if I remember correctly, you can export both as a JSON format, the fields are not the same, so they're not really compatible. So yes, if you create your notebook in Jupyter and then want to use Zeppelin, you need to create the notebook again; there's no compatibility between them. Right. Next question: does Zeppelin have any production-ready deployments? Yes, it does. Cloudera uses Zeppelin: to do analytics, Cloudera provides the Zeppelin interface for creating these dashboards. Okay, next question: the performance difference between Zeppelin and Jupyter? Well, I may be a bit biased here, but let's not forget that Jupyter is a Python application and Zeppelin runs on Java, so that's one thing to consider. I can't say for sure, but in Open Data Hub we tried to use JupyterLab, and the problem is that it's a CPU-intensive application, so we decided to use plain Jupyter Notebook with a special spawner called the KubeSpawner.
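The cross-language sharing described in the answer above goes through the `z` object, the ZeppelinContext that Zeppelin injects into paragraphs. A sketch, with `my_df` as a placeholder name, putting an object from a PySpark paragraph and reading it back from a Scala one:

```
%pyspark
# put an object into Zeppelin's shared context under a key
z.put("my_df", df)

%spark
// ...and read it back from another language's paragraph
val myDf = z.get("my_df")
```

This is the manual injection the answer refers to; only Spark SQL tables are shared natively without it.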
So when you log in to OpenShift, it gives you a button to deploy your own dedicated notebook server, using the JupyterHub-spawned notebook but with special parameters, like configuring S3 credentials, other environment variables, and so on. So, yeah. It's not a fair comparison but, you know, let's not forget that they have differences: they are implemented in different technologies, so each has its pros and cons. There are two more questions. Right, so the question... yes, if I understand it, the question is about the collaboration features of both tools, and I owe you on that one, because I only showed how to collaborate and share paragraphs in Zeppelin and showed nothing in Jupyter. Yes, there are collaboration features, things like sharing notebooks, with JupyterHub, but in that area I'm not so experienced; my experience with creating collaborative reports is mostly with Zeppelin. I was trying to show something here, but since I'm using localhost and there are some specific things to set up, I just created some fake reports: all I did was put the paragraph links in an iframe inside this HTML, which works only because it's localhost and a very default distribution. I didn't have time to configure Zeppelin properly, but a well-configured Zeppelin server can give you a bookmarkable URL for a paragraph that generates a visualization, and you can embed that in any HTML file, right? So the next question is about the notation I used in each notebook, that is, getting the data and what I'm going to do with that data inside each notebook, right?
Yeah, so the intention was just to show a very simple use case: get the data, do some quick exploration, and create some simple visualizations. But I put Spark in the mix deliberately. Think about it: I'm talking about a very simple data set of no more than one megabyte, but imagine you're handling gigabytes or terabytes of data. How do you handle that in such notebooks? Well, that's why I added Spark to this role: it can help you handle large amounts of data. But in Jupyter you need to know how to configure Spark yourself to get the best out of the tool and handle that data without headaches, because Spark needs some fine tuning to deal with that much data. Now think about Zeppelin: I don't need to know how to configure it. I work on the assumption that someone deployed Zeppelin and created the best Spark interpreter configuration, so that whatever data set I request, no matter if it's megabytes, gigabytes, or terabytes, everything is well configured for whatever I want in my analysis. So this is one of the biggest differences I see between Jupyter and Zeppelin: the development experience. I need only a few lines in Zeppelin to do something that takes more lines in Jupyter, and I need to know more about the underlying technologies in Jupyter than I do in Zeppelin, right? Does that make sense? Any other questions? All right, thank you.