Okay, I think we can start now; there are people here waiting. Welcome to DevConf 2022. My name is Andrey Veselov, and me and Pavel Yadlovsky will moderate this session. Next up we have Ricardo Oliveira and Maulik Shah, speaking about data engineering for Java developers. This is a live session and there will be time for questions at the end. Please use the Q&A section as well. Now let the speaker share his screen, and please begin whenever you're ready. Thank you.

All right, thank you guys. Good morning, good evening, good afternoon, wherever you are in the world. My name is Ricardo. I'm a senior software engineer involved with the Open Data Hub project and the RHODS product, Red Hat OpenShift Data Science, which is focused on AI/ML solutions for cloud-native, hybrid cloud environments. My main experience with data is more on the engineering side. I had the idea for this talk because I know that when we talk about data engineering, people always think of Python or R, but the same work can be done in plenty of other languages; here I'll be focusing on Java developers.

Okay, so here's the agenda for this talk. I'm going to show you, through some demos, that data engineering is not tied to one language: you can be a Java developer and still work in data engineering. I'll show some Java tools focused on data engineering, and even APIs inside Java itself that are useful for manipulating data. And at the end, you can ask me anything about the topic.

Before we move on to the tools, a bit of a disclaimer. This talk is based on my own experience with these topics, which means there may be other solutions that fit better in some scenarios; remember this is all from my experience and, of course, I don't hold the absolute truth. The main purpose of this talk is to show that you can use whatever language you want to learn the techniques and methodologies of data engineering or data analysis, without needing to learn a new language. Just because the majority of professionals use Python or R for data engineering, or even AI/ML, doesn't mean you can't do it with your own language; in this context, I'm talking about Java. So of course I'm not here to prove that one language is better than another, just to show that there are ways for Java developers to go deep into the techniques and methodologies of data engineering without learning other languages. On the other hand, you do need to know the tools of data engineering to go deeper on those techniques and methodologies. So, well, spoiler alert: this talk will be all about tooling, and there will be a lot of demos.

Okay, so remember what I said: when we talk about data engineering, people think about Python or R, but of course you can do the same things with Java, right? This slide is just a parallel between the tools you'd use with Python and the technology or tool that can be used on the Java side. You have pandas to manipulate data; in Java you have the Streams API. For distributed pandas you can use PySpark; Spark can be used from Java too. For MongoDB as a NoSQL store, you have Cassandra. For Airflow data pipelines, you can use NiFi or Argo Workflows, and the latter is a cloud-native solution, so for hybrid cloud environments Argo Workflows is preferable. And speaking of Jupyter, there's one that can be used with Java, and that's Apache Zeppelin, all right?
Okay, so there's the Streams API, which was introduced in Java 8. Since Java 8, you can manipulate data using the Streams API in a way that uses all your CPUs, because the processing can run across multiple cores, and you can do it with a very simple, fluent syntax. So let me show a quick example here. That's a very simple class that gathers this US elections data set, which is my personally curated data set of the vote counts for each county in the past three US elections: 2008, 2012, and 2016. Well, I don't have 2020 yet, but I'll make sure to update it.

All right, let's focus on the good part here. There's this method, which is an example of how to use the Streams API. The other methods are just downloading the file from a URL and storing it on my local computer, and then I can open my file here and use a stream to process my data. Line 47 is skipping the header, and then I'm mapping the data records to a class. Well, Java is an object-oriented language, so you have this constraint that you need to map everything to a class. That's why I needed to create this US elections item class, to make sure I'm mapping all the data from the CSV file. Let's see how it works. Basically, what I'm doing is reading the CSV file, grouping the records, and summing the total votes for 2008. Let's run it. All right, so those are the results: the total votes for the Democrats, the Republicans, and the other parties. Okay, so that's a very simple example of how to use streams with Java (see the sketch below).

Moving on: let's suppose we want to do data processing, but with a huge amount of data, and let's say this could be in a batch or streaming fashion, depending on the scenario. If you think about how to use Java to process a huge amount of data, you can use Spark, which has very useful modules, including ones for machine learning. It's especially useful when you're handling a huge amount of data and you want to run it on a cluster, to make sure you have enough resources to handle gigabytes, terabytes of data, and so on. Here is a very simple example doing the same thing, but the difference is that I'm just printing the schema of the US elections data set. I'm giving Spark four gigs of memory and two cores to run this task, which is way more than enough for just printing a simple schema. I'll get to it later, but keep in mind these settings are important for logging. Okay, let's do this; let's run Maven with Spark.

All right, here I'd like to show something: when we're using Spark, it exposes a web UI on port 4040, so you can check what's happening with your Spark application. So let's look at it: 4040. Oops, what happened? Okay, so here's what happened: if you have a Spark application that runs fast and finishes, you can't see what happened. Think about writing a Spark app and getting errors during data processing. How can you take a look at that? How can you debug it? Since this URL is only published during the Spark app's execution, you need an additional tool for that. So let me show you something here. Here, I'm in my Spark distribution.
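A minimal sketch of the kind of Streams pipeline described above; the file name, column positions, and the record class are illustrative assumptions rather than the actual code from the talk's repository:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamsExample {

    // Hypothetical shape of one CSV row; a record for brevity here,
    // on Java 8 this would be a small plain class.
    record ElectionRow(String county, String party, long votes2008) {}

    public static void main(String[] args) throws IOException {
        try (Stream<String> lines = Files.lines(Path.of("us-elections.csv"))) {
            Map<String, Long> totalsByParty = lines
                    .skip(1)                          // skip the CSV header
                    .map(line -> line.split(","))
                    .map(f -> new ElectionRow(f[0], f[1], Long.parseLong(f[2])))
                    .collect(Collectors.groupingBy(   // group rows by party...
                            ElectionRow::party,
                            Collectors.summingLong(ElectionRow::votes2008))); // ...and sum 2008 votes

            totalsByParty.forEach((party, total) ->
                    System.out.println(party + ": " + total));
        }
    }
}
```

The multi-core behavior mentioned in the talk comes from one extra call: switching the stream to `Files.lines(...).parallel()` fans the same pipeline out across the available cores.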
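And a sketch of the Spark step just shown: printing a schema with two cores and four gigs, plus the event-log settings that matter for the history server coming up next. The app name, paths, and CSV options are assumptions:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSchemaExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("us-elections-schema")
                .master("local[2]")                                // two cores
                // Driver memory normally has to be set at launch time
                // (e.g. via spark-submit --driver-memory 4g); shown here
                // only to mirror the settings mentioned in the talk.
                .config("spark.driver.memory", "4g")
                .config("spark.eventLog.enabled", "true")          // write an event log per run
                .config("spark.eventLog.dir", "/tmp/spark-events") // the path a history server can watch
                .getOrCreate();

        // Read the CSV, inferring the schema from the header, then print it.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("us-elections.csv");
        df.printSchema();

        spark.stop();
    }
}
```

With the event log enabled, every run leaves a file in `spark.eventLog.dir`, which is exactly what lets the history server replay an application after the short-lived port 4040 UI disappears.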
I'm going into the sbin directory, and then I'm going to run something called the history server, all right? Oh well, the history server is already running, which is good. I configured this history server to run on port 8080. So what happens here? I configured my history server to watch this path for the event logs, all right? With that, I can track all the Spark applications that have run, because if you take a look at the logs path here, a log file is created for each application execution, right? So you can drill in there and look at all the problems, if there are problems, in your execution. Because this was just a schema print, it won't show much, but for complex data processing tasks it will show all the jobs and stages that ran, as well as how much memory was consumed per stage, per job, and so on, all right? This is just a simple example of how to use Spark for distributed data processing, all right?

Moving towards a NoSQL solution: meet Cassandra, which is one of the NoSQL data stores that uses the Bigtable approach. One of the biggest things about Cassandra is that it is replication-native. In order to have a stable NoSQL database, you should use Cassandra's replication scheme, so you need at least three instances of Cassandra to have your data replicated across all the nodes. With that, it's possible to make sure your data will be available no matter what happens. You also have the advantage of using a SQL-like query language, CQL, as well as other things, all right?

So that's my terminal. I believe I don't have Cassandra running right here, but let's check that I'm in the right path, all right? So let's run Cassandra; the command is just cassandra. Okay, it says the address is already in use; yeah, that's because I already have Cassandra running. Okay, so moving on to this class, where I have a very simple example that creates a keyspace and writes data into a Cassandra table I created for the US elections, all right? (A sketch of this flow with the Java driver follows a bit further below.) Okay, let me comment this out; this is just another query I'm running to select all the data in the table. Let's run this one, let's see. Okay, this failure was because of this part here; it can be deleted. It was an attempt to load the whole CSV file into Cassandra, but Cassandra's COPY instruction belongs to the CQL shell and can only be used from the CLI, not from Java or any other client. All right, sounds like I'm having problems here. What's happening? Let me kill that, not fun. Okay, let's see if at least I have something in my keyspace: use the US elections keyspace, and then let's query everything from this table.

I'm very sorry to interrupt, but it's just a reminder that there are three minutes left, just to let you know. Thanks.

Okay, thank you. So here we have some data I created earlier, but I'm sorry, something happened with that code and it didn't work; that's okay. Moving on to the last two tools, NiFi and Argo. I won't show an example of these, because I have a very simple data processing scenario and NiFi and Argo are meant for complex data pipeline solutions, but here's a quick comparison. NiFi can be used as a visual data pipeline builder with many connectors available, but its biggest disadvantage is that it's not yet a cloud-native solution. That's different from Argo Workflows, where you build data pipelines declaratively using YAML, and it's Kubernetes-native, right?
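Coming back to the Cassandra demo: a minimal sketch of the keyspace/insert/select flow using the DataStax Java driver 4.x. The contact point, datacenter name, keyspace, and table layout are assumptions, and the replication factor of 1 is only for a single-node demo; a real deployment would replicate across at least three nodes, as mentioned above:

```java
import java.net.InetSocketAddress;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")   // default DC name on a local node
                .build()) {

            // Replication factor 1 is only acceptable for a local demo;
            // a stable cluster replicates across at least three nodes.
            session.execute("CREATE KEYSPACE IF NOT EXISTS us_elections "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");

            // Hypothetical table layout for the vote counts.
            session.execute("CREATE TABLE IF NOT EXISTS us_elections.votes "
                    + "(county text PRIMARY KEY, party text, total_2008 bigint)");

            // Positional values are bound to the question marks.
            session.execute(
                    "INSERT INTO us_elections.votes (county, party, total_2008) VALUES (?, ?, ?)",
                    "Some County", "democrat", 12345L);

            // Read everything back; ResultSet is iterable over rows.
            for (Row row : session.execute("SELECT * FROM us_elections.votes")) {
                System.out.println(row.getString("county") + " -> " + row.getLong("total_2008"));
            }
        }
    }
}
```

Note there is deliberately no COPY here: as the talk points out, COPY is a cqlsh shell command, so bulk CSV loading from Java has to go through ordinary INSERTs or a separate loading tool.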
Those are tools that can be used to build any complex data pipeline that needs to pull from many data sources and do many manipulations on the data, okay? Moving on to the last tool, Apache Zeppelin, which is kind of a Jupyter with lots of interpreters available, so there are lots of kernels, and it's good for dashboards too, all right?

Okay, I have 20 seconds, but I might need one more minute for this, so I'm sorry for using one minute more. Okay, so I think I have Zeppelin running here. Let me see; nope, I don't. Okay, so let me get the command; I'll need to get it from the readme. So, okay, let me do this instead, which is better: I'm going to run Zeppelin from a container, which makes things easier, and I'm mapping my local path to show the dashboard I created in my repository. Basically I'm using Spark, with Scala, to load my CSV file, and then I generate a table from it. With that, I can use other paragraphs, which is what they call these small pieces here, to query everything from the table or to make charts using a select statement, all right? (A plain-Java sketch of that load-and-query step follows after the wrap-up.) A good thing about Zeppelin is that you can hide the editor; for example, I'm hiding all the select statements, and then you can switch to a report view. So look at the results: you can make a full dashboard using just simple SQL and very small pieces of code. And Spark is already connected to this instance of Zeppelin, so you don't need to know how to connect to Spark; you just use the spark object that Zeppelin makes available.

And I'm very sorry, I'm taking two more minutes, but if you want to follow all the demos I showed in this talk, you can just follow the GitHub link I have here. There you'll find the data set I used, all the classes I showed, and a readme file with instructions on how to run Zeppelin using my notebook as an example, all right? My Twitter is @Himmulid; sorry, that's just the Brazilian way of saying this username, and I don't think it would sound good in my English accent. There's my LinkedIn too. And well, I'll conclude my talk here, if you have any questions.

All right, thank you, Ricardo.
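For anyone following the repository afterwards: the load-and-register step described in the Zeppelin demo (done there in Scala against the notebook's built-in spark object) boils down to something like this in Spark's plain Java API. The file name, view name, and column names are placeholder assumptions:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ZeppelinEquivalent {
    public static void main(String[] args) {
        // Inside Zeppelin this session already exists as the `spark` object;
        // standalone, we have to build one ourselves.
        SparkSession spark = SparkSession.builder()
                .appName("zeppelin-equivalent")
                .master("local[*]")
                .getOrCreate();

        // Load the CSV and register it as a temporary view...
        spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("us-elections.csv")
                .createOrReplaceTempView("us_elections");

        // ...so later "paragraphs" can query it with plain SQL,
        // which is all the dashboard charts are built from.
        Dataset<Row> byParty = spark.sql(
                "SELECT party, SUM(total_2008) AS votes FROM us_elections GROUP BY party");
        byParty.show();

        spark.stop();
    }
}
```

Each SQL paragraph in the notebook corresponds to roughly one such `spark.sql(...)` call, and the charts in the report view are just renderings of those result sets.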