So my name is Zach Hassan. I work at Red Hat as an engineer on the data analytics platform. The project I work on is called radanalytics.io, and we use Apache Spark to power our analytics engine. Today I'm going to be talking about how you can use Apache Spark to build data pipelines.

The agenda for today: I'll talk a little bit about data preparation, which is pretty important before you build a data pipeline — you need to make sure the data is in a format that's structured, usable, and queryable, and we'll talk more about the criteria for what a good data format is. We'll talk about different technologies you can use to extract data from different systems. And then we'll talk a little bit about stream processing and how you can run Apache Spark in containers on OpenShift. If you don't know OpenShift, it's our platform-as-a-service built on Kubernetes. Everything I'm using here is open source — things you can download and use.

So let's get started with data preparation. When you're preparing your data, you're getting data from so many different systems, and one big problem is that you're dealing with so many different data types and systems you have to integrate with. You have sensor devices speaking different protocols — there's MQTT, there's AMQP, there are so many protocols these devices might use — that you want to consume from and then run analytics and other things on top of. You might also get a lot of social media data and want to do sentiment analysis on it, which is something people use Spark for. It's anything from tracking inventory to just trying to get more value out of the data through analytics.

So let's talk about one technology in particular that I find very useful for data preparation. When you're getting data from different systems, what often happens is you write integration code: code to connect to MongoDB, code to connect to some JDBC source, code for the MQTT or AMQP protocol. The nice thing about Apache Camel is that it has pre-built components you can use to connect to those systems. Camel is open source, and it has over 200 connectors you can use to connect to all these different systems. The pattern is always the same: get data from a source to a sink.

So what does code like that look like? With Camel, you can write the route in Java. You get data from some location — here we're using a file component, but that can easily be replaced with, for example, the Kafka component or the MongoDB component; you just swap out the endpoint. As you can see here, I'm pointing at the word "extract". The "from" statement is very declarative: you don't have to worry about making this code connect to MongoDB or all these other systems — that's already done for you. And then when you get your data, it goes through the next step, which is chaining in a bean. You pass your data through a bean, and in that bean you can write custom code for validation or anything else you want to double-check.
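As a minimal sketch, a route of that shape looks something like this — shown here in Camel's XML DSL rather than the Java DSL on the slide, with endpoint URIs and a bean name made up for illustration, not the exact ones from the demo:

```xml
<route>
  <!-- Source: poll files from a directory named "extract" -->
  <from uri="file:data/extract"/>
  <!-- Chain in a bean for custom validation logic -->
  <bean ref="orderValidator" method="validate"/>
  <!-- Sink: publish the validated payload to a Kafka topic -->
  <to uri="kafka:orders?brokers=localhost:9092"/>
</route>
```

The same route can be written with Camel's Java DSL — from(...).bean(...).to(...) — and swapping the source from a file to MongoDB or MQTT is just a change of URI.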
And then you want to send it to a sink — that's the "to" statement there. You're loading the data somewhere. What you could do is send that data to Kafka, and once the data is in Kafka, Spark Structured Streaming can pick it up and do more with it. We'll talk about this, and there's a demo at the end of this talk.

So let's first get a very basic concept down — I won't spend too long on this. You have a sender, you have a receiver, and you have a message. When you send a message, you're sending it somewhere and somebody else is receiving it. In the messaging world, we call these the producer and the consumer.

Things can get more complex: the more systems you add, the more the rules change — different criteria, different business requirements — so you're going to have to do different things. In this diagram, a new order comes in and I'm using a wiretap: as soon as a message arrives, I make a copy and send it somewhere else. The message goes through, then I split it, send it through a recipient list, and also apply a filter here. It's not very visible on this screen — let me see if I can make it a little bigger. It's hard to read, but it's basically XML containing a filter rule: if the criteria matches, it logs the message. (I'll show a readable sketch of that kind of rule in a moment.)

So let's talk a little bit about Kafka. Now that we understand the data preparation portion — we've validated the data, cleaned it, done everything we needed to prepare it — we send it to a messaging queue, or a topic in particular. Apache Kafka is an open source project: a distributed, fault-tolerant, replicated commit log. It uses ZooKeeper for high availability, and messages arrive in what's called a topic. As you can see over here, the producer sends the message and it arrives in Kafka; when the message is received, the consumer picks it up and does further processing.

Traditionally people use HTTP: they write web services and call one web service from another. This is more fire-and-forget: you fire the message, you let the consumer pick it up, and you let the consumer do what it needs to do to complete the task. That's what we use Kafka for. The nice thing about Kafka is its really rich ecosystem: there's a connector for Apache Spark, Apache Camel ships a Kafka component for sending messages to Kafka, and there's also Kafka Connect, which is another thing you can look into.

So let's talk a little bit about data formats. Everything has a schema, whether it's schema-on-write or schema-on-read. The important thing is that when you have schemas, you want them versioned and documented. What often happens at some companies is that an engineer is hired to do a particular job, creates the schema, and then they're gone — and there's no documentation for what a field means, and it takes a long time to track down what those fields actually mean. So it's very, very important to have schemas, and to have them versioned and documented as well. And we'll talk about different types of schemas.
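Before we get to formats — since that filter slide was hard to read, here's an illustrative reconstruction of the kind of Camel filter rule it showed (the endpoint and the criterion here are made up, not the ones from the slide):

```xml
<route>
  <from uri="activemq:queue:newOrders"/>
  <filter>
    <!-- Only messages matching this predicate pass through -->
    <xpath>/order[@type = 'widget']</xpath>
    <!-- Log the matching message -->
    <to uri="log:acceptedOrders"/>
  </filter>
</route>
```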
So let's take a look at what different data formats are out there. I'll mention about six of them that are pretty common. There's plain text — you can put all your data in text. You can choose JSON, which is pretty popular. Then there's Avro, Parquet, Thrift, and Protobuf. All of these have their own advantages and disadvantages, and I've marked with a check or an X which ones have which features.

You can store all your data as text in Hadoop and you're okay, but you're missing compression — and you want compression to improve performance — and you want a schema so you know which value is associated with which field. With JSON, you have an array of objects within the JSON; it's not splittable, you can't really compress it well, and it doesn't have a schema — it's semi-structured, so you can add more fields to it if you want.

Avro is a particularly interesting data format and a good one to use, and I'll show a demo of working with Avro and also talk about working with Parquet. Those are the two at the top that check all the boxes in the list: the compression is good, they have a schema, and they're splittable. Then we have Thrift and Protobuf, which have schemas but aren't so great in the other categories. So the two that check off our whole list are Avro and Parquet, and I'm going to talk a little more about each of them.

Avro is a data serialization format — you're storing your data in a particular format — and one of its particular features is that it has an external schema file. I'll show you what that looks like: the schema file is JSON, you can store and version that file, and that is your schema. You then use that schema to read the data and generate code — basically, to work with the file. There's one Java library that I happen to really like, Jackson's binary data formats module (jackson-dataformats-binary); if you use Java, that's a good library for you. And then there's the Avro tooling, which is really cool: if you want to see what's inside an Avro file, avro-tools gives you a way to deserialize it and print it out on the terminal, and I'll show you what that looks like.

So this is what a schema looks like in Avro. As you can see over here, you have the type, which is record. You give it a name, you give it a namespace, and you can write some documentation in there, which is really helpful — anybody who looks at this schema will see the documentation associated with it. Then there are the fields, and this one here is optional: whenever you write a union of null with string, or null with int — null with whatever it is — that means the field is optional, so you can choose to provide it or not. And I'll show you how the schema differs when we look at Parquet.

Before we go further: feel free to stop me at any time and ask questions at will — just wanted to let you all know.

So let's look at a demo of all this. We have a file called customer.avro. It's a serialized file: I took some data, serialized it, and stored it in this file.
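For reference, a schema along the lines of the one on the slide looks roughly like this — an illustrative reconstruction matching the demo's customer records, not the exact file:

```json
{
  "type": "record",
  "name": "Customer",
  "namespace": "com.example.demo",
  "doc": "A customer record used in the demo",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id", "type": ["null", "int"], "doc": "The union with null makes this field optional"}
  ]
}
```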
So a developer's first reaction would be: okay, I'm going to just cat this file and see its contents. And when I do that, I get junk. As you can see, there's a bunch of junk — we get some schema bits, but where's the data? I want to see the data. If instead I use the Avro tooling, it's as simple as "avro-tools tojson customer.avro" — there are other output options, but tojson is one way to do it — and there you go, you've got your fields: the names are Steve, id 1; Jeff, id 2; and Justin, id 3.

Okay, so let's look at the Parquet file and see if we can cat it out in the terminal as well — let's see if we're going to have the same problem. When we cat the Parquet file, we get a lot of junk too. As you can see, we have all this junk here, but we don't have the data we're looking for.

Now, in this particular example I'm only adding a couple of fields and such. But if you're choosing between these formats and you're looking for performance: if you're row-oriented — doing queries over whole rows — then I would go with Avro. If you're doing column-oriented queries, then I would choose Parquet, because Parquet uses columnar storage, and column values are stored together within an HDFS block. So I would choose Parquet if you're looking to query columns. I hope that answers it.

So, as you saw, I got a lot of junk with the cat command. Let me clear out this terminal — there's another tool you can download to actually query Parquet, called parquet-tools. It's similar to avro-tools but has different commands. Let's take a look at that. We basically run "parquet-tools cat rock-and-roll.parquet", and as you can see we get our output: the product type, product ID, price, quantity, and so on.

Parquet is a little different because the schema is actually embedded in the file itself. So if we want to see the schema, we can just run "parquet-tools schema", and as you can see, our schema has an id, product category, name, and price. And now let's get the schema for that customer Avro file we were looking at, using the terminal tools: "avro-tools getschema customer.avro" — there we go, we have our schema. There's the name, the namespace, the documentation, and the id field, which is optional as well.

Okay, so now that we've covered the tooling around Avro and Parquet, let's talk a little bit more about Parquet. One moment. As I mentioned, it's a data serialization format, kind of like Avro, but oriented toward columnar storage. I'm sure you've heard of columnar technology before — you've probably heard of ORC. So people often ask: why don't I just use ORC? You could use ORC. But if you want to operate outside of Hive, it's better to work with Parquet, because you can use it in Hive and you can use it outside as well, and you can query it with Spark SQL — we'll look at the demo for that — and it integrates great with Spark.
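Since we're about to compare Hive and Spark SQL, here's a minimal sketch of the Spark SQL side — the HDFS path and column names are stand-ins, not the demo's actual ones:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-query").getOrCreate()

# Load the Parquet data (the schema comes embedded in the files) and
# register it as a temporary view so we can run SQL against it.
df = spark.read.parquet("hdfs://namenode:8020/data/products")
df.createOrReplaceTempView("products")

# Plain SQL over the columnar data...
spark.sql("SELECT product_id, SUM(quantity) AS total "
          "FROM products GROUP BY product_id").show()

# ...or the equivalent declarative DataFrame API.
df.groupBy("product_id").sum("quantity").show()
```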
So let's look back at what life could be like with Hive, and then we'll look at working with Spark. Hive is more pure SQL: you write SQL to create tables and then you do select statements and so on. This is what Hive looks like if you take that Parquet file, put it in a table, and query that table.

What if you wanted to use Spark SQL instead? This is how Spark SQL looks — much like the sketch above. You point it at the HDFS location you want to query, use Spark to create a session, and then run your SQL. The interesting thing is that you have a declarative API: it's nice to have a programming language where you can do group-bys and other SQL-style queries, but through an API. And there are interesting optimizations in Spark as well.

So let's talk a little bit about Spark — I know I've mentioned it a couple of times during this talk, and I want to dive a little deeper. Spark is a distributed, parallel computing framework: you're distributing work across a cluster. Our team packaged Spark in Docker containers, and we run it in Kubernetes pods within OpenShift — you're going to see that in the demo.

Let's look at Spark at a high level — what are its building blocks? We have Spark SQL, where you can query data, whether it's in HDFS or in S3. You have Spark MLlib, where you can do machine learning. You have GraphX, where you can do graph processing. And then you have Spark Streaming, where you can stream data in. Now, there's the old way to do streaming, which is plain Spark Streaming, and there's the new way, which is Spark Structured Streaming — and that's what I'm going to be demoing today.

You can run Spark standalone, on YARN, or on Mesos. The way we run it within Kubernetes pods is standalone, and you can have more than one Spark cluster if you want: a job can run with its own Spark cluster associated with it, and when the job is finished, that cluster is torn down. Another job gets a separate cluster, so you never interrupt anybody else while they're doing their work. You get multi-tenancy by default because you're running within OpenShift, and OpenShift has namespaces — you can have different projects that never conflict with each other. It's not one shared cluster that everybody shares; you can have your own independent cluster.

And as I mentioned before, data formats are very important. The interesting thing is Spark has APIs to write Parquet, Avro, JSON, or CSV, and a lot of data access options as well.

So at a high level, what does the architecture look like? We have a master, and then we have one or more workers, and those workers have JVM processes called executors, which execute the work. What we have is a Java, Python, Scala, or R program that we submit to this cluster — and when we submit that program to the cluster, we call it the driver, okay?
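Concretely, submitting a driver program looks something like this — a generic spark-submit sketch, where the master URL, package version, and file name are illustrative rather than the demo's exact values:

```sh
# Submit a PySpark job to a standalone Spark master, pulling in the
# Kafka source for Structured Streaming as a package dependency.
spark-submit \
  --master spark://spark-master:7077 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
  my_streaming_job.py
```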
It gets submitted to the cluster and scheduled to run on those workers' executors, and the job runs in parallel. And for radanalytics, we're running these in separate pods.

So let's talk a little bit about the demo I'm going to be showing today. It's an order processing application. You have a Hadoop cluster, and you have a web service that accepts orders. When an order comes in, we do two things: we store it in a Cassandra database, and we send a message to Kafka. And as soon as the message arrives in Kafka, Spark Structured Streaming picks it up, does some processing, turns it into a DataFrame, turns that into a Parquet file, and stores it in a directory. And as data streams in, it's continuously updating.

This is more of a visual of what's going on here. As I mentioned: an order event happens, it goes to Kafka, Kafka sends the event to Spark, Spark picks it up, and Spark Streaming stores it as Parquet in HDFS. And while this pipeline is streaming data continuously, I can go into a Jupyter notebook and query the data: how many orders do I have for item number one? How many for item number two? How many orders from a particular geography? I can do intelligent queries while the data is streaming — I don't have to wait for some job to run, I don't have to run any cron job. I can just query it right away.

So I guess it's demo time. But before we go to the demo, let's look at the code a little bit, okay? Actually, we'll go to the demo and discuss the code after. This is the Camel web service that I have. The interesting thing with Camel is you can actually set up a web service using Camel's API: whenever a JSON message gets posted here, cast it to this data type and send it to this bean here, the order service. And that order service — let's go take a look at it — does two things: it creates an event and sends it to a Kafka topic, and it also saves the order into Cassandra. Add order, okay?

So now let's see what that looks like. Give me a moment here. We're going to start off over here. Currently, I've already deployed the Camel web service that I showed you. It's running in a container within a Kubernetes pod on OpenShift — and this is the free OpenShift, running OpenShift Origin. So I have that pod running.

Then I wrote a Python program — a Spark Python program. And what I want OpenShift to do is take my source code; I don't want to care about creating a Docker container or any of these other things. All I care about is that OpenShift goes and gets the Python source code, converts it into a Docker image — a container — and stores it in OpenShift's internal container registry. And once the registry has that image and the build is done, I want OpenShift to go ahead and submit it to a Spark cluster once the job is ready — and it's going to create that Spark cluster as well.
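The Python program in question is, at its core, a Spark Structured Streaming job of roughly this shape — a minimal sketch, where the hosts, topic name, schema, and paths are stand-ins rather than the exact demo code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, IntegerType

spark = SparkSession.builder.appName("order-stream").getOrCreate()

# The assumed shape of the order events arriving as JSON on the topic.
schema = (StructType()
          .add("order_id", StringType())
          .add("product_id", IntegerType())
          .add("quantity", IntegerType()))

# Read the stream of order events from Kafka.
orders = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "orders")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("o"))
          .select("o.*"))

# Continuously write the parsed events to Parquet in HDFS; the
# checkpoint directory records Kafka offsets so a restart never
# goes back in time and reprocesses the same message.
query = (orders.writeStream
         .format("parquet")
         .option("path", "hdfs://namenode:8020/orders")
         .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/orders")
         .start())

query.awaitTermination()
```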
We'll see this in the video. So this is the build happening right now: it's building from the source code and then creating a container for it. This is pretty fast — it builds quickly — and we can take a look at the logs here. The thing with OpenShift is it tells you exactly what's going on: it's building the layers, and then it's pushed the image to the registry.

And now you have a cluster here. These two at the top are the cluster — this is a one-node cluster, so you have one master and one worker here. And alongside that you have your streaming app and your build — everything getting built. Now we're looking at the logs, and the worker is asking: where is the master? It's trying to connect to the master. Once it connects, the job gets submitted. As you can see, this job is connecting to a particular Hadoop cluster — I specify in the configuration which Hadoop cluster to connect to, and also the packages being used — and it goes ahead and submits the job to the cluster. So for this Python application I wrote, I didn't write any Dockerfile; I just submitted it to this cluster and let OpenShift do the heavy lifting for me. This process is pretty quick, as you can see. So now it's connected to HDFS — this is HDFS here.

One interesting and important point I want to mention right now: if you have a stream running, you want to make sure it's fault tolerant. One useful technique there is watermarking, and Spark Structured Streaming lets you do watermarking. Because when you get a message, there's an offset in Kafka, and you want to make sure you never go back in time and process the same message again — imagine a customer orders a pair of shoes and you charge them twice; not good for business. So here, once I set this up, it creates the files. This also does checkpointing, and I'll show you the checkpointing code in a few minutes.

So this is a Jupyter notebook here. I'm just taking little snippets of the parts I'll need, and I'll start up a new notebook where I'm going to do ad hoc queries as data streams in. The data is streaming in continuously, and at any point, at will, I can query and ask: how many orders do I have? Which orders? Give me the ID numbers for those orders. At will — I don't have to wait for a cron job; it's streaming. So I'm setting up the notebook, and this snippet here connects to the HDFS node, to the particular directory where all the Parquet files are being stored, and then runs a query. Once I'm connected here, I get a pandas DataFrame — pandas has its own DataFrames, different from the Spark ones we've been talking about — and it's nice to use pandas in notebooks because you can get printouts and counts and do interesting things within Python.

So over here I have a Python client that I'm using to simulate a customer order. This script basically picks a random customer, picks a random product, and issues an order for that product. And on the screen back there — the one I just opened up — we're listening for all messages coming into Kafka, and it's going to print out the JSON of each message within that screen. Now, over here in this demo, the IP address actually needs to be fixed.
So we'll skip ahead a little bit, and as you can see, I correct the IP address there. And as you can see in the background, the order just came through — so now there should be one order. As you can see, we have one order, grouped by product ID: one product. Then I run the client again to send another order, and then I run a watch that keeps sending orders at a fixed interval. And I keep re-running the Jupyter notebook, and as you can see, this table is getting updated: now we have five orders, six orders, ten orders. And this is streaming — I'm not running any program other than this Spark Structured Streaming job.

And those are the files, actually — let's take a look here. This is where the files are getting stored, compressed, in HDFS. You can download one of these files and use parquet-tools to print out its contents if you like. Since Spark is pointed at the folder there, it queries all the files that are in there — it's not reading individual files, because of how Spark works.

So we're going to take a quick look at one more thing, which is how you do checkpointing, which is pretty important. You know what — we'll just look at it on GitHub; it's a lot easier to read on the screen. And this is the code. Basically, I create a SparkSession, I connect to Kafka — to the topic I want — I take the JSON objects coming in and turn them into a DataFrame, and then I write that stream to an HDFS folder and set a checkpoint at a particular location. That checkpoint location holds all the metadata, like the offsets and various other things, which are very important. It's the same shape as the sketch I showed earlier.

So I've just got two or three minutes left, and I'm going to quickly take questions if anybody has any. So, in Python Jupyter notebooks you can use pandas to do different things — for example, some folks use pandas with matplotlib and create graphs and charts and so on within the notebook.

Let me get back to the slides — I just want to recap what we've learned. We've learned about ETL and using Apache Camel for processing data. We've learned about data formats like Avro and Parquet, and compared them with other formats. We've learned about integrating Kafka with Spark Structured Streaming, and that we can do ad hoc queries on data that's being streamed live. And we've learned about my project, radanalytics.io, which is an open source community project — I highly encourage anybody to take a look; we have tutorials on there if you want to try running Spark on OpenShift. And we've learned how to deploy a Spark application on OpenShift and how to run a streaming app.

And that's all I have. If anybody has any more questions, I'm glad to take them. Also, I'll have office hours at the booth right after this talk — I'm going to go to the booth, so if anybody wants to stop by and take a look at this, I'm more than happy to go through the demo again, and to talk about the code as well if you want to dive a little deeper. Thank you all for your time. Oh, and this is the link to the slides if you like. It should be good. Let's see.
No, actually, it's not this link — the slides are linked from the conference page; just click that and the slides will be there. This link points to an older deck. And actually, we do have Jupyter Notebook running in Docker, and I can give you links to that Docker image. That image is pretty good — we even have it running with TensorFlow and Spark and everything, so all the heavy lifting is already done for you; you just need to use that image. I'll provide a link. If you want to talk to me after the talk, I'll be heading downstairs to the booth, or we can talk here or anywhere. All right — thanks, everybody.