All right, so welcome back everyone. This is session room number four, and we have Jakub here with us. Jakub Scholz is a principal software engineer at Red Hat, and the topic will be building your own social media analytics with Apache Kafka. Okay Jakub, the floor is yours.

Thanks for the introduction, and welcome everyone to my talk about building social media analytics with Apache Kafka. You already heard most of the introduction. What I really do as my main work is work with Kafka and with the project called Strimzi, which is about running Apache Kafka on Kubernetes, and that's really how I came up with this talk, because these are the things I spend a lot of time with.

What I would really love you to take away from this talk is that you can do quite a lot of nice and cool stuff with Apache Kafka and something like Twitter. But I also want you to take away that Apache Kafka is more than just the messaging broker which probably everyone has heard about and knows, and that it makes it really easy to build fairly complex applications without too much code and without too much effort. But don't worry, there will be some code as well.

What's really important to understand is that Apache Kafka is a fairly big ecosystem of different tools and parts, and some of these are directly in the Apache Kafka project itself. That's of course the Kafka brokers, but it's also the Connect API for integration with other systems. There is a tool called MirrorMaker for mirroring data between Kafka clusters. There are the Java clients for consuming and producing messages from Java, there's the Streams API for stream processing, and there are some other, smaller components as well.

But where it really shines is that there's a huge amount of different tools and components outside the Apache Kafka project itself. That's for example all the different connectors for the Kafka Connect framework, but it's also things like schema registries, and of course clients for other programming languages. There are operators for running it on Kubernetes, such as the Strimzi project I work on. There are all kinds of tools for stream processing, ETL, or artificial intelligence and machine learning which integrate very well with Kafka. There are different UIs for monitoring or managing the different Kafka components. So there's really a huge amount of things to choose from, and just by reusing them and composing your applications from them, you can do quite a lot.

There's also this idea that Kafka isn't really messaging but something called an event streaming platform. To be honest, "event streaming platform" sounds like a bunch of random buzzwords put together, and if you Google it there will probably be many different definitions. But one definition which I like quite a lot is that something which calls itself an event streaming platform should handle several different capabilities.
It should be able to import events from some other systems or platforms. It should be able to store these events and then distribute them to the applications which are interested in them. It should be able to process these events, which can mean all kinds of different things, from transformations and enrichment to taking actions based on the events, and so on. And at the end you quite often want to export the events, or the results of processing them, to some other systems again.

And Kafka does all of this really well. The import and export part is what the Connect API handles. The Kafka brokers are what handles the storage and the distribution of the events, or the messages. And then the Streams API is what you can use for processing the events, together with all kinds of different clients, of course. So yes, Kafka definitely matches this definition of an event streaming platform.

What we will really do in this talk, and in the demos I will show later, is start with Twitter. We will basically be reading the events from Twitter, and in this case the events will be the tweets published by different people. We will use the Kafka Connect framework together with the Apache Camel connectors to get the tweets from Twitter itself into our Kafka brokers, as messages in some topic, and then we will use the Streams API to do different kinds of analytics and processing on them. And just for fun, but also because that's really what I spend most of my time on, we will do all of this on top of Kubernetes.

So let's have a look in a bit more detail at the different parts of this, and let's start with Twitter. If you use Twitter, then in most cases you probably use it from some browser application or on your smartphone or iPad. But of course there is an API in the background which allows you to talk with the Twitter servers and ask for the tweets on your timeline, the retweets and likes, direct messages, or to search for some keywords. There is a version of this API which is available for free; it has all kinds of rate limits on how often you can use the different APIs. The free version is what I use in this demo, and at the end I will also share the link to the GitHub repository with all the source code. So if you want to try what I'm doing here, you can really just use a free developer account for the Twitter APIs and repeat everything; you don't need any special kind of access to Twitter to repeat these demos at home.

Then there's Kafka Connect. That's part of the Apache Kafka project, but it's a standalone component. People sometimes think that Kafka Connect and the connectors run as part of the Kafka brokers, but that's not the case.
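As a small aside, this separation is visible in how you start Connect. A minimal sketch, using the launch script and sample properties file that ship with the Apache Kafka distribution:

```sh
# A Connect worker runs in its own JVM, started separately from the brokers.
# The sample properties file sets bootstrap.servers, group.id and the topics
# Connect uses to store connector configurations, offsets and status; the
# worker then talks to the brokers over the regular Kafka protocol.
bin/connect-distributed.sh config/connect-distributed.properties
```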
It's really a separate process, a separate Java virtual machine, which runs the Kafka Connect application. The Connect API is what's used to get data from other systems into Kafka, or the other way around, from Kafka into other systems. The Apache Kafka project itself contains just some very simple, more or less example connectors; what it really works with are connector plugins developed by third-party developers and companies, which provide integrations with different systems, platforms, and applications. So if you want to integrate with something fairly common which others are using as well, it's quite likely there is already a connector plugin for it. You of course also have the option to write a custom connector, either for your own systems or for something nobody has covered yet. The connectors are always distinguished into source and sink connectors: source connectors are used to get data from the outside into Kafka, and sink connectors get data from Kafka out to some other system.

In my particular examples and demos I will use the connectors which are part of the Apache Camel project. I guess everyone has probably heard about Apache Camel. It's one of the biggest Apache Software Foundation projects, with several hundred different integrations, so ways to connect to some system or platform and exchange data with it. You can use it in many different ways; I think there were some other talks here at DevConf about something called Camel K, for example, which is more about serverless and using Camel directly. But I will use it in the form of connectors for Kafka Connect, and in particular I will use three different connectors which leverage the Twitter API: the timeline, search, and direct messages connectors.

Then the Kafka brokers will be in the demo as well. They are the central part of every Kafka architecture, and they are responsible for distributing the messages from producers to consumers, while basically decoupling them. The messages are also stored in the Kafka brokers, and if you want, you can really store them there for a very long time. It's not like they are stored for a few seconds and then passed to the consumer; they can be stored for years if you want. The brokers are also what allows architectures based on Apache Kafka to provide high availability and scalability.
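With Strimzi, which we will get to later, this long-term retention is just topic configuration. A minimal, hypothetical KafkaTopic resource which keeps messages indefinitely might look like this (the topic name and cluster label are illustrative):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: twitter-timeline
  labels:
    strimzi.io/cluster: my-cluster   # which Kafka cluster the topic belongs to
spec:
  partitions: 3
  replicas: 3
  config:
    retention.ms: -1                 # a negative value keeps messages forever
```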
What's also good to understand is that while I talk here about the Connect part, and I will talk a bit more about the Streams API, and while these seem like separate components, inside them there are always Kafka consumers and producers. So when they connect to the Kafka brokers, they don't have some special connection or some special protocol; they are really just the same as any other clients, and they use the same protocol and the same technology.

Once we have the tweets, the events, in the Kafka brokers, we can do the processing, and what I will use for that is the Streams API, which is really just a library you can include into your own Java applications. It's not some complicated framework with workers and jobs and scheduling of the jobs to the workers for the processing. It's really just a JAR which you add to your application, and you can use it from there. So it doesn't matter what kind of framework you are using, or whether you just write your own main function directly; you can always use the Streams API.

And despite being this simple, it has quite a lot of functionality. It can do all kinds of stateless and stateful operations: transformations, aggregations, joins, and windowing, where you process only certain windows of the data. The whole thing is also scalable. In the demos here I will mostly be working just with my Twitter timeline, which isn't that big; it's not really a million messages per second. But everything I will be showing really is scalable, and if you had the use case for it, and of course the hardware, you can really scale it: you can run it in many instances in parallel, in many replicas, and get quite big performance out of it.

So, as I said, you can use the Streams API with any kind of framework, or however you want, in Java. I in particular will use it with something called Quarkus, which is a framework, or a Java stack if you want, designed for cloud-native deployment. During its design a lot of thought was put into how to make it start up very quickly and have a small memory footprint, and it supports things like native compilation, so you get just one native executable instead of these lib directories with tens or hundreds of different JARs. That's all super nice, and it makes it quite easy to run these things on Kubernetes, which is why I'm using it. It also has great support for the Kafka consumers and producers and for the Streams API, so of course I leverage that as well. But everything I will be showing later in the source code can really be reproduced just with the Streams API itself; you do not need Quarkus for that. So if you use something else, you can still do the same.
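To make that concrete, here's a minimal sketch of using Kafka Streams as a plain library from an ordinary main() method, with no framework at all. The topic names are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class UppercaseApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // A trivial stateless topology: read, transform, write back.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(value -> value.toUpperCase())
               .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```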
So if you don't use that and use something else you can still do the same and Then in one of the demos I will also do a bit of a machine learning and for that I will use this deep Java Library, which is a Java project which builds on top of some other projects such as PyTorch Apache mx.net or TensorFlow and it can be used for for machine learning and deep learning So you can really use it for things like image classification object detection sentiment analysis And to be honest, I'm not the biggest expert on the things like machine learning and artificial intelligence and so on But if you want to start with it, if you want to at least play with it a bit this is a great project which has nice examples and it has kind of this library of pre-trained models which you can easily use and Yeah, that's why I'm using it in this talk as well And then of course to run the whole thing on Kubernetes I will use the Strimsy project which provides the operators for running Apache Kafka on Kubernetes and it tries to make it super easy to deploy Kafka on on cube and it supports all kind of components including Kafka connect and connectors and And all the things which we will be doing today Okay, so That's more or less the theoretical introduction Into the different things I will be doing and different projects I will be using now before we actually do something we will need to deploy kind of the backbone We will need to deploy the Kafka cluster with the brokers and and the Kafka connect so let's have a quick look at how you can do that and To be honest, I have it already deployed here and running and that's not necessarily because I Will be afraid that it would not work in a live demo or it would take long time It I'm quite confident that it would work and that it would be fairly quick But I have it already deployed for around a month I think since late Christmas or beginning of January so that it can collect all the different tweets from my timeline During all this time and we have a bit more data to analyze because if I would just deploy it right now Then we would need to rely on my timeline actually getting some tweets Now in the few minutes during the demos So that's why it's already running and you can see that I have here three different pots on my Kubernetes cluster with the with the Kafka brokers. I have a zookeeper here. I have The connect cluster deployed here as well, and I have some other tools to help me manage and monitor the whole thing, but also I have it already deployed. I will just quickly show you how you can Deploy it. If you already use some operators, it's probably not completely new to you But you use this kind Kafka custom resource which works kind of as the blueprint for your Kafka deployment and you describe there exactly how the Kafka cluster should look like so I can for example say here Okay, I want to have three replicas of my Kafka brokers you can specify the resources which You want to have in your cluster as you can see in my case it's really running on my home cluster and it's not not too big deployment, but Yeah, you can use the same Projects and tools to run something on dedicated hardware with Hundreds gigabytes of memory and so on if you want. I also configure some some JVM options What's really great is that I can also configure the listeners so the ports where the clients will connect including things such as authentication and I have your authorization as well and What else do we have here? We have the storage configuration. 
We also have the metrics configuration; there's built-in support for Prometheus metrics for monitoring. And then we do pretty much the same for ZooKeeper as well. So basically you create a resource like this, and then you just do kubectl apply on it, and the Strimzi operator will take care of the rest, deploy all the clusters, and configure everything.

Very similarly, we also deploy the Kafka Connect cluster. As I said, inside the Kafka Connect cluster it's really just a consumer and a producer, so I create this KafkaUser, which is really used to authenticate, and I give it some ACLs, some access rights, which it needs to use the right topics for storing its configuration, but also to publish the tweets to these topics, read from them, and so on.

Then there is a part which I have commented out here. As I said, the Apache Kafka project itself doesn't have that many connectors, so you need to add the connectors which you want to use into the deployment. What Strimzi does for me there is build a brand new container image, and it will add the connectors which I tell it I want to use. That's why for the actual deployment you would need to specify some credentials for the registry where you want the container image pushed, and that's commented out so that I just don't share the credentials. Similarly, we will also need a Secret with the credentials for the Twitter API, because that requires authentication, so the connectors will need to use it to talk with the Twitter API.

And then we have just the Kafka Connect deployment itself, which again follows the same principle: how many replicas it has, what the resources are, the configuration; here we configure how to connect to the Kafka cluster, the authentication, and so on. Again, you would do kubectl apply on this to deploy it. And when it's running, you can also do something like kubectl get kafkaconnect -o yaml, and you can see that it's already deployed and that it has the connectors which I want to use added there: the Twitter direct message connector, the Twitter search connector, and the Twitter timeline connector. Actually, I skipped that in the YAML, but in the KafkaConnect resource there's this build part where I specify all the different connectors which I want added to the deployment, so Twitter search, Twitter timeline, and Twitter direct messages, and they will all be added there automatically.

So that's the Kafka cluster and the Connect cluster, and we have them already up and running, so now we can move to the actual demos, which will be a bit more about code and will show something a bit more interesting. The first one is what I call the timeline word cloud demo. What we will do is deploy a connector into Kafka Connect which will read the tweets from my timeline, so the tweets or retweets from the accounts which I'm following, in case you don't use Twitter. And then we will use the Kafka Streams API to analyze the tweets and try to find out what topics I am most interested in, and what hashtags or Twitter handles are mentioned most commonly in my timeline. So let's first check how one would deploy the connector; a hypothetical sketch of the two resources involved is shown below.
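Before the real thing, here's roughly what the two pieces look like: the KafkaConnect resource with its build section, which bakes the Camel connector plugins into a new image, and a KafkaConnector resource configuring one connector instance. The class name, plugin URL, and option names are quoted from memory and should be checked against the Strimzi and Camel Kafka Connector documentation:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect-cluster
spec:
  replicas: 1
  bootstrapServers: my-cluster-kafka-bootstrap:9093
  build:
    output:
      type: docker
      image: registry.example.com/demo/connect-twitter:latest  # where the built image is pushed
      pushSecret: registry-credentials                         # the registry credentials mentioned above
    plugins:
      - name: camel-twitter-timeline
        artifacts:
          - type: tgz
            # illustrative: the real package URL comes from Maven Central
            url: https://repo1.maven.org/maven2/.../camel-twitter-timeline-kafka-connector-<version>-package.tar.gz
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: twitter-timeline
  labels:
    strimzi.io/cluster: my-connect-cluster   # ties the connector to the Connect cluster above
spec:
  class: CamelTwittertimelineSourceConnector # the full class name comes from the Camel docs
  tasksMax: 1
  config:
    topics: twitter-timeline                 # the destination topic for this source connector
    # connector-specific options (timeline type, JSON output format, Twitter API
    # credentials mapped from a Secret into environment variables) go here
```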
Again, I have it already deployed and running, but this is the kind of YAML which would configure the connector. First we have to create a Kafka topic where the connector will send the messages, and then we create the KafkaConnector resource, which actually configures the connector itself. You can see we say: okay, the class which we use for this connector should be this Camel Twitter timeline source connector; if you remember, a source connector gets the data from Twitter into our Kafka broker. Then we also have to specify where it finds the credentials for the Twitter API, which will be in environment variables mapped from the Secret, and we also tell it to send the messages as JSON. And when I do kubectl get kafkaconnector twitter-timeline, you can see that it's ready and already running. When I quickly switch to the browser and check my Grafana, and check the Kafka Exporter, and find the timeline topic, you can see that it has already almost 5,000 messages; that's the offset here, around 4,843 messages right now. So that's the number of tweets which it has collected since I deployed it for the first time.

Then, when I switch to my IntelliJ, we can have a look at the Streams API application which we'll use to process these. Let me zoom this up a bit in presentation mode. So this is the POM file for my application which is doing the data processing. As I said, it's using this Quarkus framework, and when I scroll down, it also loads the Streams API dependency. But as I said, if you used some other framework, or just pure Java without any framework, the code would look pretty much the same; it's really just Kafka Streams I'm using here.

Most of the code is actually in this buildTopology method. This is the method where I create the Kafka Streams topology, where I basically tell the Kafka Streams API how it should read and process the data. First I need to define something called a Serde. Serde stands for serializer/deserializer, and it's a class or object which the Kafka Streams API uses when it's reading the messages from Kafka, to deserialize them, and when it's sending them back to Kafka, to serialize them. So this tweet Serde which I have prepared here is basically what helps the Kafka Streams API take the JSON it's receiving and decode it directly into Java objects, so that I can easily use them in the application.

Then, in this case, what I will be doing is counting some data, and I want to store these data and query them later, so I also prepare these store suppliers, which create Kafka Streams stores. They will do some local caching in the pod where I will be running this application, but all the data which I will use there, the aggregation results, will also be stored inside Kafka. So when the application restarts, because of some upgrade or for whatever reason, it is basically able to reload the previous state from the Kafka cluster, from a special topic, and then continue the processing.

Then I use the actual StreamsBuilder to build the stream processing pipeline. I always start with a stream, which basically says: okay, now start reading the data from this topic, and use these Serdes to encode or decode it. A hypothetical sketch of such a Serde is shown below.
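Since the tweet Serde came up a couple of times, here's a hypothetical sketch of what such a JSON Serde could look like when built with Jackson; the Tweet class is a stand-in for the demo's own type:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;

public class TweetSerde {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static Serde<Tweet> create() {
        return Serdes.serdeFrom(
                (topic, tweet) -> {            // serializer: Tweet -> JSON bytes
                    try {
                        return MAPPER.writeValueAsBytes(tweet);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                },
                (topic, bytes) -> {            // deserializer: JSON bytes -> Tweet
                    try {
                        return MAPPER.readValue(bytes, Tweet.class);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
    }
}
```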
So this is the first one; that's the Serde for the message key, which we don't really care about, and this one is for the message payload, which is where the actual tweet will be. Then we basically get a stream of the tweets coming from the topic, and we can do the processing using the Streams API DSL.

So what we for example do here, in this flatMapValues, is take the tweets we receive and extract the text from them, because it's really the text we are interested in here; it's not about all the metadata, like what application was used to send the tweet or what place it was sent from. So we extract the text here. And then, to find out what the tweets are really about, I basically split the text into words: I use a regular expression to split it into the individual words, and then I do some filtering. I'm not interested in words which are too short, I'm not interested in any URLs, and I have a list of words on an ignore list, the common English words like "this", "that", "then", and so on, which are in many tweets but don't actually mean anything on their own, so I'm ignoring those. And then I basically group all the words which are in the tweets, and I count them. As a result, this will basically tell me that in the almost 5,000 tweets in my timeline, the word Kubernetes appeared, I don't know, 200 times, for example.

And I do this counting twice. Once I'm interested in how many times the words appeared over the whole time it's running, what the topics really are. But I'm also using the windowing I mentioned before, to basically say: let's not care about what happened a month ago, let's look at what was happening in the last hour. So this will look at one-hour windows, and we move the window every minute. This will do the counting not since I started the application, but only for the last hour, and that will tell me what the main events happening during the last hour were. And then I really just build the topology, and it will start reading the messages and doing all the processing.

And so that I can actually see these data and look into the results, I have here a simple REST endpoint which can give me two basic things. I can ask it for the all-time top 20, 30, or 50 words, and it will basically give me the words from the top of the table which are most common in my timeline. And similarly, I have another one for the latest words; that's for the windowed counting, so the most common words in my Twitter timeline in the last hour. And if you check what these interactive queries are doing: that's actually the stores; if you remember the store suppliers we created at the beginning, it's using these stores and basically querying the counting results, and it tells me what the current results are and which words are the most common. A condensed sketch of this whole counting topology is shown below.
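Here's the promised condensed sketch of that counting topology. It assumes the tweetSerde and a Tweet class with a getText() method from earlier; the topic, store names, and ignore list are stand-ins, and the windowing calls use the method names from recent Kafka versions:

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Set;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.WindowStore;

Set<String> ignored = Set.of("this", "that", "then", "with", "from");  // stand-in ignore list

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> words = builder
        .stream("twitter-timeline", Consumed.with(Serdes.String(), tweetSerde))
        .flatMapValues(tweet -> Arrays.asList(tweet.getText().split("\\s+")))  // tweet -> words
        .filter((key, word) -> word.length() > 3
                && !word.startsWith("http")                // drop URLs
                && !ignored.contains(word.toLowerCase()));

// All-time counts, kept in a store that the REST endpoint queries interactively.
words.groupBy((key, word) -> word, Grouped.with(Serdes.String(), Serdes.String()))
     .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("words-all-time"));

// One-hour windows advancing every minute, for the "last hour" view.
words.groupBy((key, word) -> word, Grouped.with(Serdes.String(), Serdes.String()))
     .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1))
                            .advanceBy(Duration.ofMinutes(1)))
     .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("words-last-hour"));
```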
When I get out of IntelliJ and switch to my browser, this is actually the result. And if you know me a bit, you know that I'm a football fan, so this AVFC and Villa, that's the Aston Villa football team which I support, so there are obviously a lot of tweets about that in my timeline. But you can see here also Kubernetes, for example, and there's cloud. Well, the new year 2022 was probably quite big at the beginning of January, so that's there as well. And I can also check just the latest words, which show what was there in the last hour, and you can see that gives us different results.

And what I have here as well is this tag cloud. The words maybe already show quite a lot about what I'm interested in, but maybe things like the hashtags or the mentions will show a bit more. So we can see again that the Aston Villa football club has the top place here, but there's Kubernetes, and I guess somewhere there might be DevConf as well; I don't see it right now, but it was there yesterday when I was doing the rehearsal. So you can see that this way you can easily take and visualize your Twitter timeline and easily see what's going on in it. I kind of know what I'm following, of course, but I can really say that the cloud of words does a good job of representing what the tweets I follow are about and what I'm interested in.

So let's move to the next demo and bring it to another level. Until now we were just getting the tweets and analyzing them, and it might be interesting to see that, but at the end of the day I'm probably the only one interested in what's in my timeline, and I already know that. So let's see if we can do something a bit more practical.

Imagine the situation that you have some open source project, or maybe a company, or your personal account, and people might be tweeting about you, and you might be interested in that. Maybe if someone tweets something nice about your project, you can come back and retweet it, or say thanks for the kind words. And maybe if someone tweets something bad about your project or your company, that they were unhappy, you can get back to them and try to find out what they didn't like and what should be improved.

That's what we will do here in this second demo. We will use a connector which will search for some keywords on the whole of Twitter, and we will again use the Kafka Streams API to analyze these tweets. But we will now employ machine learning, with this Deep Java Library, and we will use sentiment analysis, which will tell us whether these tweets are positive or negative, and how positive or how negative they are. And when it identifies some tweets which are really strongly positive or strongly negative, it will send us a direct message on my Twitter and tell me: hey, look, there is this tweet that's quite negative about your project, maybe you want to have a look at it. A small, hypothetical sketch of such a search connector is shown below.
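The search connector resource looks almost the same as the timeline one from the first demo, just with a different connector class and a keywords option. Again, the class and option names are quoted from memory, so check the Camel Kafka Connector documentation:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: twitter-search
  labels:
    strimzi.io/cluster: my-connect-cluster
spec:
  class: CamelTwittersearchSourceConnector   # the full class name comes from the Camel docs
  tasksMax: 1
  config:
    topics: twitter-search                   # destination topic for the found tweets
    camel.source.path.keywords: "#byosma"    # the keywords to search for (option name from memory)
    # Twitter API credentials again come from a Secret via environment variables
```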
If you want to take part in this, you can send some tweets with the hashtag #byosma, as in "build your own social media analytics". If you use this hashtag, it should match the search and reach my application running here, and we might see it pop up in the demo if it's sufficiently positive or negative. So if you want to, you can give it a try.

I'm going to switch again to the command line, and again, I have it all deployed, but just to quickly show it: I have this search connector, which is searching for the hashtag which I mentioned, but this time I also have a sink connector, and the sink connector is the Twitter direct messages sink connector. That's what will be reading the data from a Kafka topic and sending them as Twitter direct messages to my account. And don't get confused: this weird number is actually the user handle for the Twitter account I'm using for development and for demos, so that's the actual Twitter account. Then we have again the credentials, and we have the topic which it is consuming; it will read from this topic and send me the direct messages as alerts.

We can also have a look at the actual deployment. I didn't show that for the previous demo, but the actual application is always running as a pod inside the Kubernetes cluster as well. So I again create the KafkaUser, and then a regular Deployment where I just point it to the container image with the Quarkus application, and I configure where to connect, and again the user's authentication, encryption, and so on. And that's really just running as a Deployment in my Kubernetes cluster. So this is the sentiment analysis one; the previous ones were the tag and word clouds.

We can also check the source code of the sentiment analysis application. If I go to the POM file, what you can see here is that I again have all the Quarkus libraries, but I also include these ai.djl dependencies; that's the Deep Java Library. And I include the model zoo, which contains the pre-trained models, so I didn't really train the sentiment analysis model on my own; I'm just using one of the examples.

And when I go to the topology, it should be really familiar to you. I again create the tweet Serde for the decoding, and then what's new here is this part where I load the deep learning model which does the sentiment analysis, which does the prediction. And then I really build the streams application: I again read from the Kafka topic, then I filter out the tweets which I'm interested in, I take the text of the tweet, and then I run this predict method of the predictor. That's what does the sentiment analysis, and it returns me this classifications object, which basically says how the tweet was classified, whether it was positive or whether it was negative, and it tells me the probability, so how positive or how negative it is. And what I'm really doing here is: if the probability is more than 90 percent, I will basically take the URL of the tweet which was analyzed, and I will prepare an additional message which says "the following tweet was classified as ..."; this will be replaced with positive or negative, and I give you the percentage as well. And I will send it to another Kafka topic, and this is the alerts topic, which the sink connector is reading and sending to the direct messages.
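A condensed, hypothetical sketch of that classification step, following the patterns from DJL's model zoo examples; the Tweet type, topic names, and the exact model selection are stand-ins:

```java
import ai.djl.Application;
import ai.djl.inference.Predictor;
import ai.djl.modality.Classifications;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

// Load a pre-trained sentiment analysis model from the DJL model zoo.
Criteria<String, Classifications> criteria = Criteria.builder()
        .optApplication(Application.NLP.SENTIMENT_ANALYSIS)
        .setTypes(String.class, Classifications.class)
        .build();
ZooModel<String, Classifications> model = criteria.loadModel();
Predictor<String, Classifications> predictor = model.newPredictor();
// Note: a Predictor is not thread-safe; a real app would use one per stream thread.

builder.stream("twitter-search", Consumed.with(Serdes.String(), tweetSerde))
        .mapValues(tweet -> {
            try {
                // best() gives the top class ("Positive"/"Negative") and its probability
                Classifications.Classification best = predictor.predict(tweet.getText()).best();
                if (best.getProbability() < 0.9) {
                    return null;  // not strongly positive or negative, dropped below
                }
                return String.format("The following tweet was classified as %s with %.2f%% probability: %s",
                        best.getClassName(), best.getProbability() * 100, tweet.getUrl());
            } catch (Exception e) {
                return null;
            }
        })
        .filter((key, alert) -> alert != null)
        .to("twitter-alerts", Produced.with(Serdes.String(), Serdes.String()));  // read by the sink connector
```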
So then I would again just need to compile this, build a container image, and deploy it. Here is my Twitter account, and we can try if it works. Normally I would now type some message about how the weather is nice and sunny and so on, but to be honest, today here in Prague, outside my window, it's really cold and raining, so it's not that great. So let's try some negative message today: "The weather is terrible today. It is freezing cold and raining." And I have to add the hashtag, so build your own social media analytics, and now I can just tweet it. Now the search connector should find this through the Twitter API, the streams application should analyze it, and if everything works well, I should see the information here in my messages. And you can see, this is 30 seconds old, and it tells me: okay, the following tweet was classified as negative with 97.60% probability, and it gives me a link to the tweet, which is the tweet I just sent, "the weather is terrible today, it's freezing cold", and so on. So we can see that it worked. Hopefully, if you use this for your project or for your company, you will be getting only the positive notifications and not the negative ones. But we can see that the sentiment analysis works quite nicely; it identified the tweet and sent me the notification, and I can now tell this guy to stop complaining about the bad weather.

So now, the last demo I have is a little bit different. In all the previous demos we had something which we deployed and which keeps running and running. Now I will try to do something different; I will try to do some ad hoc analysis and just play with the stream of tweets. What I will do is try to confirm some idea, some hypothesis, whether it's true or false. Gunnar Morling, who is one of the authors of the Debezium Kafka connectors for change data capture, which you should check out if you are interested in databases and Kafka, always says that when you tweet something, you should always attach some image or a video, because such tweets get more attention, more retweets, and so on. So let's see if we can confirm whether this is true.

So I have here this ad hoc application. What it really does is read the tweets from my timeline and do this aggregation. In the first demo we used counting, which is an aggregation as well, but it really just does the counting. This is a custom aggregation where we basically get the tweet, check whether it has some image attached or not, count how many retweets the tweets have when they have some image attached and how many retweets they have without one, and then we compute the average. And in this case I'm not going to send the result to Twitter as a direct message, and I'm not going to send it to another Kafka topic; I'm just using this peek method to log it to the command line and see what the result is. A condensed sketch of this aggregation is shown below.

So let's switch to the command line. Because I'm still using Quarkus, it has this nice quarkus:dev command which you can use to run the application in dev mode, where it's automatically recompiled, and you can easily run it and debug it. So I'm using this to run it now. It will take a while to start; it will connect to the Kafka broker, and it will create some helper topics for the aggregations. And you can see it's now running and doing the calculations, and I guess now it's probably finished.
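While it runs, here's the promised condensed sketch of the custom aggregation; the Tweet methods, the topic name, and the tweetSerde and statsSerde helpers are stand-ins:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;

// Running totals for one group ("with-media" / "without-media").
public class Stats {
    public long tweets;
    public long retweets;
    public double average() { return tweets == 0 ? 0.0 : (double) retweets / tweets; }
}

StreamsBuilder builder = new StreamsBuilder();
builder.stream("twitter-timeline", Consumed.with(Serdes.String(), tweetSerde))
       .groupBy((key, tweet) -> tweet.hasMedia() ? "with-media" : "without-media",
                Grouped.with(Serdes.String(), tweetSerde))
       .aggregate(
               Stats::new,                          // initializer for each group
               (group, tweet, stats) -> {           // fold one tweet into the running totals
                   stats.tweets++;
                   stats.retweets += tweet.getRetweetCount();
                   return stats;
               },
               Materialized.with(Serdes.String(), statsSerde))
       .toStream()
       // No output topic needed for an ad hoc experiment: just log the running result.
       .peek((group, stats) -> System.out.printf("%s: %d tweets, %.1f retweets on average%n",
               group, stats.tweets, stats.average()));
```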
So what we can see in the last result is that out of the almost 5,000 tweets in my timeline, it found 817 tweets which had some media attached, and the average number of retweets there was 9.9, so almost 10 retweets if you had some picture or video attached. And then it found almost 1,500 tweets without any media, and the average there was only two retweets. So it looks like when Gunnar says you should attach a picture or something to your tweet to get more attention, it's definitely true. The sample here, 5,000 tweets, might be too small, but it certainly suggests that he's right and you should always do it.

So that's one of the experiments you can do, and you can of course play with these ideas a bit more if you want: you can try to find out what the right time to publish a tweet is, where the people tweeting about you are living, or what apps they are using, and so on. And you can of course use the other social media networks as well.

And that's it for the talk. Here are the links to the different projects which I used during the demos, and more importantly, here's the link to the GitHub repository which has all the source code, the YAML files, and so on. So if you are interested in trying this out and seeing how it works, that's where you can find it. I hope you learned how to do some cool things with Twitter, but also that Kafka makes it really easy to just take the different parts of the ecosystem and put them together into fairly powerful applications. So that's it, and thanks for watching and listening.

Thank you, Jakub, it was a great presentation. We still have some time, but I don't see any questions. So I think if someone is interested in connecting with Jakub, you can try in the Work Adventure. And otherwise, thank you, and that's it.