set up for what I'm going to be talking about, between what Mark talked about and what Evan talked about. We're going to be talking about streaming big data with what's kind of Team Apache: Apache Spark, Apache Cassandra, Apache Kafka, and then I'm going to get into doing that in the context of Scala and Akka applications, all for big data. My name is Helena Edelson. I recently joined DataStax, and I'm a senior software engineer on the analytics team. So breaking down the talk itself, we're going to take a step back and ask why we're talking about this at all: delivering meaning in near real time, and I say that very specifically, at a high velocity. Because Evan already talked about sort of an intro to Spark, I'm just going to cover that in one or two slides and then have a very, very quick intro to Spark Streaming. Then Cassandra, I think I have one slide for Kafka and Akka, and then we're going to get into the more integration-focused part of my talk with the Spark Cassandra Connector, and then finally, which I think is the best part, getting into some of the code. So we're going to walk through a reference application that I've been working on. I presented it at Strata New York recently, but it's sort of a work in progress, and it shows you how to actually build a very full sample of a Scala and Akka application doing both streaming and non-streaming work in Spark: consuming data asynchronously from Kafka, doing your computations across the cluster, across data centers, and writing some of that data into Cassandra, or reading from Cassandra, doing your computations, and so on. Okay. I'm going to do a quick slide on myself and then I have some questions for you guys. So I've been using Scala since 2010. I'm a Spark Cassandra Connector committer. I've also been a committer on Akka, particularly Akka Cluster. So I wrote the original adaptive load balancing router and the metrics API that it runs on, and since then Patrik's done just an incredible job with that. I've been a senior cloud engineer for a long time at VMware and some other companies. So a little question about you. Okay, we already did the survey of who's working with Spark. How many of you are working with Kafka in production? And how many of you are working with Cassandra in production? Excellent. Love it. And I don't have to ask who's working with Scala, which is really great. When I did this at Strata, I asked who was working with Scala, and everyone was a Java developer. Those are not my people. So this is a very basic, just-starting-out slide of code: a very basic, streamlined Spark word count. All it's doing is setting up the Spark config and the Spark context, but you can see that, especially given everything I just talked about, it's incredibly streamlined, and what it's doing is very, very powerful. It's taking sc, which is the Spark context, and just ingesting a text file, and with Spark you can ingest that from S3, from HDFS, from basically any kind of input you want, then doing your Spark computations. Here we're just doing a word count and then writing it out to a file. Very, very simple. But that's like five lines of code, which I think is really sweet. It's very elegant, which I think is one of the reasons why we all like Scala. So, delivering meaning in near real time at a high velocity. Why are we even talking about this? I really like this cartoon.
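(For reference, the word count from that earlier slide looks roughly like this. This is a minimal sketch, not the exact slide code; the local input and output paths are placeholders.)

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount extends App {
      // Spark config and Spark context, as on the slide.
      val conf = new SparkConf().setAppName("word-count").setMaster("local[*]")
      val sc   = new SparkContext(conf)

      sc.textFile("/tmp/input.txt")              // could equally be an S3 or HDFS path
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .saveAsTextFile("/tmp/wordcount-output") // write the (word, count) pairs back out

      sc.stop()
    }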
I think a lot of us feel the pain of this slide. I won't read it because, well, it covers up a curse word, and I'm trying to keep this PG. So in creating highly distributed, fault-tolerant, concurrent, scalable systems, there are some key strategies that have emerged. And I have to apologize because I do need to use my speaker notes; I forget everything when I'm in front of an audience. Some of these primary strategies that I'm sure all of us have had to battle with or resolve are things like partitioning for scale, replication for resiliency, share-nothing architecture, which is something you really have to think about, asynchronous message passing, parallelism, isolation, and location transparency. And I'm mentioning this because I wanted to make a point of why I'm talking about these particular technologies, because of course, as we all know, you can solve this problem many different ways. So what do we need? I think this is a really good question. We need fault tolerance. We need to replay from any point of failure. We need to have no single point of failure, of course, very important. With Cassandra and Akka, this is all built in. And particularly with Spark, the Spark RDDs are built to be fault tolerant; they replay in parallel on another machine. And when there is a failure, Spark doesn't need to replay the entire computation. It is aware of where the failure occurred, and it only has to reconstruct and reread the affected RDDs. It can do that because it has this notion of lineage. We also want a masterless system, which Cassandra and Akka Cluster are. They both use sort of a Dynamo style. Both Cassandra and Akka Cluster use the gossip strategy throughout the node ring; that's how they reach consensus on the status of all the nodes, which nodes are members, which are up or down or unreachable, et cetera. We also need to be able to do things like span racks and data centers. Well, Cassandra has this built in. That's actually one of the things that a lot of NoSQL, highly distributed or scalable databases do not have, but in Cassandra it's built in. Something that Spark, Cassandra, and Akka Cluster all do, which I think Mark was talking about, is this notion of partition awareness through hashing, and most of them use Murmur hash, but I think in some of them you can configure a different hash algorithm if you want. We also need to be doing things like replicating across multiple data centers, providing low latency, and being able to survive regional outages. Both Cassandra and Spark are partition aware, and actually when I get into the Spark Cassandra Connector, I'm going to explain how: because Cassandra is so node aware and partition aware, it feeds that partition information to Spark, and that is how Spark does the partitioning itself. Kafka and Akka have a different notion of partitioning. I think it's on the roadmap for Akka Cluster, where it's going to be by network, and Kafka is a little bit different; it uses a different sort of notion of partitioning. We also need to be network topology aware, which Cassandra is, and of course be able to scale to as many nodes as we need whenever we need to, which Cassandra can do. My one slide about Akka. Who here is using Akka? Excellent. Love it. So Akka fulfills a lot of the requirements that we need for building these kinds of systems successfully. It has location transparency in the actors. It's highly fault tolerant.
What I really like about Akka is that the fault tolerance behavior you want is very, very configurable, and it's very easy to do that too, though it really takes some thought to implement. It has asynchronous message passing built in, based on Erlang, or the concepts behind how Erlang works. It's non-deterministic, shares nothing, and has atomicity within the actors themselves. And here's my one slide on Apache Kafka. It's a high-throughput distributed messaging system. We're primarily going to be looking at Kafka in the code as far as integration with all these systems, so I'm just going to go with this one slide and call it good. As far as thinking about something that we hear a lot about now, Lambda Architecture, what this whole picture really starts to give us is a sense of which technologies we could actually develop with and deploy that could give us the requirements we need to be successful. So here are all of our Spark components, which I'll go into a little bit, running on Cassandra with Apache Kafka, and then being able to use really nice tools, writing our code in Scala and using Akka, makes it all very nice. So this is my version of Apache Spark and Spark Streaming, the drive-by version, because we're going to do this really quickly. So what is Apache Spark? It's a fast compute engine for big data processing. A lot of people have this false sense that it's a new technology, but it's actually been around since 2009; it came out of the Berkeley AMPLab. It's been fully open sourced the entire time. I can't remember when it became a fully fledged Apache project, but it's been several years, I think. It's built primarily in Scala. I think there is some Java, and it sort of made me cringe that they still use Maven, but whatever, people. I actually created a ticket, and I said I would do the work too: can we just use SBT here? And they decided no. It enables low latency with complex analytics. I think Evan mentioned this too: it is the most active open source project in big data on GitHub. And actually they break GitHub fairly often, so they had to write some hook scripts so that it wouldn't bring it down. And I don't know if any of you have ever been on the Spark user list, but I sort of lurk there to answer any questions about using Spark and Cassandra, and I mean, it is the most active user list I've ever seen, with really, really good questions too. It's really interesting. One nice thing is that Spark has sorted the same amount of data three times faster using 10 times fewer machines than Hadoop, and that's hopefully the only time I'll mention Hadoop while I'm up here. Another nice thing, which is pretty cool, I think, is that they have releases every three months, so you're never waiting a huge amount of time for bug fixes or new features. It's a very fast developing project. The API itself is very easy to use. I'm not going to go really into the deployment of it, but basically the Spark applications that you write, you deploy them out to your Spark cluster. Some of the goals that the creators of Spark had when they were developing it were that they wanted something that works well with gigabytes, terabytes, or petabytes of data, is flexible across all of that, and works well with different storage media, RAM, SSDs, and HDDs, for instance.
They wanted it to work well with everything from low-latency streaming jobs to long-running batch jobs, which not many technologies can actually do, and to easily join data sets from disparate sources, which it does very well. So the Spark components: there's the Spark core API, and then built on top of that, leveraging that core API, there's Spark Streaming, which is actually not real time, it's near real time, and I'm going to go into that just a little bit. It's basically micro-batching, but it's very, very easy to configure your batch windows. They also have Spark SQL, and you can actually use that with many different data sources; it's a pretty cool project. MLlib is for machine learning, so it's very, very easy to write your Spark applications and do machine learning, especially coming from Scala, because as you saw in some of Evan's samples, the Scala collections and the Spark MapReduce work are very, very similar. And they also have GraphX. So this is the core ecosystem for Spark, and it integrates really well with some of the higher level tools. This part I think was kind of interesting: why they decided to choose Scala. They looked at many different languages. For one, it offers a functional, concise programming interface, which was one of their requirements. It integrates naturally with the JVM, which they needed in order to integrate very easily with Hadoop environments, because they wanted people to be able to start using Spark and not necessarily have to throw away Hadoop systems, but maybe slowly, over time, move Hadoop out and replace it with Spark. I should mention that right now there are three different cluster managers: from a deployment sense you can use Spark as a standalone cluster deployment, you can run it on Apache Mesos, or you can run it on Hadoop YARN. Oh yeah, this one's the most important, I think: on the JVM, it's the only language they found that was able to capture functions and ship them across the network. And actually, I know that there's someone, I think it's Heather Miller, doing a lot of research on that particular aspect of Scala, working with the Spark team. And there's this URL, a user list post by Matei Zaharia, the original creator of Spark, where he explains why they chose to use Scala in the end. Okay, so Spark Streaming. Comparatively, we're talking about working with zillions of bytes versus gigabytes of data per second. It's an extension of the core API; it enables high-throughput, reliable processing of live data streams, and you can express sophisticated algorithms easily using high-level functions to process those data streams. One important thing to remember is that it does exactly-once message guarantees, which is pretty difficult to do. It integrates with many different types of data sources, so it's extremely flexible. And when you start to look at these APIs, the streaming API, as you'll see, which I'm going to show you in the code, makes it very easy to do some of the common use cases. And I mention this one because, surprisingly, I've been asked this question before: why would you actually use Spark Streaming over the standard batch functionality? The last company I was at was a cybersecurity company, and I was building data pipelines with different sorts of technologies.
I built one API that worked on Hadoop using Scalding, because everything we were doing thus far was Java doing Pig jobs or Python doing Pig jobs. And I don't know if you guys have ever seen that stuff, but the code is just scary. So I was able to write a really nice, streamlined, very, very concise and fully automated pathway to do the data processing that we needed. But then I needed to do some real-time processing. I wanted to get back into using Akka, but still be able to do real-time computation over many different data sources and different data models, and be able to sort of glean particular information from this work. And I started to look at Spark and was just amazed by it. It's very easy to start working with, too. As Evan mentioned, there's a REPL, which is just like working with a Scala REPL. So you can easily do things like site analytics, intrusion detection, fraud detection, things where you might need to have that data right away, versus batch jobs that run daily or hourly, when you need that data and you want to take some kind of real-time action based on it. So what Spark Streaming is primarily built on is two concepts, really. One is the DStream, which is micro-batches. It ingests and processes the data in the DStream lazily. The input data is received during each interval, and the interval is configured by you. It's then stored across the cluster, forming an input data set, or what we call a micro-batch, for that interval. You can do that in fractions of a second, in seconds, or you might want to look at a window of days, whatever you need to do. And from that computation, it then produces a new data set that you can continue to process in the stream and do further computations with. And then the second part of Spark Streaming is the input DStreams, and I'm going to show you what that looks like in code. That's, for example, where I can have an Akka actor, like a Kafka streaming actor, which I'm going to show you, where I want to be able to asynchronously stream data from a Kafka topic or multiple Kafka topics, do some Spark computations from that Kafka stream, and then, for example, write that data to a Cassandra table. And we don't have this yet, but I'm actually adding a Cassandra input DStream to the connector, because right now you can read and write to Cassandra from Spark in the Spark context and the streaming context, but the ability to asynchronously consume data from Cassandra would be really nice. And these are the modules that are in Spark Streaming right now. Although, do I have Kinesis? There's Kinesis. Oh yeah, it is in there. So Flume, Kafka, Twitter, ZeroMQ. And I think someone has a community version for RabbitMQ as well. So this is my one slide on Spark Streaming window operations. Say, for example, you have a use case of working with time series data, and you aren't so interested in the timestamps of the individual elements coming in, but you want to be able to do samplings in, say, five-second windows of time. So this is an example of doing, I don't think this is a word count, but just being able to flatMap over that data, doing the reduceByKeyAndWindow, and then you specify the window duration and the slide, the interval of that. And then you take that data and you can save it to Cassandra. It's just very, very concise. And this is just a sampling of some of the major window operations that you can do with Spark Streaming.
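(A minimal sketch of that kind of windowed operation. The input stream, keyspace, and table names are placeholders, and the Cassandra write uses the connector's saveToCassandra on DStreams.)

    import com.datastax.spark.connector.streaming._        // adds saveToCassandra to DStreams
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations in Spark 1.1
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Count words over a sliding five-second window and write each window's counts
    // to a Cassandra table. `lines` stands in for whatever input DStream you already have.
    def windowedCounts(lines: DStream[String]): Unit =
      lines.flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKeyAndWindow(_ + _, Seconds(5), Seconds(5)) // window duration and slide interval
        .saveToCassandra("my_keyspace", "word_counts_by_window")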
So I think I have a countByValueAndWindow; I'm going to show you groupByKeyAndWindow. And again, you can configure what those windows are in your code very easily. Okay, Cassandra. Ten things about Cassandra. This is again sort of the drive-by version of Cassandra, since it seems like a good number of you are familiar with it already. Basically, it's a project that started at Facebook, based on the original Dynamo paper, to be fully distributed, and also based on Google's Bigtable. It's a NoSQL data store, and it has no single point of failure, for maximum uptime. It's masterless. I think I already talked about some of the functionality with regard to elastic scalability. Oh no, I'm talking about that next, sorry. But it's a linearly scaling database, so you can add more nodes as needed whenever you need to. So, ten things about Cassandra, and I think I actually put twelve. There's no concept of master and read replicas. At DataStax, we test 1000-plus node clusters every day, for days at a time. It allows structured, semi-structured, and unstructured dynamic data sets. What's different about that versus working with an RDBMS is that you're not locked into a particular schema. You can start with a schema, and I think later on I'm going to talk very briefly about Cassandra schemas and the data model, but I forget where I was going with that. But you can change it at runtime. You can do an insert that maybe doesn't have one of the columns, or you could add some more data. It's very flexible. Backups take less than one second. Adding a column to a table will take the same time with one row of data or a billion. New nodes automatically add themselves to the cluster. The DataStax drivers discover new nodes automatically; you don't have to do any kind of config yourself. You'll never have to shard again, which is good, because Cassandra does that behind the scenes for you. Every write is durable. Replicating across multiple data centers provides lower latency and means you can survive regional outages. Another nice thing is that just because you have nodes down does not mean that the database is down; you can lose over half your cluster and still be online and just fine. It's also a perfect platform for mission-critical data, and the largest production deployment that we know of has over 75,000 nodes, and I can't say which company that's at. And the last slide is just that it's very, very easy to use, which, when I had worked with Cassandra before coming to DataStax, I wasn't quite aware of. I'm not sure when CQL came out, but once I started working with it more heavily, I was just amazed by how easy it was, using CQL, for example. I think I originally started working with Cassandra using the Netflix API, which used the Thrift protocol, and it was just very, very awkward. It was at the time written in Java; there were a lot of threading issues, there was a lot of blocking. So if you're working, for example, in a Scala application that's built on Akka, working with a blocking Java API is just a nightmare. So, just a quick sample of what some of the CQL looks like: CREATE TABLE, which we're all familiar with, a primary key, INSERT INTO and the values. Very, very easy for programmers to get up and running with.
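(To give a feel for it, CQL along the lines of that slide might look like this. The table and its columns are a hypothetical weather-station example, not the actual slide content.)

    CREATE TABLE weather_station (
        id           text PRIMARY KEY,
        name         text,
        country_code text
    );

    INSERT INTO weather_station (id, name, country_code)
    VALUES ('010010:99999', 'JAN MAYEN NOR NAVY', 'NO');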
All right, this is a bit more in depth. I'm going to try to still quickly talk about the connector so that we can get right to the code, which is the good part. So, Spark on Cassandra. There's direct integration with Cassandra through the Spark Cassandra Connector. All of the select and where filtering that you want to do is done server side; it's never done locally on the driver. It's very data-locality aware, which is going to help you with speed, and I'm going to talk about that a bit: co-location. And it's a really great choice for doing things like time series data with Spark and the connector. So what it does is read data from Cassandra into Spark and, vice versa, write from Spark to Cassandra, and it can do it all in one computation as well. You can do that with Spark Streaming, where you're reading from a Kafka stream, doing Spark computations, and writing right into Cassandra, or the other way around. All Cassandra types are supported and converted into Scala types. All the computations are done on the server. You can use it with both Scala and Java; we do have a very small Java API, but it gives you all the same functionality. As for compatibility right now, our 1.1.0 beta 2 just came out, and it's compatible with Spark 1.1.0 and Scala 2.10.4, but one of our next tickets is going to make it compatible with Scala 2.11, with 2.12 coming after that. And as far as Cassandra versions, right now you can use it with Cassandra 2.1.1 and 2.0. It also offers virtual node support. So, a basic diagram of working with the connector: you have Spark executors, which feed the data into the connector, and the connector right now uses the Cassandra Java driver. But I'm going to be starting work on a Scala Cassandra driver soon, which is going to take a lot of the code we have that does driver work in the connector and move it out into a community Scala driver. And hopefully the plan is that we could then also add a new module for Slick to use the Scala driver, and other projects as well. Anyway, so it goes to the Java driver at the moment and from there into Cassandra. And with these two, you would have a Spark node and a Cassandra node co-located on one machine, and that way the data transfer happens much faster. Another thing, and I think I talk about this in another slide too, is that the way the connector works, it will try the local node first, and it's never going to try to talk to something outside of that data center. It has these algorithms built into it so that it can be much faster. So, working with Cassandra, one of the things to really think about is your use cases. Unlike working with an RDBMS, where you first do these massive printouts of diagrams of your city-sized data models beforehand, what you really want to do when you're working with Cassandra is start thinking about your data first. What does your data look like? Then start to think about what your queries are going to be. And based on what your queries are and what the data structure is, that's how you would really start to build out your Cassandra model, because you're thinking in a completely different way: you're thinking in terms of highly distributed, multi-data-center situations where you've got to think about latency, and everything has to be very fast.
So you have to construct these data models in such a way that everything happens very fast through these massive, massive systems, and with Cassandra it's pretty easy to do, which is nice. So, two quick slides, one on the Cassandra data model and the next one on the Spark data model. Here, for example, we have one table; these two rows share the same partitioning key, and here's your primary key. And then with Spark, as Evan mentioned, there are the resilient distributed datasets, the RDDs. Those are what handle all of the data sets; they're immutable, iterable, serializable, distributed, parallel, and they execute with lazy evaluation. So here's a quick sample of some code. Can you guys see that all right? Here we're just setting up the Spark conf first, and through that, all you need is one Cassandra connection point. You can pass in a comma-separated list of seed nodes if you want, but it just needs to know one of the nodes to start connecting to Cassandra, and then it'll figure the rest out on its own. Create the Spark context, or the Spark streaming context, from that. And then I wanted to show this particular line, because here we're starting to use the connector. You can call cassandraTable from Spark; I just need my keyspace. And automatically, if you don't modify it yourself, which I'm going to show you next, it comes back as a CassandraRow. But we do a lot of type conversion for you, so that all I have to do to get back a collection of my particular case classes is pass that type in instead. So here I'm setting up a Kafka stream: KafkaUtils.createStream, and I pass in my key and value types and the decoders. Here we're just doing a very, very simple setup with strings instead of arrays of bytes or whatever, and I'm just passing in the Spark streaming context, my Kafka params, and the topic. And then here, in one very elegant line of code, I say stream.map to get the values, then do a countByValue for a word count and save it to Cassandra, to my keyspace and my table. Very, very simple. Here's a different sample that uses the Twitter stream, where again I have my Spark context, I set up my streaming context from that, and I create my Twitter stream, where I pass in my Twitter auth. I can set filters or not set filters, I set the level of Spark storage that I want, and I can have some little transformation functions; it's very, very flexible. Here I'm just going to strip out all the cruft that happens in tweets and then pull out the hashtag that I might be looking for. Maybe I'm looking for some kind of suspicious data by particular users that I'm watching, or something like that, for a security use case. So then I can just flatMap over the stream, I map my transformation in there, countByValueAndWindow every five seconds, then transform it again, where I'm taking each of the RDDs and the time, and the time there indicates the batch time. And then from that, I want to be aggregating the terms I've got: a count of those terms, how many of them I've seen, and what the time was. And then I save that off to Cassandra, to a particular keyspace and table. And here I'm using SomeColumns, so I'm designating that I only want to save these particular columns.
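(Putting those pieces together, the Kafka-to-Cassandra path she just described might look roughly like this. It's a sketch: the topic, keyspace, table, and column names are placeholders.)

    import com.datastax.spark.connector.SomeColumns
    import com.datastax.spark.connector.streaming._        // saveToCassandra on DStreams
    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    def wordCountStream(ssc: StreamingContext, kafkaParams: Map[String, String]): Unit = {
      // Receiver-based Kafka stream of (key, message) pairs, both plain strings here.
      val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Map("wordcount-topic" -> 1), StorageLevel.MEMORY_ONLY)

      // Keep only the message values, count them, and save just the two named columns.
      stream.map(_._2)
        .countByValue()
        .saveToCassandra("my_keyspace", "word_counts", SomeColumns("word", "count"))
    }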
Yeah, working with historical data with streams is difficult, and I'll show you that in the code sample that I have. That part is difficult. So here, reading from Cassandra to Spark, we just call cassandraTable. We can do our selects, our where clauses, and all of that's happening on the server. I'm going to skip through this, but this is just an example of what our CassandraRDD looks like. It's really just implementing the RDD interface and doing all of the work for the Cassandra transformations in the background. One of the first things I did in this repo was start to add the streaming functionality, so I added this CassandraStreamingRDD. Paging reads. How much time do I have, by the way? Okay, I'll be fast. You can very easily page your reads. When you're doing the calls to cassandraTable, it's basically just a configuration setting; the default is something like 100,000, but you can set it to say I only want to page 1,000 items at a time, and Spark is going to iterate over that lazily. This is about co-locating for performance purposes; I'm going to skip over that, because I really want to get to the code to show you guys. Oh, this is good for our Scala audience. So, a lot of the type conversions that we do: you can basically use any case class; everything just has to be serializable for working with Spark, and it's going to do all of the transformations for you. So I can say case class Tweet, my own case class, and then, and this is what I wanted, if you have all the fields in that table mapped to that case class, that's all you have to do; the connector does the rest for you. You can also, instead of case classes, map things back to tuples. And there's a way to do some mapping to handle unsupported types; it's a very small API that lets you do the mapping yourself if you need to. And where to get the connector: right on GitHub, datastax/spark-cassandra-connector, and for adding it to your project, right now we're at 1.1.0 beta 2. Okay, let's look at some code. So this is an application that's up on GitHub right now. It's called KillrWeather, K-I-L-L-R weather. It's a time series application working with time series data. We picked some government weather data, so the raw data comes in by weather station ID, by year, month, day, hour, and then all of the different values for each weather station, like precipitation, temperature, et cetera. So this is just a basic object that we're going to start. This isn't deployable right now; you can just run it through SBT, which hopefully I'll have time to do for you. So I'm setting it up from the start, and this is just standard. Let me just adjust this so you can see it; sorry guys, I should have done this before. Is that better? Okay. You can also see this on GitHub. All right, so bring up the application. This is just a standard Typesafe Config setup where I've got a configuration file. The way I have it is just the way that anyone sane would do it: first, it expects to read settings from the deploy environment, from the environment, not Java system properties; if it doesn't find those, then it falls back to look for them in the system properties; and if they're not there either, then here in the settings file it has some reasonable defaults. So that's how that works.
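(That lookup chain, environment first, then system properties, then bundled defaults, can be sketched with Typesafe Config roughly like this. This is not the actual KillrWeather Settings class, and the keys are hypothetical.)

    import com.typesafe.config.{Config, ConfigFactory}

    // Each lookup falls back to the next source if the key isn't found:
    // environment variables, then JVM system properties, then application.conf defaults.
    // (Env vars would still need mapping onto dotted config paths, e.g. via ${?VAR} substitutions.)
    object Settings {
      val config: Config =
        ConfigFactory.systemEnvironment()
          .withFallback(ConfigFactory.systemProperties())
          .withFallback(ConfigFactory.load())

      val cassandraHosts: String = config.getString("cassandra.connection.host") // hypothetical key
      val kafkaTopicRaw: String  = config.getString("kafka.topic.raw")           // hypothetical key
    }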
And that has all of our Spark, Cassandra, Kafka, and application settings. So that's this settings class right here. Then this embedded Kafka: I wrote this little thing, it's just a local Kafka instance, and in the background it has an embedded ZooKeeper in it, so you can run this stuff for quick prototyping. It's actually available in the embedded module of the connector if you wanted to use it; we use it for integration tests as well. So give that a look if you need something like that. That's really simple. We create our Kafka topic, then we create our Spark conf with any kind of configuration we want. Here I've got my Cassandra host and some Kryo information. And then I create the Spark context and my streaming context, I create my actor system, and then here I have a node guardian actor. This is basically my main supervisor for the application that I'm going to bring up. And then I create two clients that are basically going to run in the background and make all the API calls, sending the message requests in; the application sends responses back, and I log everything so we can just sort of exercise the application. So let's go into the guardian. This is a standard Akka actor. It's cluster aware; I haven't fully hooked up the cluster yet, but it's going to happen pretty soon. In here we're creating all of our Spark computation actors: we've got our Kafka actor, our Kafka streaming actor, our temperature actor, precipitation, our weather stations actor. These are all really just modularized around the type of work we want to do in each of them. And there's this requirement with Spark Streaming, and I think TD at Databricks said that they're going to change this soon, but right now you have to declare all of your Spark output streams before starting the streaming context. Soon it'll be different, but right now you can't create other output streams once you've started the context; that hopefully is going to change. So what I do is bring up this Kafka actor first, because, let's go into this guy, that has all of our output streams in it. So our Kafka streaming actor, all this guy does is create the primary stream. A note again here: we're working with streams, so we're reusing these pipelines. The first thing I want to do is read from this raw data topic from Kafka, and immediately I'm saving that raw data into a raw data table in Cassandra. Just boom, snapshot. And then I'm going to do further aggregation work on it, but that way we can always go back to Cassandra, get our historical data, and do whatever work we want on it. So after I do that snapshot, first I'm going to do some really basic daily precipitation aggregation. That's what we're doing here: for each hourly weather data point per weather station that comes in, we're going to shove it into a table where, because of the data model, we want our weather station ID, which is going to be our partition key and, for data locality reasons, is going to tell us what node that's on. And then we have year, month, day, and the one-hour precipitation field, and we save that to the Cassandra daily precipitation table. Now, once we've declared all of our output streams, I just shoot this little message back to the supervisor: output stream initialized. So let's go back to this really quickly, so you can see a little Akka fun thing to do. I have two types of receives.
I have this receive where I say uninitialized orElse the super receive. So we start up this actor in an uninitialized state; the only thing that it can accept is this output stream initialized message from the Kafka actor. Once it gets that signal, it goes into the initialized function, and from there we're ready to start processing things. Okay, so this is the initialized function. What it does is start the Spark stream: it sets a checkpoint first, actually, which you want to do before you start the stream, and then starts the streaming context. And then we can say things like context become initialized, so we're changing the state of this actor now. And then it can start receiving requests like temperature requests. These are just traits: temperature request, precipitation request, weather station request, so any case class of that type, and again, these are all serializable. It forwards those on to the respective actors that handle that workload. And then we publish node initialized to the event stream, and that's just so that my clients, the ones I set up that are going to listen to everything, know, okay, now I'm going to start hitting this with some requests. So, the Kafka streaming actor we talked about. Let's go into some of the other actors real quick, just to show you some other computations you can do and how to work with futures, which is really important. Actually, I want to show you one really cool thing. About five more minutes? Okay. One really neat thing is right here in the Kafka streaming actor. In this computation, normally, if you weren't working with Cassandra and you didn't have a particular data model, you would have to do a reduceByKey. That's an additional computation that you would have to do. But because of the way the data model is set up, and you can see the data model on GitHub, we don't have to do that; Cassandra's going to do it for us, and I think I describe it in the comments here. So, our precipitation actor. Now that the Kafka stream does the computation to get the aggregates of the precipitation data into the daily precipitation roll-up table, we can, when we need to, do calculations based on that. So, like here, and this has been talked about a few times, a top-k: a request comes in for a particular weather station, a year, and a particular k, like do I want the top 10, top 20, whatever. So I can do this in a for comprehension where I say, on the Spark streaming context, cassandraTable. I'm going to read from that roll-up table: select precipitation where the weather station and the year match. And I can call collectAsync, and that returns a future. So in my for comprehension, now I have that sequence of aggregates for hours zero to 23, if they exist; you might have zero up to 23 hours of data in there. And then in the yield, I can say, oh, actually I meant to change this to parallelize, yeah, not makeRDD. So then I parallelize that sequence with Spark and I can just call top with k, you know? There's no boilerplate; I just let Spark do it for me. And then I can throw all of that information into my top-k precipitation case class. And because it's a future, I can pipe it; pipeTo, that's akka.pattern.pipe. I just bring that into scope, and then I can call pipeTo the requester.
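(Condensed, the pattern she's describing looks roughly like this. It's a sketch: the keyspace, table, column, and case class names approximate, rather than quote, the KillrWeather code.)

    import akka.actor.Actor
    import akka.pattern.pipe
    import com.datastax.spark.connector._      // cassandraTable, select, where
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._     // collectAsync via AsyncRDDActions

    case class TopKRequest(wsid: String, year: Int, k: Int)
    case class TopKPrecipitation(wsid: String, year: Int, top: Seq[Double])

    class PrecipitationActor(sc: SparkContext) extends Actor {
      import context.dispatcher                // ExecutionContext for the Future

      def receive: Receive = {
        case TopKRequest(wsid, year, k) =>
          val requester = sender()             // capture before the async boundary
          val result = for {
            hours <- sc.cassandraTable[Double]("weather_ks", "daily_precipitation")
                       .select("precipitation")
                       .where("weather_station = ? AND year = ?", wsid, year)
                       .collectAsync()         // Future[Seq[Double]]; nothing blocks here
          } yield TopKPrecipitation(wsid, year, sc.parallelize(hours).top(k).toSeq)

          result pipeTo requester              // reply asynchronously when the future completes
      }
    }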
And the requester was, you know, whatever actor is going to come and make the request for the top k, right? The cumulative one is doing something very similar, except it's handling annual precipitation requests. Same thing: getting an aggregate of information back from Cassandra, which gives us back a future, so we can create our case class, annual precipitation, and pipe it back to the requester. Everything's asynchronous, nothing is blocking, which is the goal. And finally, let's do the temperature actor. This guy is basically doing a very, very similar pattern of work. One thing, and this was the historical data thing someone asked about: there is a constraint in Spark, so I needed a workaround for what I ultimately wanted to do as far as a pattern. Is that for me, like, stop? Okay, two minutes. Okay, so the constraint in the Spark API is that the Spark context and the Spark streaming context are not serializable. So I cannot say, from my Kafka stream consuming from this particular topic, every time I receive a new hourly weather data point, for this weather station, for this year, month, and particular day, now go query Cassandra and do my aggregation in the stream. You can't do that, because the Spark context is not serializable. And believe me, I tried every different way. I asked TD, who heads up Streaming at Databricks; he couldn't even figure it out, but I mean, it's just not serializable, so it's not possible. So I had to figure out how I was going to meet this requirement. What I did was leverage Akka, although I'd rather have done it the other way, in the stream. So what I did was: a get daily temperature request comes in, and I have this day key. Based on the day key information, I'm going to ask Cassandra for that information from the raw data table that we originally saved off. I'm going to select temperature where, blah, blah, blah, matches the day, so I get zero to 23 hours of that daily data back. I get it back in a future as an aggregate, and then I construct the data for that day. Now, this is the thing, this is the workaround anyway, and this is a good thing to know. This StatCounter, StatCounter is in the Spark API. It's serializable, and what it's going to do is take the whole sequence of temperatures and, based on that, automatically give me a mean, standard deviation, and a lot of different statistics like that just out of the box. And so I'm going to construct that data. What I'm going to do is take that case class, the daily temperature, and say self, send data. So I'm sending it back to the actor so that I can asynchronously process this within the actor, but I'm still not blocking; I immediately send it back to the requester. But then what I do, in the actor up above in its receive, is it gets this daily temperature asynchronously and calls store, which just says parallelize it, here's my sequence of info, save it off to Cassandra to the daily aggregation table. So that was kind of a weird workaround, but it works for the moment. Hopefully that'll change in the API. And I think, let's see if I still have it, yeah, so this is the app on GitHub. We have a wiki, the KillrWeather wiki, and it's in killrweather/killrweather. The whole application is here, but the killrweather-app module is where all the computation actors are. So, oh, I do have an internet connection.
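(StatCounter, the piece that makes that workaround pleasant, can be used on its own like this; a tiny sketch with made-up temperature values.)

    import org.apache.spark.util.StatCounter

    // Hypothetical hourly temperature readings for one station and day.
    val hourlyTemps = Seq(10.2, 11.5, 13.0, 12.4, 9.8)

    // StatCounter is serializable, so it can ride along inside a case class or an actor message.
    val stats = StatCounter(hourlyTemps)
    println(s"high=${stats.max} low=${stats.min} mean=${stats.mean} stdev=${stats.stdev}")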
So everything's here. There's a little README, and it tells you where to start to look at setting up the... do I have any more time? You don't need it? Two more minutes. So I'm starting up Cassandra 2.1.1. There it goes, all happy. I'm going to start up a cqlsh shell, and I'm going to, let's see if I remember, source... I'm going to create, what do I call it? Time, shoot, series.cql. So this is my create schema. And then I'm also going to source the load timeseries one. All right, so we create the schema in Cassandra, we load some initial data, blah, blah. Okay, so then let's see if this works with the connection that we have. Then I can do sbt app, fingers crossed; demos don't always work. So it's going to bring up the app and you'll see some of the logging. It starts ZooKeeper, boo, and then starts Kafka, yay. And then, so this Kafka message count: I've got one Kafka consumer, and it's just going to tell me how many messages it's consumed on this topic. And then it's going to start to send the API requests to the node guardian, which forwards everything off to the respective actors, and we're going to see that start happening soon. I specifically have those wait, because I'm waiting for some of the data to come from Kafka to Cassandra first. So here we are requesting current weather for a weather station, received daily temperature. So this is all the client, and you can see all of this in the code. So that's it.