Thanks everybody. And thanks for coming to hear me talk. I promise I will be as fun as the beach party. So quick show of hands: who here is either dealing with, or is going to have to deal with, a lot of real-time data very shortly? Yeah, so a little bit of a selection bias, I guess. And how many of you are despairing that you might have to turn to Go or Scala or Clojure or some other language? Well, I'm here to tell you there's a better way. You can actually now use Storm and streamparse to do that without writing a single line of Scala or Clojure. All right, so just a quick background for those of you who aren't familiar with the difficulties of doing multi-core programming in Python. I'm sure somebody is having a fistfight about this very topic right now, somewhere in the building. But Python has the GIL. Does everybody here know what the GIL is? Python's global interpreter lock? Yeah, so a few hands. For those of you who don't: Python doesn't actually do true multi-threading — it doesn't compute on every thread at once. There's a global interpreter lock that prevents bytecode from executing on different threads simultaneously. This is normally not an issue for IO-bound tasks, like fetching data from the internet or fetching data from disk. But the moment you try to do any sort of computationally intensive task on more than one thread, you get lock contention, you get slowdowns, you get all sorts of terribleness. Traditionally — and I'm sure many of you here are familiar with this — the way this has been dealt with is a queue and worker system, which is the traditional way to sidestep the GIL. So you've got a queue, which is Redis or RabbitMQ, one of those things. Then you have workers, like RQ or Celery, that pull things off the queue, do some sort of processing on them, and push them into another queue, which another worker pulls from, and so on and so forth. This gets a little hairy.
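The queue-and-worker pattern described here can be sketched in pure Python with the stdlib `multiprocessing` module standing in for Redis/RabbitMQ plus RQ/Celery — a minimal, illustrative sketch, not any particular framework's API. Each worker is a separate process with its own interpreter, so one worker's GIL never blocks another's computation:

```python
# Minimal queue-and-worker sketch: a worker process pulls items off an
# input queue, processes them, and pushes results to an output queue.
# multiprocessing here stands in for Redis/RabbitMQ + RQ/Celery.
from multiprocessing import Process, Queue

def worker(in_q, out_q):
    while True:
        item = in_q.get()
        if item is None:          # sentinel: shut down cleanly
            out_q.put(None)
            break
        out_q.put(item * item)    # stand-in for CPU-bound work

if __name__ == "__main__":
    in_q, out_q = Queue(), Queue()
    p = Process(target=worker, args=(in_q, out_q))
    p.start()
    for n in range(5):
        in_q.put(n)
    in_q.put(None)
    results = []
    while True:
        r = out_q.get()
        if r is None:
            break
        results.append(r)
    p.join()
    print(results)  # [0, 1, 4, 9, 16]
```

Chaining several of these stages together — each worker pushing into the next worker's queue — is exactly the architecture that gets hairy at scale.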
And also, we're not just interested in beating the GIL anymore. We're going to max out even the largest of boxes. So we need something that can not only scale out to multi-core, but can scale out to a multi-node or cluster implementation, because we will max out even the biggest AWS instance eventually. Just quickly, before I continue: there's a really great in-depth dive into the GIL and lock contention, and a lot of the arguments around it, that our CTO gave — I have a link to that, which we'll share in the slides after. Anyway, the queue and worker architecture can lead to really, really complex diagrams. This is a diagram of what the Parse.ly architecture looked like a couple of years before I came on board. And quite frankly, it was tough to onboard new engineers with. It was tough to keep track of. You had queues, you had workers pulling from different queues, pushing to new queues, pushing to all sorts of different databases, and you had to keep track of what to deploy where and what had been deployed when. It was tough, and it took a lot of maintenance and just manpower to keep it going. But we found Storm, and Storm is fantastic. Storm is purpose-built. It's a distributed real-time computation system that simplifies this whole worker and queue business. And now, thanks to Parse.ly's work on streamparse and PyStorm, we have a completely native interface to Storm that lets you write your code and deploy it to a cluster without having to write a single line of Clojure or Java or anything you don't want to write. Just a quick background on us and what we use Storm for. We are a web analytics company, and we do content analytics for digital storytellers — so, like, Condé Nast, Mashable — and we ingest tons and tons of page view and heartbeat event data for these publishers so they can more easily track visitor loyalty, engaged time on page, how many page views certain sections or tags or posts got, et cetera.
And we use this data to power dashboards like these, which are available to editors, writers, anybody within the organization to see their performance, and also to power on-site APIs. So if you've ever seen "what's trending" or "what's popular" or "you might like", we offer a robust API system to pull data out of our system and into front-end widgets. We also offer, just recently, a data pipeline, which is access to our raw data. So not only are we now processing data into these dashboards and the API, we're now enriching raw data for our customers to build their own dashboards or products or whatever they'd like to build on it. And a lot of times when I'm with other programmers, when I talk about what we do and the scale at which we do it, I hear a few things that maybe you guys have heard. "Python can't do this." "The free lunch is over." "Python doesn't scale." "It's just a scripting language, a glue language." "Well, you should have used Go" — Rust, Scala, C, Java, name it, someone's recommended it to us. Except, "Python can't scale"? That's a screenshot of htop running across a couple of machines, and that's all of our cores maxed out, using just a couple of Storm topologies we deployed to the cluster. And it scales quite well. This is just a quick overview of the amount of data that we took in in 2016. It's a lot of data. I think on the US election day alone, we brought in something like 2 or 3 billion events. And it's all done in Python. So Python can scale. Storm can help you do it, and so can streamparse, and I'm here to help you do that. So streamparse is Python on Storm. It's an open source toolkit we've developed to help you get off the ground using Storm. As the name implies, it lets you parse real-time streams of data. You can integrate your own Python code with Apache Storm, which runs in the JVM — so there's a multi-lang interface. It's got a great quickstart and documentation, and good command-line tooling. It's production-tested; we use it every day.
It's very mature. It runs the absolute core of our real-time streaming system. And yeah, it's good for anything that requires sub-second latency: analytics, logs, sensors, these sorts of things. So what we're going to quickly try to cover is: what's a Storm topology, and how do I use it to process my data? How does it work internally? How can we use Python with it? A quick overview of streamparse, and then some examples of distributed systems that we run at Parse.ly. How many of you here are familiar with Storm or have used Storm already? Quite a few. Perfect. So then I can push through this a little quickly for those of you who know what's going on. Storm uses a couple of different abstractions to describe the overall computational graph that you create. You've got a tuple, which is just an individual record of data that's passing through the Storm topology. You've got a spout, which is a source of data. You've got a bolt, which is a component that processes input and sends it along the topology — the input can be a spout or another bolt. And the topology, which is the overall design: it's a collection of components that create the computational graph. And here's an example of what that might look like. You've got a spout, which can be any source of incoming data. It can be Redis, it can be a queue, it can be Kafka, what have you. Redis and Kafka are overwhelmingly the spouts people use, but it doesn't have to be. And then you've got different bolts that it passes the data on to, which do different sorts of processing: filtering, transformation, whatever it is you're doing to the data. And usually the final step is some sort of ETL process into Cassandra, Elasticsearch, what have you. So a tuple is just a single data record, and you can think of it just like a Python tuple. So the field spec here is ["word"], and the word is "dog". If it's ["word", "count"], you've got "dog" and an integer as the second value.
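Since Storm tuples pair a field spec with values, they behave much like Python namedtuples. A small illustrative sketch (the class names here are mine, not Storm's):

```python
# Storm tuples are named records: a field spec plus values.
# namedtuple is a close stdlib analogy.
from collections import namedtuple

WordTuple = namedtuple("WordTuple", ["word"])                 # fields: ["word"]
WordCountTuple = namedtuple("WordCountTuple", ["word", "count"])  # ["word", "count"]

t1 = WordTuple(word="dog")
t2 = WordCountTuple(word="dog", count=4)
print(t1.word, t2.count)  # dog 4
```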
And here's an example of what might be an incoming spout. This is kind of an arbitrary spout; it comes from our quickstart, from our sample word count topology. And you can see here, this is basically just a spout that cycles through a predefined list of words and emits the words. The important part, of course, being that it's written entirely in Python. You don't have to go into Clojure or Scala or anything. I skipped ahead of myself a bit, but yeah, spouts can be any source of input. It doesn't have to be continuous. There's some confusion here for a lot of people: it's totally okay for a spout to stay idle, or to sleep a little bit and poll an incoming data source for new data. It doesn't have to be a continuous stream of events. So you can use it even for things that just get bursts of data every now and then. You can still use the spout for it, and it will be totally fine. And in fact, we use it that way as well. Here's an example of a bolt. So you can see here, the bolt takes in a tuple that we just passed to it from the spout. And in the process method that you override from the Bolt class, you can do whatever you need to do with the word. In this case, we're arbitrarily incrementing a counter based on whatever the word happened to be. But you can do things far more complex than that, of course. And once you've processed the tuple, you can then pass it on. So you can create very complex computational graphs this way. A topology is simply the description of the graph itself. So you can see here, nothing crazy: we've got our word count topology. We've got a word spout, and then we've got a word count bolt that takes input from the word spout, groups on the word field — which we'll cover in just a second — and has a parallelism of two. And so you can describe your entire topology. This is a very simple one; you can have big topologies with many bolts and spouts, and you'll be good. This is just a quick overview of that. So you've got your input, your output.
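The word-count logic described above can be sketched standalone, without a Storm cluster. This is only an emulation of the quickstart's behavior — the real classes subclass streamparse's `Spout` and `Bolt` and are driven by Storm, not by a hand-written loop:

```python
# Standalone sketch of the quickstart word-count logic: a spout that
# cycles through a fixed word list, and a bolt that counts each word.
from collections import Counter
from itertools import cycle

class WordSpout:
    def __init__(self):
        self.words = cycle(["dog", "cat", "zebra", "elephant"])

    def next_tuple(self):
        return (next(self.words),)   # emit a one-field tuple

class WordCountBolt:
    def __init__(self):
        self.counts = Counter()

    def process(self, tup):
        word = tup[0]
        self.counts[word] += 1       # arbitrary per-word counter

spout, bolt = WordSpout(), WordCountBolt()
for _ in range(8):                   # drive the "topology" by hand
    bolt.process(spout.next_tuple())
print(bolt.counts["dog"])  # 2
```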
So the count bolt has the word spout as its input, and the count bolt outputs words — you can see how it outputs the enriched tuple. So the original tuple comes in from the spout, and the enriched tuple comes out from the bolt. Something I would like to touch on is the grouping. Storm has something really cool here, which we'll get to on the next slide. For stateful processing, you can make sure that, based on certain values of the tuple, it gets passed to the same task for consistency. So since we're grouping on word, that value will get hashed, and then Storm will make sure that tuples with that value keep going to the same task, which can be really useful. For the bolt, shuffle grouping just means that the tuples will get passed around to tasks in a balanced way — the tuples get allocated out to the tasks fairly, so that no one task gets overwhelmed. Something really cool about Storm is that it acks and fails every tuple. It has a very robust messaging system as the tuples move through the graph. Storm acks or fails every individual tuple, which makes it really easy to do reliable messaging using Storm. So you can do at-least-once, exactly-once, that sort of thing. You can keep track of every tuple as it's going through the chain. Streamparse actually implements auto-ack and auto-fail for you, but you can also override those, and you can specify when a bolt acks or fails a tuple. And then you can replay it right from the spout. So it's really fantastic; it has saved us quite often. By the tuple tree's very nature, you can track the progress of a tuple through the system. I won't linger too long here, but Storm also has a really nice UI. Once you've deployed Storm to your Nimbus — the master node is called the Nimbus —
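The fields-grouping idea — hash the grouped field, and route every tuple with the same value to the same task — can be sketched in a few lines. Storm's actual hashing differs; this just illustrates the routing property, using `zlib.crc32` as a deterministic stand-in (Python's built-in `hash()` is randomized per process):

```python
# Sketch of fields grouping: hash the grouped field, mod the number of
# tasks, so the same word always lands on the same task.
import zlib

NUM_TASKS = 4

def fields_grouping(word, num_tasks=NUM_TASKS):
    # crc32 is deterministic across processes, unlike built-in hash()
    return zlib.crc32(word.encode("utf-8")) % num_tasks

# The same value always routes to the same task:
print(fields_grouping("dog") == fields_grouping("dog"))  # True
```

Shuffle grouping, by contrast, would ignore the value entirely and spread tuples round-robin or randomly across all `NUM_TASKS` tasks.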
Once you've deployed to the Nimbus, you can log in via a really nice web interface, and you can see topology summaries, spout and bolt summaries, the number of tuples acked and failed, the number of tuples emitted; you can see exceptions on each of the workers. It's really useful. It's a great tool for doing real-time distributed computing. And the way it does this is: you have a cluster. You have a Nimbus node, the master node, and you have worker nodes, each with a certain number of slots that you specify, depending on how computationally intensive what you're doing is. And then you simply deploy the topologies to this remote cluster, and the topologies, based on what you've configured, will use up the slots of the worker nodes to do the computation. You can run into issues of contention there, of course. If you try to deploy a fifth topology and there's no more space, your topology will just kind of sit there with no worker processes. You can then rebalance the Storm cluster to take that into account, and streamparse can help you do that. So, Storm is pretty cool. If any of you have been searching for a real-time distributed system that can do this at scale, you've found it. Storm guarantees processing via tuple trees. You can tune parallelism per component, so you can have fewer or more slots for less or more computationally intensive tasks. It's highly available: if a node drops, the Storm cluster can take care of it. Each slot actually spawns a new Python subprocess, so it sidesteps the GIL — it's all multi-process, not threaded. You can rebalance computationally intensive tasks across your cluster, and it handles acking and failing and messaging automatically. Except, until now, it was really hard to use it with Python, which brings us to our main thrust. We want to use Python for this, right?
Nobody wants to write Go or Scala — or maybe you do, but that's not what we're here for. We're here to stick it to all the people who said we couldn't use Python for this. So there is something called the multi-lang protocol that Storm implements. This allows you to interface with the Storm cluster in a different language. Storm does ship with something: if you have tried to do this before, you've probably encountered something called storm.py. Has anybody encountered storm.py? Yeah. So there are a couple of issues with storm.py — I should have had a slide for this. It's not really that Pythonic; it's more of a reference implementation than anything. And it's a bit janky: you've got to package up your own JARs, and there are quite a few other issues. So we decided that storm.py wasn't really taking into account all of the neat things that Storm's multi-lang protocol allows. Because the multi-lang protocol, on its face, is actually pretty cool. It's just JSON passed back and forth between the Java implementation and the Python shell. It works via shell-based components that communicate over standard in and standard out. So it's quite clean, very Unix-y. It doesn't use Py4J — I don't know if anybody here uses Spark, but if you've ever hit a Java stack trace and been like, "how did my code cause that?", you won't have that here. And it's also pretty simple to implement. Sorry, I'm a bit redundant, but yeah: one process per task, speaking JSON. And you can control parallelism. The handshake between the Python process and the Java implementation controls the configuration and the context of the process. So it's really neat. But up until a couple of years ago, there wasn't really a great way to use Storm. And I should note that there are two projects that we have. There's PyStorm, which is really under the hood — it's the multi-lang protocol implementation by which we communicate with Storm.
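To make the "just JSON over stdin/stdout" point concrete: in Storm's multi-lang protocol, each message is a JSON object followed by a line containing `end`. A minimal sketch of that framing (PyStorm handles all of this for you in practice; the function names here are mine):

```python
# Sketch of Storm multi-lang framing: JSON messages delimited by a
# line containing "end", exchanged over stdin/stdout.
import json

def encode_message(msg):
    # One message on the wire: JSON, newline, "end", newline.
    return json.dumps(msg) + "\nend\n"

def decode_messages(stream_text):
    # Split an incoming stream on the "end" delimiter lines.
    parts = [p.strip() for p in stream_text.split("\nend\n") if p.strip()]
    return [json.loads(p) for p in parts]

wire = encode_message({"command": "emit", "tuple": ["dog"]})
print(decode_messages(wire))  # [{'command': 'emit', 'tuple': ['dog']}]
```

Because it's just line-delimited JSON on standard streams, any language with a JSON library can speak it — which is exactly why the protocol is so Unix-y.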
But what we're going to focus on for the remainder of the talk is streamparse, which is your interface to Storm that uses PyStorm under the hood. But we do maintain both. So, streamparse. We initially released it a couple of years ago, in 2014. It's had over two years of active development and, again, it's heavily battle-tested. It's got a lot of stars on GitHub, it's got a lot of contributors — 31 contributors — and we have three engineers actively maintaining it. And yeah, it's battle-tested; we pass tons of data through it every day, many millions of events, and it does great. I was going to do a live installation of streamparse, but in the interest of time, I just prepared some of these commands for you guys. But yeah, the only outside dependency is lein — Leiningen, which is a Clojure build tool. Once you have lein installed, just create a new virtual environment, run pip install streamparse, and then do sparse quickstart — sparse is the command-line interface. And then you can just change into the directory that it creates, and you can actually run a Storm topology locally, right on your machine. Oh, I should have mentioned this: you do need to have the Storm development environment set up on your machine, but you can just download that from the Apache Storm website — it's basically just putting the Storm bin on your path. And I do have a GIF of this in action. It's super easy. And you can see here, it's compiling the JAR, it's doing everything for you behind the scenes, and then it's off to the races. There you go. And you'll notice all of the cores were being utilized. So it very neatly sidestepped the GIL. And it was five minutes — well, not even five minutes — from pip install to fussing with it. Which is, you know, we think quite nice. And it's the same idea.
Once you have a remote cluster set up and configured, you can simply type sparse submit — there's a small config JSON file, documented in the help docs, where you give it the host name, the parameters, et cetera — and it does all of this, which is really great for deploying. It makes a virtualenv across the cluster and installs the requirements, builds a JAR of the source code, opens a tunnel to the Nimbus, constructs the topology spec in memory, and then uploads the JAR to the Nimbus and starts the topology. So it's actually really great for taking the headache out of deploying. You don't have to worry about JARs, you don't have to worry about out-of-date requirements, or writing a fab file that manually updates all the virtual environments. Sparse can do this for you. So it takes a lot of the headache out of deploying these topologies and maintaining them. And there are a lot of other commands that it has — that's just a quick overview of them, some of which are diagnostic, some of which are functional, but I encourage you to explore them all. And this is an example of how we use it in production. We have a couple of spouts coming in, and we take the page view events in real time, pass them through the topology, and batch-insert them into an Elasticsearch cluster. And you can see it's not a very complicated diagram, but it's extremely robust. It's an abstraction, of course, but an abstraction of one that we use in production every day. It's being used right now. So you don't just have regular bolts that process one tuple at a time. You can also have batching bolts, which are great for databases. Databases don't like to be hammered with one-at-a-time insert and update requests. So you can batch a bunch of tuples every one or two seconds or so and do one batch operation on them.
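The batching idea — buffer tuples, then flush them as one bulk operation every N tuples or T seconds — can be sketched like this. streamparse ships batching bolt classes that do this for you; the `BatchBuffer` class below is purely illustrative:

```python
# Sketch of a batching buffer: accumulate tuples and flush them as one
# bulk operation when the batch is full or old enough, instead of one
# database write per tuple.
import time

class BatchBuffer:
    def __init__(self, max_size=100, max_age=2.0, flush_fn=print):
        self.max_size, self.max_age = max_size, max_age
        self.flush_fn = flush_fn        # e.g. a bulk insert into a DB
        self.buf, self.oldest = [], None

    def add(self, tup):
        if not self.buf:
            self.oldest = time.monotonic()
        self.buf.append(tup)
        if (len(self.buf) >= self.max_size
                or time.monotonic() - self.oldest >= self.max_age):
            self.flush()

    def flush(self):
        if self.buf:
            self.flush_fn(self.buf)     # one batched operation
            self.buf, self.oldest = [], None

flushed = []
buf = BatchBuffer(max_size=3, max_age=60.0, flush_fn=flushed.append)
for i in range(7):
    buf.add(i)
print(flushed)  # [[0, 1, 2], [3, 4, 5]] — item 6 still buffered
```

In a real bolt, `flush_fn` would be something like an Elasticsearch bulk-index call, and a timer tick would trigger the age-based flush even when no new tuples arrive.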
And there are already classes for that: a batching bolt and a tickless batching bolt. And also, remember that we ack and fail every individual tuple. So there's also already a class in streamparse for messaging reliability. It's a spout that will automatically replay any tuple, up to a certain number of retries that you specify, which makes it great for easy, low-latency, highly available, reliable messaging. There are a couple of — I'm down to the wire here — there are a couple of performance considerations. A couple of you, during the multi-lang protocol part, might have thought: hey, isn't passing JSON a lot of overhead to serialize and deserialize as things move through the topology? The answer is yes. So if you're processing lots of little small messages, it's better to use the batching bolt and batch them, like, once every second or so. And it's best to filter tuples out quite early, so you're just not passing as many tuples through. Don't let grouping overwhelm you. It seems really nice and convenient to have stateful processing, but if your data is imbalanced — if one value is ultra common — grouping will swamp that one executor. So use shuffle unless you have to group. And use several small topologies instead of one huge one. It cuts down the amount of work if a tuple fails, because remember, when a tuple is replayed, it gets replayed all the way from the spout. And also, Storm is more efficient at tuple tracking in smaller topologies, which is actually a recommendation we got from the Storm devs themselves. And, final slide: cool stuff we want to do. We want to implement Kafka reader and writer bolts, which is one of the huge use cases for streamparse, and MessagePack support — a binary serialization format that cuts down the overhead of JSON. And that's it.
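The replay-up-to-N-retries behavior described above can be sketched as a small bookkeeping class: the spout remembers in-flight tuples by id, re-emits on failure, and gives up after the retry limit. streamparse provides a reliable spout class with this behavior; the names below are illustrative:

```python
# Sketch of retry-limited replay: track in-flight tuples by id,
# re-emit on fail, drop after max_retries.
class ReplayTracker:
    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.pending = {}              # tuple id -> (tuple, retry count)

    def emitted(self, tup_id, tup):
        self.pending[tup_id] = (tup, 0)

    def ack(self, tup_id):
        self.pending.pop(tup_id, None)  # done; forget it

    def fail(self, tup_id):
        tup, tries = self.pending[tup_id]
        if tries + 1 >= self.max_retries:
            del self.pending[tup_id]    # retry budget spent; give up
            return None
        self.pending[tup_id] = (tup, tries + 1)
        return tup                      # caller re-emits this tuple
```

Because Storm acks or fails every tuple through the tuple tree, this bookkeeping is all a spout needs to get at-least-once delivery with a bounded retry cost.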
Yeah, that's streamparse and PyStorm, and we're Parse.ly. Okay, thank you very much. Who has a question? "So I wanted to ask, why did you decide to use Storm and not, for example, Spark?" Actually, if I'd had the time, I was going to go into why we didn't use Spark micro-batching. We actually do use Spark in our historical analysis layer. But we found that Spark micro-batching doesn't really let you do sub-second latency. So it just made sense to use Storm, really. But I think if you're not worried about latency, and you already use Spark, micro-batching is a perfectly fine alternative. Here's another one. "Okay, so imagine that I would create a real-time system that aggregates some data. For example, we have, say, 100,000 requests per second, and we need to get some aggregations. For example, you put in strings like cow, dog, whatever, and I want to tell how many cows there were in a second, in a minute, in an hour. So do I attach some fast storage to streamparse? Or how would you do that?" I can tell you how we do it, because we do something similar. When I was talking about engaged minutes and page views: we actually aggregate those in an Elasticsearch database. And we're not at quite 100,000 requests a second, but we're in the ballpark. The way we do it is, streamparse actually writes to a Cassandra cluster, and then every — I don't know exactly the amount of seconds, but every X seconds — we create a roll-up document in Elasticsearch, which we then insert with the aggregate count, and they're bucketed by timestamp. So you could do that whole Elasticsearch and Cassandra pipeline. It actually handles page view by page view quite well — the latency is pretty good. But you run into issues of storage if you have one Elasticsearch document per page view, which is why we do the aggregate roll-ups.
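The roll-up approach described in this answer — aggregate raw events into one document per (key, time bucket) instead of one document per event — can be sketched as follows. The bucket size and field names are illustrative, not Parse.ly's actual schema:

```python
# Sketch of timestamp-bucketed roll-ups: count events per
# (time bucket, key) so the store holds one document per bucket
# instead of one per raw event.
from collections import defaultdict

BUCKET_SECONDS = 60  # illustrative bucket size

def bucket(ts, size=BUCKET_SECONDS):
    return ts - (ts % size)          # floor to the bucket boundary

def roll_up(events):
    # events: iterable of (timestamp, page) pairs
    counts = defaultdict(int)
    for ts, page in events:
        counts[(bucket(ts), page)] += 1
    return dict(counts)

views = [(1000, "/a"), (1030, "/a"), (1075, "/a"), (1075, "/b")]
print(roll_up(views))
# {(960, '/a'): 1, (1020, '/a'): 2, (1020, '/b'): 1}
```

Each dict entry here corresponds to one roll-up document; answering "how many in a minute or an hour" is then a cheap query over bucket keys rather than a scan over raw events.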
But I see no reason that a similar setup wouldn't work, doing aggregate roll-up documents in a database. More questions? "Thank you for the talk, first. And the question is: if we want to deploy it on AWS and support auto-scaling, what possibilities does Storm provide to support auto-scaling?" I'm actually not sure as far as the auto-scaling goes, unfortunately. We do all our node scaling by hand, because we use reserved instances. So I'm actually not sure, sorry. "Hi, I wanted to ask how the streamparse programming model compares to normal Storm or Trident, or maybe Spark or Flink." Sorry, say again? "How does the programming model of streamparse compare to Trident or normal core Storm or Flink or Spark? What are the abstractions?" Oh, OK, yeah. I'm actually not super familiar with Trident, so I can't speak to Trident. But I think the core abstraction that you would compare the Storm topology to in Spark would be the RDD. In a similar way, from an RDD you would fetch data and then perform different functions on it — you would keep calling different functions like map, reduce, or aggregate on the Spark RDD. The equivalent of that would be Storm's topology: you have spouts and bolts, and you could, I suppose, reimagine the bolts as functions. So this tuple is getting processed through these functions up until the end, where it gets loaded or otherwise evaluated somewhere. That, I suppose, would be the closest abstraction. Maybe one last short question? That's a good one — you're not the first and you're not the last to ask. So we basically have engineers that focus on — so, our product team does the actual core development work and features, and develops anything that is customer facing or client facing.
So I would work on some of those dashboards, or the data pipeline ETL, or the API stuff. That would all fall into the realm of our success engineers — engineers who will also respond to customers' problems and also work on features that directly impact them. OK, thanks, Alexander, again.