Everybody, thanks for coming out. So without further ado, let's get talking about activity streams. Today's talk is "Activating Your Site: A Look at Activity Streams," presented by us. We're gonna go over briefly what they are, how you can implement them, and why you should care about putting them on your site. Briefly: I'm Justin, and this is Ben and Farhan. We all work at National Geographic, which uses a lot of Django on some of our sites, and we've been doing this for a while.

So here's the agenda. We're gonna talk about activity streams: what they are, why you should care about them on your site, and some of the engineering concerns around implementing them. Then we're gonna talk through the open specification for activity streams, and then two solutions, one expressly in Django, and the other a service that can work for any website.

So what are activity streams? They're actually everywhere: GitHub, Facebook, LinkedIn, Twitter, et cetera. You've probably run into at least one of these today. They're essentially a way of displaying actions that people take on your site. They also give people a way to socially link themselves to other people who are generating activities, and amplify what they see in their stream.

So why do all this stuff? What's the point? It turns out it's really good at increasing engagement on your site. Even something as simple as putting a like button on your site will drive traffic immensely, because essentially people like clicking on things. We tried this out at Nat Geo and, just very basic, without a lot of user feedback or anything like that, people just like clicking things to say, yes, they approve of this content. So that's the first half: being able to publish your activities to a site. The other half is being able to consume them and see what your friends are doing.
And with those two in combination, you actually set up a really powerful positive feedback loop that will drive lots and lots more content and engagement on your site, and that's a really great thing to do. In terms of engineering, we get a lot of data out of it. Hooray, data. And with that data, we do a lot of fun stuff, including tons of analytics. You can drive recommendations, like Netflix showing you related content. We have social graph apps. We can show you trending content, you can drive A/B testing, and tons more. But there are some engineering considerations that have to be taken into account when implementing this, and Ben's gonna dive into that a bit more.

Hi. So these are just some of the problems that activity streams present. There are no great solutions to them; it's not a simple, clear-cut answer. It's all about weighing the benefits of each option, and I'm gonna dive into each one of these a little bit.

So the first problem is "too many peppers." Essentially, there are so many implementations of activity streams out there, and that's kind of evil. We don't really get the benefit of cross-site implementation. It would be really great if I could do an activity on Facebook or somewhere else and then get that same thing at National Geographic or another site. It causes duplication of work, and there's no common semantic structure, so it's kind of hard to implement these things and come up with the terms on your own every time. We're gonna talk about the solution to that in a little bit.

The second problem we might face is: what do we actually store, in what data structure, and in what schema? Is this gonna be a relational database or a non-relational database? There are different efficiency trade-offs here. Are we doing an adjacency list, an adjacency matrix, a doubly-linked list, something else, a key-value store, just a hash? Are you gonna have more writes or reads?
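To make the data-structure question concrete, here's a plain-Python sketch of two of those options for storing a follow graph. This is purely illustrative, not what any of the products discussed actually use:

```python
# Adjacency list: good for sparse graphs (most social graphs are),
# memory proportional to the number of edges.
follows_list = {
    "alice": {"bob", "carol"},
    "bob": {"carol"},
    "carol": set(),
}

# Adjacency matrix: O(1) edge lookup, but O(users^2) memory,
# which is wasteful when most users follow only a few others.
users = ["alice", "bob", "carol"]
idx = {u: i for i, u in enumerate(users)}
matrix = [[False] * len(users) for _ in users]
for src, dsts in follows_list.items():
    for dst in dsts:
        matrix[idx[src]][idx[dst]] = True

def follows(src, dst):
    """Constant-time edge lookup in the matrix representation."""
    return matrix[idx[src]][idx[dst]]
```

The read/write balance the talk raises is exactly what decides between representations like these.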
So if you're a social media company, you might show the news feed a lot, and then you're gonna have a lot of reads. If you're a company like National Geographic, it's mostly content; I'm not gonna show you the news feed that much, and I'm gonna have a lot more writes. And then, do I store the activities in the same place that I store the actual stream? Or is the stream pre-computed, or something like that? I'm gonna touch on that in a second.

Another problem is centrality versus sparsity. Haters gonna hate: you're always gonna have your celebrities, people who have a lot of followers or do a ton of activities. And the reverse of that is sparsity, forever alone: I only have one friend, or I've only done one activity. Each of these presents a different problem. For the celebrity, do you start to enforce limits? At some point it's gonna be too much to compute. Facebook, if I remember correctly, has a limit of 5,000 friends. Do you solve that with a data limit? Is that a hard limit? Or do you do some sort of UX solution where you hide the follow button at some point? And then the other problem: what do I show a person that has one friend, or has only done one activity? Am I gonna show them a stream with that one thing, a stream that has only that one friend's activities?

Scalability is obviously an important factor here. What am I dealing with, a massive amount of activities or just a few? Where do I do this computation? Am I offloading it to somewhere else? Am I doing it inside the request-response cycle, or doing something else with it? And how do I handle complex queries like friends of friends, or follows, or recommendations?

And then real-time versus pre-computed: when you do your computation is an important factor. Are you pushing out in real time? If so, how do things take precedence?
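The real-time versus pre-computed trade-off is often sketched as fan-out-on-write versus fan-in-on-read. Here's a minimal plain-Python illustration (all names and data are made up):

```python
from collections import defaultdict

# Follower graph and per-user activity logs (illustrative data).
followers = {"celebrity": ["alice", "bob"], "alice": ["bob"]}
posted = defaultdict(list)     # author -> their own activities
inbox = defaultdict(list)      # pre-computed streams (fan-out on write)

def publish(author, activity):
    posted[author].append(activity)
    # Fan-out on write: O(#followers) work now, O(1) reads later.
    # This is the cost that explodes for a celebrity with many followers.
    for f in followers.get(author, []):
        inbox[f].append(activity)

def read_precomputed(user):
    """Pre-computed: just hand back the inbox."""
    return inbox[user]

def read_on_the_fly(user, following):
    """Real-time: cheap writes, O(#followees) work per page view."""
    return [a for followee in following for a in posted[followee]]

publish("celebrity", "celebrity posted a photo")
publish("alice", "alice liked a video")
```

A half-and-half system, like the one described later in the talk, pre-computes some streams and assembles others at request time.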
And are you doing some sort of half-and-half, where you're pre-computing some stuff and then computing on the fly for other things? And what are you actually sending? Are you just calculating the ranking of things? Are you doing the entire thing, including the HTML, and then just sending that off? What are you really going to compute?

So I'd like to invite Justin back on, and we're gonna talk a little bit about the solutions to these things and some of the implementations that we came up with.

So there is a solution for some of these problems. It turns out there's an open specification, the Activity Streams specification, that aims to solve at least the proliferation of implementations by getting everybody onto one semantic structure for the data and how it's stored. This has actually been around for a while, and all of these companies here are implementing it in some fashion or another. We implement it at National Geographic, and the other solutions we show you today also implement it. It supports Atom and JSON serializations out of the box, in a human-friendly and machine-processable way. Quick disclaimer: nobody up here is officially involved in the specification. We're just implementers who are sharing our experience with you.

So the spec looks a little bit like this. There are three big entities. The actor, which is the required one, is normally the user on your site; it's whatever's taking that particular action. The verb is whatever actually happens, like "commented" or "posted." The spec defines a whole list of officially supported verbs, but on most of the sites I've seen this implemented, they take those, and for the ones that aren't there they sort of just run with it and customize their own. There's an official channel to go back and propose verbs that aren't there to the draft specification, which I encourage you to do on their website.
You also have an action object, which is the primary object of the activity, whatever gets created essentially, whether it's a photo in an album or a comment on a blog post. And the last entity, the target, is wherever that action's taking place; it could be the album or the blog post, wherever the action is going. And then there's a little bit of meta information, like what the timestamp is and a descriptive title and summary.

So this is what the JSON implementation looks like, real quick. You can see the actor, object, and target each have their own JSON object right in there. And we stuck with JSON as the first-class citizen in the rest of our implementations.

So I'm gonna talk to you right now about django-activity-stream, which is an open source project that I wrote and that has been used on several different sites. It uses the specification. It can track any object in your Django project, it runs on any supported Django database, and it keeps track of everything using generic foreign keys. It also provides a way for you to render these streams onto your site using template tags or feeds. Everything's generated at request time, and I leave the caching up to you. You can read along with the source code at my GitHub repository, or read the docs.

So when I'm not working at National Geographic, I'm CTO of Nalwa Studios, which is the gaming company that runs the Humans vs. Zombies game. HvZ is essentially an organized game of tag that's played at colleges, universities, and other locations all over the world, with thousands of players doing hundreds and hundreds of actions a day. We needed a way to give that feedback and show people their streams. So this is HvZ Source, the Django site that we set up to help moderate that game, and you can see a simple action right here; it's just the row.
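Written out in the spec's JSON serialization, an action like the one on the slide looks roughly like this. The field values here are illustrative, not taken from the real site:

```python
import json

# actor / verb / object / target, plus published timestamp and title,
# following the Activity Streams draft's JSON shape.
activity = {
    "published": "2013-07-04T12:00:00Z",
    "title": "crazyface selected the Original Zombie in the Demo Game",
    "actor": {"objectType": "person", "id": "user:42",
              "displayName": "crazyface"},
    "verb": "selected",
    "object": {"objectType": "player", "id": "player:7",
               "displayName": "Original Zombie"},
    "target": {"objectType": "game", "id": "game:1",
               "displayName": "Demo Game"},
}

serialized = json.dumps(activity)
```

Note that actor, object, and target are each their own JSON object, exactly as described above.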
Crazyface is the actor, "selected" the verb, the Original Zombie is the object, and the demo game is the target. And then the timestamp is run through the timesince filter, all displayed right there.

So behind the scenes, it uses two models to accomplish everything. The main Action model has generic foreign keys named actor, target, and action_object that can point to any object in your Django database; it doesn't have to be a user. It also has that descriptive meta information I was talking about. The second model is Follow. It has a foreign key to your user, whether that's Django's auth user or a custom user, which it also supports; that's up to you. And then a generic foreign key you point at any other entity that you'd like to follow. It doesn't have to be a user, it could be anything else, and it maintains that relationship in the database.

So generic actions are pretty simple. You just import the signal and send it along with the arguments. Crazyface is the actor, first and foremost, and then all the other arguments are sent in as keyword arguments. So crazyface selected the Original Zombie, with a target of the demo game.

Following and unfollowing is similarly simple. There's just a follow and an unfollow function that take a user first and an entity second, and either create or destroy that Follow object. There are also followers and following functions, which return a queryset of users that follow a given entity, or, as the reverse lookup, a list of entities that a given user is following.

And of course there are streams. What good is this app if you can't show it? It comes with a few built-in streams, the user stream being the most important one. It takes a user, finds the things they follow, and then gives you the actions those have done. That's like your main dashboard on GitHub or Facebook or Twitter. That's probably the most important one.
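The two-model idea and the user stream can be sketched in plain Python, ignoring the ORM. The names mirror the talk's example, but this code is illustrative, not django-activity-stream's actual API:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class Action:
    """Stand-in for the Action model: actor/verb/object/target plus meta.
    Generic foreign keys are approximated by holding objects directly."""
    actor: Any
    verb: str
    action_object: Any = None
    target: Any = None
    timestamp: datetime = field(default_factory=datetime.utcnow)

actions: list = []
follows: list = []   # (user, followed entity) pairs, like the Follow model

def send_action(actor, verb, action_object=None, target=None):
    actions.append(Action(actor, verb, action_object, target))

def follow(user, entity):
    if (user, entity) not in follows:
        follows.append((user, entity))

def user_stream(user):
    """Actions whose actor is something the user follows."""
    followed = {e for u, e in follows if u == user}
    return [a for a in actions if a.actor in followed]

send_action("crazyface", "selected", "Original Zombie", "Demo Game")
follow("justin", "crazyface")
```

The real app does the same reverse lookup through the database, with generic foreign keys standing in for the plain object references here.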
And then there are similar ones, the actor, target, and action object streams, that do a similar lookup based on whatever context a given object is in for an action. So the actor stream of crazyface will show me all the actions where crazyface was the actor. The model stream is another interesting one: it'll show you any and all actions that involve particular content types. So this will show me anything that is happening with any user model.

If I were to graph out the interesting one, the user stream query: essentially you're given the user object, it goes and finds the content that you follow, and then it reverse-looks-up the actions that involved those objects. It can also filter down by relationship. By default, the user stream returns everything where anything you follow was involved, but you can customize that as well.

Those only get you so far; there are also custom streams, which are easy to implement. This first one right here, player_actions, takes a game instance, finds the player IDs that are in the game, and then returns a query saying: show me all the actions where the players of this game were the actor. It does that by object ID and content type. The stream decorator gets you a little power: you can just return queryset arguments or keyword-argument filters, you don't have to return a queryset. The second example here, player actions by slug, does the same thing the first one was doing, only it takes a slug instead of an instance. We're gonna get back to that guy a bit later and why that's useful.

To graph that out, it looks pretty similar to the first example. You're given a game, you find the players in that game, and then you do a reverse lookup to find the specific actions where those players were the actor. So it's pretty straightforward.
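The custom player_actions stream can be sketched standalone, outside the ORM. The data and names here are illustrative:

```python
# A tiny action log and a game roster; in the real app these are
# database rows reached through object IDs and content types.
actions = [
    {"actor": "crazyface", "verb": "selected", "target": "Demo Game"},
    {"actor": "lurker", "verb": "commented", "target": "Blog"},
    {"actor": "zombieking", "verb": "tagged", "target": "Demo Game"},
]
games = {"demo-game": {"players": {"crazyface", "zombieking"}}}

def player_actions(game_slug):
    """Custom stream: all actions where this game's players were the actor.
    Taking a slug (rather than an instance) is what later lets URL
    parameters feed straight into the stream."""
    players = games[game_slug]["players"]
    return [a for a in actions if a["actor"] in players]
```

The by-slug variant matters later because a URL parameter can be passed straight through to a function like this.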
The generic foreign keys work from the player objects to the actions and keep track of the relationship that way.

So the first way to put this on your site is using template tags. The activity_stream template tag is the most helpful: its first argument is the name of the stream you're interested in, like the actor stream, followed by any arbitrary objects you wanna pass in through that tag. It returns a stream object in context, which you can iterate over and display however you want. There's a helper built in called display_action, which is just an inclusion tag that renders a specific template that you can override in your project, but you can also display actions however you want on your site. It also works for custom streams: you just give it the name of your custom stream, pass in a game instance, and you're golden.

The second overall thing the template tags give you is the ability to create follow buttons. This guy is essentially a link that's a toggle. If you're not following this person, it gives you a link to follow them; if you are following them, it gives you a link to unfollow them. It's like the toggle you see on GitHub or other social media sites.

The second way to get information out of the app is through feeds, the user feeds first and foremost. Any of these support either Atom or JSON. For the user feed, as long as you're authenticated and you go to that URL, it returns the user feed in a machine-readable format. The object feed does a lookup for a specific object by content type and object ID, and then returns you a stream of actions where it participated in any relationship. The model feed, just like I showed you, does everything based on the content type. Interestingly, you can also add custom JSON feeds like this.
Like I showed you with the game slug, the second custom stream from before, you can actually pass URL parameters directly into your streams and render things out through a custom JSON feed that way.

In implementing this, I ran into a couple of database considerations that are important to note. If you were to write this thing naively in a template, you'd essentially get one query to fetch all your actions, and then as you iterated you would do a hit to the database for every actor, every target, and every action object, and this gets incredibly expensive after a while. Luckily, in Django 1.4 and newer, you have prefetch_related, which essentially combines those so you get O(C) queries, where C is the number of content types overall. It drastically reduces the number of database queries. It makes the queries a bit beefier, and then Django gets that information back from the database and shuffles everything together in Python to give you your final queryset. django-activity-stream uses this under the hood; you don't have to worry about it, but you can extend it further if you'd like.

Also, since we're dealing with generic foreign keys, there are some limitations. The aggregation and annotation API of Django will not work with generic foreign keys, as far as I've found. So this guy right down here, which tries to find a count of actors whose health is greater than five, just will not work. Unfortunately, this leaves out a lot of the interesting things, like recommended content and most popular, a lot of the more interesting queries that you'd like to get a handle on. It sort of falls short, and you have to write a lot of ugly SQL to get things the way you want.

So there's a better way to do it, and that's what we've been using at National Geographic: the Horizon service. I'm gonna toss this back to Ben to tell you a bit more about it. Hello again.
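The batching idea behind prefetch_related can be sketched in plain Python: group the actor/target/object references by content type, do one bulk lookup per type, then stitch the results back together. The fake_db here stands in for real database queries:

```python
from collections import defaultdict

# Pretend object store, keyed by (content_type, object_id).
fake_db = {
    ("user", 1): {"name": "crazyface"},
    ("user", 2): {"name": "zombieking"},
    ("game", 1): {"name": "Demo Game"},
}
queries_run = 0

def bulk_fetch(content_type, ids):
    """One 'query' fetches every needed object of a single type."""
    global queries_run
    queries_run += 1
    return {i: fake_db[(content_type, i)] for i in ids}

# References collected from a page of actions; the naive approach
# would do one lookup per entry (the N+1 problem).
refs = [("user", 1), ("game", 1), ("user", 2), ("user", 1)]

by_type = defaultdict(set)
for ctype, oid in refs:
    by_type[ctype].add(oid)

# O(C) queries, where C is the number of content types, then the
# results are shuffled back together in Python.
cache = {ctype: bulk_fetch(ctype, ids) for ctype, ids in by_type.items()}
resolved = [cache[ctype][oid] for ctype, oid in refs]
```

Four references, two content types, two queries instead of four.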
So, a little bit of background about the Horizon service. Like Justin said, if you're building an app which just needs an activity stream, his project is awesome for that. We have a large ecosystem; we don't really have control over the models that exist within the different sites that we have, and so we needed a solution that was able to deal with this type of stuff as a service. We have a great product owner who basically told us, "There are three different implementations of favoriting right now. Build me one."

So we built a service, and it follows the activity stream spec, or tries to; there are some limitations from some of the solutions that we chose. It does a half-and-half mix of real-time versus pre-compute, and what's important to know is that if you wanna use this, your models must have an API. We'll talk more about why in a minute, but that is a requirement. There's a clear separation between the front end and back end modules that come together in this, and it's open source.

So, a little electrical circuit for you of the Horizon ecosystem. I'm gonna dive into those pieces and we'll talk about them. But first, storage considerations. The first thing we looked at was: what do we store this in? A graph database really is perfectly suited for these types of things, and for making interesting queries. It's optimized for large traversals, it's really good at storing the relationships, and we can look at individual slices based on different things, which I'll talk about in terms of the choices that we made.

So Neo4j was the database we chose for this. We looked at others, but eventually chose Neo4j. It works with TinkerPop, which is a Java framework for property graphs; Titan, another graph database, also uses the same thing. A property graph is essentially a graph database that allows you to have properties on both nodes and edges.
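A property graph in miniature, as a plain-Python sketch: both nodes and edges carry a map of properties. This is purely illustrative; real Neo4j access goes through its own query layer, not dicts like these:

```python
# Nodes and edges each carry arbitrary key/value properties.
nodes = {
    "n1": {"type": "auth_user", "aid": 1},
    "n2": {"type": "youtube_video", "aid": 9},
}
edges = [
    # (source, destination, edge properties)
    ("n1", "n2", {"type": "favorited", "created": "2013-07-04"}),
]

def neighbors(node_id, edge_type):
    """Follow outgoing edges of a given type from a node."""
    return [dst for src, dst, props in edges
            if src == node_id and props["type"] == edge_type]
```

Being able to filter on edge properties like this is what makes those "individual slices" of the graph cheap to pull out.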
Underneath the hood, it implements a doubly-linked list as its data structure for relationships, and nodes are just pointers to their first relationship; that's how it traverses. In terms of complexity, indexing and search are a little bit costly because of that doubly-linked list: O(N), where N is the number of edges, while insert and delete are constant time. But Neo4j actually uses Lucene on top, so when we start talking about indexing and search, it gets a much better result.

But what do we actually store? We don't want to store your entire model; we can't actually store your entire model. Our situation is one where there are multiple models: we could have a photo on one site described completely differently from a photo on another site. In trying to solve that, and following best practices for something like Neo4j, we decided on five basic properties for nodes. Those are: an API route (if you remember, I talked about you needing an API route for your models); an AID, your application ID; a type, of the form app label underscore model name, say youtube_video; and created and updated timestamps. A node could be an actor, an object, a target, or anything else in your graph, and that's the only thing that we store. For edges it's a little bit different. Neo4j has a native type, so that could be followed, favorited, liked, watched, and so on, plus created and updated timestamps.

We also use Redis, for sockets, sessions, caching, and some stream data. It's really good for that type of stuff, and there's not much to say except that it's an excellent database.

And then, back to this part: I'm gonna talk a little bit about the pre-compute cycle. We said that you have to make a choice between real-time and pre-compute, and that we chose half and half. We use Apache Storm; I don't know if you're familiar with it, but it's a really great product. It's like Hadoop, but for real-time message processing.
It's not a dependency but a recommendation. The way Storm works is that it has these topologies. Storm in general is a distributed processing framework that's really good at processing messages. It has topologies that describe a set of processes; those are called bolts, and you can have multiple topologies. You upload a topology by just uploading a jar file into the Storm cluster, and then it runs those things. Internally it can run not only Java but Python, JavaScript, C#, whatever you want. Communication with it is done via a message queue; we use Kafka, or RabbitMQ for certain things.

Storm topologies are really good because we had a problem to solve: one part of our company might want larger weights on videos, and another part might want larger weights on articles. In order to do that type of computation, we can create many different Storm topologies that define different processes to eventually get us the data that we want for each one of these streams. All of that is eventually dumped into Redis. And I think I said this, but I'll say it again: this is not a dependency but a recommendation. You could do everything on the fly in the Horizon ecosystem.

So Horizon itself is built on Node and Sails, which is an MVC framework, and I'm gonna dive a little bit into its API. It supports multiple content models with the use of simple namespacing, which is the app label plus model name, and allows access to activities from different viewpoints. You can look at things from the actor viewpoint or from an object viewpoint (that'll make a little bit more sense in a second), and then target, and so on. Here's an example call for you.
So we're at version one of the API, and if you go to object, YouTube, video one, and then favorited, essentially you're gonna get every activity of the type favorited that has been done on YouTube video one. We're looking at that from the direction of the object: I'm asking, what has been done to me, the object? On the right you can see that we follow the spec. You see the little data parameter there; that's actually an artifact of Neo4j which we're working to overcome, but we do try to follow the spec to a large degree.

Here are some more examples. If I did actor, auth, user one, favorited, YouTube video (and if you go to that YouTube video, you'll find something funny), that would return a specific activity as described by the spec. And if you went to the object, the YouTube video with that ID, favorited, auth user, I would see all the activities done to an object by a specific type of user. That's useful for accounts, for instance.

The thing to remember here is that direction actually matters when dealing with graph databases. If I look at things from the point of view of an actor, it's not the same as looking at them from the point of view of an object; I'm asking different questions. Within the graph database, edges have direction. So if I'm asking, "I'm an object, what have I done?", I'll probably get nothing, because most YouTube videos can't like things. But if I'm an actor and I'm looking from that direction, then I'll get some results.

Posting is really easy: API v1, activity. Basically the payload looks more or less like the data that we store, and then we do manipulations on top of that for the pre-computed stuff. And you can even do complex stuff. We created a controller called the proxy controller; there's also a reverse proxy controller. This facilitates stuff like follow. So in this graph example, you have a proxy that could be you.
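A sketch of how a call like that can be interpreted: the viewpoint segment of the URL picks which end of the directed edge you match against. The URL shape and data here are illustrative, not Horizon's actual routing code:

```python
# Activities as directed edges: actor -> verb -> object.
activities = [
    {"actor": "auth_user_1", "verb": "favorited", "object": "youtube_video_1"},
    {"actor": "auth_user_2", "verb": "favorited", "object": "youtube_video_1"},
]

def query(path):
    """Interpret /api/v1/<viewpoint>/<model>/<id>/<verb>."""
    _, _, viewpoint, model, oid, verb = path.strip("/").split("/")
    node = f"{model}_{oid}"
    # Direction matters: "object" matches the edge's destination,
    # anything else (e.g. "actor") matches its source.
    key = "object" if viewpoint == "object" else "actor"
    return [a for a in activities if a[key] == node and a["verb"] == verb]
```

Asking from the object viewpoint returns both favorites; asking the same video from the actor viewpoint returns nothing, because videos don't favorite things.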
Let's say the proxy verb that I do is followed, and whoever I followed could be actors or objects or whatever. What I'm gonna get back is a list of all the activities that those actors have done on objects. The return result would look pretty much the same as what I showed you earlier, but in this case I'd have multiple actors doing the activity.

So what problems do we run into with something like this? First of all, we have no control over external data. If a photo changed its title, I don't know about it. So you really need to live in an ecosystem that allows you to get that data back and to inform you of such changes. The second problem is that graph databases don't really have a great ecosystem yet, so adapters are not that great; there's no real graph ORM, and I know because I wrote some of those adapters. And then there's front end versus back end computation. I heard that I missed a great talk this morning about where to do the computation, and that's a real consideration. Do you do the entire stream processing on the back end? Do you send it up to the front end to do some of the computation?

That leads me to the next part, which is gonna be Farhan. Farhan worked on every part of this ecosystem, but he's gonna talk about the front end modules and how they relate to it.

Thanks, Ben. So what are the client-side modules? You see there's the stream and the snippet; those are just the names we call them, and they're standard front end technologies: HTML, JavaScript, CSS. They allow you to communicate with the Horizon service by sending actions and by consuming actions. So let's dive in.

The first thing we're gonna talk about is the snippet. It's like a like button, a very configurable like button. The snippet is responsible for representing a specific verb that an actor can take on a specific object.
It's also responsible for displaying some state about the activity on a specific object, like counts. Here are some representations of those. It's also an open source project; please check it out. And this entire thing is built with just vanilla JS.

So say I have a web application, and say I have an awesome picture of me in LymphartBird, and I want other users to tell me, hey, let me know if you like this picture. On my blog or my web application, I can just add this div, and the snippet would appear, more or less, up here. You can see the div is pretty standard; we're using standard HTML data attributes. There's an object type, which is again the app label underscore model name; the AID, which is the application ID, so whichever application is storing this photo; the object endpoint, which we'll come back to later, that's really important; and then the data verb. As you can see, a snippet kind of maps to an object.

The snippet and the stream have this concept of context. Since the snippet represents an action you can take on an object, you need to ask the question: who's taking the action? A lot of times it's gonna be a user, an actor of some kind. Most likely it'll be a user, but it could really be anything.

And we mentioned it was highly configurable. The data verb attribute actually maps to a template, so you can really easily customize how each verb looks, and even the business logic that encapsulates it. It's really easy to create your own verbs: at the very bottom, say your application needed a new verb, "pipered," whatever that means for your application. You can very easily add an attribute, make sure you have a template associated with it, and then you'll get a custom verb.

So let's look at how this all works, what the request-response cycle looks like.
So let's say I have the snippet, it has a count of eight, and the user clicks the snippet. This sends a POST request to the same endpoint, and this is basically the payload. Again, we try to be really consistent, so it's AID, API, type (app name underscore model name). The Horizon service returns an OK, the snippet is updated with the new state, the heart is filled in, and the count is there.

So I just showed you one example of one snippet on a page, and you might ask: can there be multiple snippets on the page? Can multiple snippets point at the same object? Yes, they can. We had to solve a lot of these issues because of the variety at National Geographic; we didn't design all the front end web pages, so we needed a way for these to talk easily to each other and communicate. So yes, one object can be represented by multiple snippets, and multiple snippets can represent multiple objects on the same page, and they all work fine. So that's pretty much the snippet.

We're gonna talk about the stream now. If the snippet was actions you can take, the stream is consuming. It's the news feed, a highly configurable news feed. It displays activities based on an actor. Unlike the snippet, which is built in vanilla JS, this is built with Backbone. And again, it's open source.

So let me show you what the stream looks like. Right now you're viewing the stream of Lucas Serven, who's one of the developers at National Geographic, and you can see the structure of the stream: Lucas Serven, the actor; favorited, the verb; the article "Digging Utah's Dinosaurs"; and NGM is the target in that case, on some application. He liked a lot of things on July 2nd. I'm not sure why, but I think he was in a good mood.

So let's go back to the example of this mock-up, my web application. Let's say some user has favorited a photo, and now the stream is showing you what they've done.
This stream is configured to show all of this particular user's activities: he favorited a photo, and the most popular activity, which is "Django ate some child."

Let's get into the request-response cycle here, because it's a bit more complicated than just posting or sending a delete request. The module actually uses a WebSocket connection with the Horizon service, so we have bi-directional communication. The module can ask, hey, give me all my activities, or give me all the activities of people I follow, and we get that back as a payload from the Horizon service.

Now this is where the API endpoints really come into play. Again, we're not storing images, we're not storing model data. We're actually just storing an API endpoint that we call out to. So the module will call out to all these external applications that your models live on and ask them: hey, describe this image. So it's really important to see why you need these API endpoints, and this is why the snippets and this module can live in multiple web applications: they don't really need to know anything else, they just communicate through this mechanism.

And when we get this response back, we cache all of it in local storage, so you don't have to be constantly making these calls. If you reload the page the module's on, you're not gonna have to constantly make all these calls out to external services; we cache all that locally.

This actually brings up a ton of problems and interesting considerations that we wanna stress if you wanna start using any of this software. One of them is that the API is really chatty; the module's really chatty. You're supporting an ecosystem where there are multiple web applications, so be aware of that. You're gonna be making lots of calls out, and you might be making some CORS calls, so just be aware when architecting that out.
And because we're relying on all these external applications, how do you deal with failure? What do you do when the other application fails? What do you display on the stream? Do you display some cached results? These are all really important considerations to think about.

I'm just gonna go over a few more. Ben had mentioned this: the service is not really aware of changes to content. If some photo gets updated, or some video's title gets changed, how do you reflect that back to the service? And if you're already displaying it on the stream, what do you do? How do you change what's being displayed? How do you invalidate the cache? And a really simple example, even sorting and ranking: say I wanna display the most recent popular activities on a stream, and someone generates a new activity. Where does that go? How do you immediately put that at the top? But wait, you're using an algorithm that says, hey, I only wanna show the most popular ones. So how do you navigate this realm? Again, a lot of this comes down to how you invalidate the cache. Basically, the point is that there's a lot of business logic between the service and these front end modules, and what you really need to do is determine what kinds of updates, models, and things you care about, so you can display them to the user. The good thing is that we've built the Horizon service in a way where the back end, the service itself, and the client-side modules are really configurable, so you have a lot of opportunity to tweak what you need.

Is this running live? Yes, it is. You can go check out ngmbata.com. NGM is the online version of the magazine, where you can see all the issues going back to 1888. We encourage you to go on, become a member, and start clicking on and favoriting activities. This is what the stream looks like.
This isn't released yet, they're still working on it, but eventually this will be displayed on the user profile page. So that's pretty much the agenda. Oops. Yeah, that's pretty much the agenda. One more thing to note: all the projects we've spoken about, all four repos, are open source. We definitely take contributions; there's a lot we need to do. We're also building out a Django Horizon app that lets Django speak to the service, because we have so many Django applications at Nat Geo. So that's it. If you have any questions, I guess we'll supply the answers. Well, we'll try.