I've also mentioned the curators we have at this party: Tom and me, DJing. And that's an animated GIF up there, which is fantastic. So, what is an event feed? I feel it's necessary to get the common vocabulary down, because if you search for "events" on Google, you'll find lots of different meanings of the word. We're not talking about event-driven programming, or scheduled events, or anything like that. We're just talking about basically the news feed on GitHub: the pane on the left, the feed of all the stuff that your friends on GitHub are doing, every time they make any change to a repo or whatever.

There's this paper called "Feeding Frenzy: Selectively Materializing Users' Event Feeds," by some people at Yahoo! Research, and they go through how Yahoo! stores their feeds. They have this Yahoo! RSS thing where millions of people subscribe to millions of blogs and other sources. So they talk about how they do it at Yahoo!, which is way larger in scale than GitHub. There's a lot of good stuff in this paper, and there's a link to it at the end.

One of the things they talk about is consumers and producers in any kind of feed system: you have consumers subscribing to sources, to producers that produce the events. The interesting thing, though, is that different apps use consumers and producers differently. In an RSS feed aggregation service, the producers and consumers are two totally different entities. Then in something like Twitter, where you have users following each other, it's kind of the same thing: users are both consumers and producers. That's how GitHub works. Every user is a consumer, and then the repos that they push to are producers. We also have multiple feeds that you have access to.
There's your personal actor feed, which is all the actions that you did personally, and then there's the feed of all the stuff that you're watching. And then we have other feeds for organizations. So anytime someone does something in the GitHub organization, I see it in a special feed with all the other stuff taken out.

That Feeding Frenzy paper also talks about the concept of push versus pull in event feeds, which is the difference in how your event feeds are created. The first event feed that I wrote in Rails was kind of like this. This is a really basic ActiveRecord example where I have some event model. It wasn't called Event; I think it was ActivityLog. There were other plugins back then, like acts_as_audited, things like that. And then you have your producer, in our case a GitHub repository, or something like a project or whatever, and in an after_save callback we just create a record. It's really easy to create one record for an event; it's all normalized, and my database is perfect. And then we had some crazy query like this to get someone's events. You can imagine, if you build permission systems, you have events and repositories, and you have some membership table that connects the users to the repositories, and then we do this crazy nested join to get all the events for all of the repositories that I'm in. So it's really easy to write the event, it's not much code, but then to read it your database has to do a lot of work. This is building the event feed on the fly; this is pull. The event feed is being built on demand as soon as I access it. This works for small-scale event feeds, but when I talked to Chris about when he rolled out event feeds at GitHub, they didn't do this at all, because it falls down really fast.
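The pull model can be sketched in plain Ruby, with the database join simulated in memory. The class and field names here are made up for illustration, not GitHub's actual schema:

```ruby
# Pull ("fan-out on read"): store each event once, then assemble a
# user's feed on demand by walking their memberships.
# Hypothetical names; in Rails this would be the nested join query.
Event      = Struct.new(:repo_id, :action)
Membership = Struct.new(:user_id, :repo_id)

class PullFeed
  def initialize(events, memberships)
    @events      = events
    @memberships = memberships
  end

  # Equivalent of the "crazy nested join": find the user's repos,
  # then collect every event belonging to one of those repos.
  def feed_for(user_id)
    repo_ids = @memberships.select { |m| m.user_id == user_id }
                           .map(&:repo_id)
    @events.select { |e| repo_ids.include?(e.repo_id) }
  end
end
```

Writes are a single row, but every read does all the work, which is exactly why this falls down as feeds grow.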
So at GitHub we do more of a push architecture, where we create the event once and then we basically pre-build everyone's feed. We create the event, an ActiveRecord object or whatever, and then we loop through each of the followers, and we create basically another event record (which I misspelled on the slide). The difference in the two events is that the first one only has the actor field filled out, and the second one has the user field, which is the person that is watching the event. So it's a really simple database layout, and the nice thing is that the events table only has a few indexes on it, on the actor and the user.

But that runs into other issues. This is Ashton Kutcher. He's got like six million followers on Twitter, and every time he does something, Twitter basically has to update six million feeds, which is insane. And Charlie Sheen had the same thing, where he signed on and one day he had like a million followers, which is just crazy. I don't know if this is Ashton Kutcher's real account on GitHub; it's the same name, but he doesn't have the same following. On GitHub we have a whole other problem: the John Resig / Rails problem. John Resig, the creator of the awesome jQuery JavaScript framework, has the most followers of any user on GitHub, and Rails has the most watchers. So if you pushed to Rails, GitHub would be creating like 10,000 events or something. Still very low scale compared to Twitter, but for us it's a lot. So the push architecture kind of falls down because it results in a huge explosion of data.
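The push model can be sketched the same way: one write per follower at publish time, so reads are just a lookup. Again, the names are hypothetical:

```ruby
# Push ("fan-out on write"): when an event happens, copy it into the
# feed of every follower up front, so reads are a single lookup.
class PushFeeds
  def initialize
    # One list per user, created lazily.
    @feeds = Hash.new { |h, k| h[k] = [] }
  end

  # One write per follower: cheap reads, expensive writes. This is
  # the Ashton Kutcher problem: a producer with millions of followers
  # turns one action into millions of feed writes.
  def publish(event, follower_ids)
    follower_ids.each { |id| @feeds[id].unshift(event) }
  end

  def feed_for(user_id)
    @feeds[user_id]
  end
end
```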
This is a quote from the Feeding Frenzy paper talking about Digg when they moved to Cassandra. They basically did the same thing, where they took their normalized database and exploded it out into a Cassandra database, and it went from like tens of gigs to terabytes, which is a huge jump. At GitHub it's not that intense; I think our event database is somewhere in the hundreds of gigabytes. So it's a lot of data. And to keep up with the load, when you have people with thousands of followers and repos with thousands of watchers, we use this plugin called ar-extensions. It basically lets you do bulk inserts with MySQL. So this is a standard insert statement, and then you just basically pass in multiple values. So when we figure out all the followers that a user or a repo has, we build up these big queries and send them off to the database; instead of sending a thousand individual inserts, we might send like a hundred statements or something.

Also, when the GitHub event feed was getting popular, we started adding more caching to it. We just memcached everything, basically. The whole feed, everyone's feed, is cached, and then each individual item is cached. The nice thing about that is a lot of people see the same item, so we cache it once in memcached, and then when building everyone's actual feeds it can reuse that same cache. We recently upgraded the memory in our memcached servers to hundreds of gigabytes, and we haven't seen one memcached eviction in a month, so it's pretty awesome. Everything is in memory.

We also reworked the way our templates were rendered. This is actual code from one of the simpler templates, and it's this nasty ERB. You kind of look at it and you don't really know what it says; there are all these special cases you have to keep in your head. It also uses the payload that we store with each event.
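The point of a bulk-insert plugin like ar-extensions is generating one multi-row INSERT instead of thousands of single-row ones. Here's a rough, hand-rolled sketch of the SQL shape it produces (real code must escape values properly; this skips that, and the table and columns are invented):

```ruby
# Build one INSERT statement with many VALUES tuples, so fanning out
# an event to N followers costs far fewer round trips than N INSERTs.
# NOTE: naive quoting for illustration only. Never do this with
# untrusted input; use the library's own escaping.
def bulk_insert_sql(table, columns, rows)
  values = rows.map do |row|
    "(" + row.map { |v| v.is_a?(String) ? "'#{v}'" : v.to_s }.join(", ") + ")"
  end
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{values.join(', ')}"
end
```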
So we used to store a custom payload for each event. For a follow event, we store the target, which is the person that you follow, and we store just the ID. So then back to the template: there are a few spots where we reference event.target up there, so we have to load the target each time this event is rendered, which is why we want to pre-cache it. The template rendering gets pre-cached as soon as the event model is created, and it gets rendered outside of Rails, just in a Ruby job.

So I changed all that. First I denormalized the payload hash, so we store everything that we need to render the template. The nice thing about this is that the event data doesn't change as the targets change. If you follow a user and they change their name, or they create more public repos, it doesn't invalidate the caches of all the other events: when that event happened, this person was named this and had this many repos. So we like to stuff it all in the event record. It also makes it a lot faster to render; we don't have to do any database hits. Some of the events have multiple related records, and some of them have to make git calls and stuff like that, so we like to cache it when we can.

Then I changed the template into a Mustache template. You look at this and it's much more readable, and there's no logic here at all; we just have basically property names. The other nice thing is that we have all these Mustache views for each event that I can test in Ruby. I don't have to scrape HTML to make sure everything works; I can just hit each method, which improved the tests around the events a lot. I also sped up the rendering. We used this hacked-up ERB setup. Well, ERB itself wasn't hacked; the way we were rendering it inside the model was kind of weird, and it was kind of slow because it wasn't caching the compiled ERB template.
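The idea of a logic-less template over a denormalized payload can be shown with a toy stand-in for Mustache's `{{variable}}` substitution. The real Mustache library does much more, and the payload fields here are invented:

```ruby
# Render a logic-less template against a denormalized payload hash.
# Because the payload captured everything at event-creation time,
# rendering needs no database hits, and later changes to the target
# don't invalidate cached events.
def render(template, payload)
  template.gsub(/\{\{(\w+)\}\}/) { payload[$1.to_sym].to_s }
end
```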
Mustache handles all that, and it worked pretty well.

Stratocaster is my answer to the data explosion problem. So here's Prince rocking out with one. I did a lot of reading on events and how to store them, and I wanted to come up with a simple library we could use to move off of our existing event infrastructure to something new. That became Stratocaster. The first adapter for Stratocaster that I wrote was for Redis. I built Stratocaster to work with basically any adapter; I just defined a simple API, and Redis was the first one that I implemented. If you haven't used Redis, it's this awesome in-memory database with persistence, and it has data structures, like lists and sets and stuff, which is pretty awesome. It's very natural to move over from simple in-memory Ruby code to Redis code, because you're just adding stuff to arrays and adding stuff to hashes and things like that.

So instead of storing the whole event data multiple times for each follower, I decided to just store a list of IDs in Redis. These are the commands to add one event to a list and pull them back out. LPUSH is kind of like unshift in Ruby: you have an array and you add something to the left, the front, of the array. And LRANGE is what we use to get items back; that call right there is getting the first 50 items from the list. With that, we're storing just the IDs for all the repeated events, and not the whole event. In the test that I ran (we actually ran this in production for about a month, side by side with the current event system, counting how many events were going into Redis and how many into the database), about 10 times more events go into the database than need to. Moving over to this will cut that by an order of magnitude.
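The Redis list commands map closely onto Ruby's Array operations, which is why the move feels natural. Here's a pure-Ruby model of the commands involved; with the redis-rb client these would be roughly `redis.lpush(key, id)` and `redis.lrange(key, 0, 49)`:

```ruby
# Array-backed model of the Redis list commands used for a timeline.
class TimelineList
  def initialize
    @items = []
  end

  # LPUSH key id -- prepend, like Array#unshift.
  def lpush(id)
    @items.unshift(id)
  end

  # LRANGE key start stop -- read a slice; stop is inclusive.
  def lrange(start, stop)
    @items[start..stop] || []
  end

  # LTRIM key start stop -- cap the list to the given range.
  def ltrim(start, stop)
    @items = @items[start..stop] || []
  end
end
```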
We'll reduce the event table size to a tenth of what it is currently, which is pretty awesome.

In the real-world test, the first one I ran was for one week and created roughly 18 million rows. This was before Redis, actually; I ran the test with another adapter that dumped everything into MySQL. So we got about 18 million rows, and I took that data and decided to shove it into Redis to see how Redis performed with it. At the time, Redis 2.0 was the latest, and it took about 1.2 gigs in memory, which is about how much MySQL took to store it on disk; it was about one gig on disk. Redis is pretty compact: on disk it was about 80 megs, and then once you load it into memory it blows up to 1.2 gigs, because in memory Redis has all this overhead for keeping the objects ready to be accessed. At the time, Redis 2.2 was in beta, and I ran it through that, and the in-memory footprint was just 200 megs. What happened there was they added some crazy optimizations in Redis 2.2: if you're storing lists of numbers, which is exactly what I was doing, the way they encode the data in memory is a lot more efficient. I don't really understand how it works, but I'm pretty happy with the reduction in memory footprint.

So the thing is, I had to define my data model; I had to look at what exactly the event feed needed. One portion of that was creating the event in some database where it can be accessed by ID. For now we're using MySQL, just because we already have that set up and use it. I thought about trying other databases, Postgres or Mongo or whatever, but we have MySQL, it's on our servers, and our admin guys have all their stuff set up to use it. So that was an easy thing to start with. And then for the feeds, this is the current Ruby API for Stratocaster. It's still a little weird. We haven't released it yet; it's on GitHub if you know where to look, hidden in a toy project that I wrote with no tests. That's how you hide things from Rubyists.
So basically we define the timelines that we want to store, and then we have this key format method that calls a block every time we give Stratocaster a message, an event. We try to build the Redis list key based on the repository ID. And then this is the one for users: we're assuming the event gives us the followers, and we take those users and build a bunch of Redis keys for their timelines.

And here's the code we actually use to store it in Stratocaster. We create our event like normal, in ActiveRecord or whatever. I made Stratocaster pretty database-agnostic; my goal was to support ActiveModel, so as long as you have an ID and some other simple things that all ActiveModel objects have, it'll work. It works with ActiveRecord, and it works with Toystore, which is this awesome model for key-value stores built on ActiveModel. To create a Stratocaster instance, you pass in the possible timeline definitions; then Stratocaster receives the event, goes through those timeline classes, spits out the timelines that match, and inserts that event into those timelines.

The nice thing about this is we can be a lot more flexible about what kinds of timelines we use. With the database setup, I have to index everything in MySQL, and changing those indexes, when you have something like a billion rows in your database, takes forever, so you kind of don't touch it. With this, you can decide, hey, I want a feed for the network view of a repo, and I can just add another Stratocaster timeline object and it'll start being populated. So this is the internals of what the adapters see: the timeline object passes off to the adapters, and they basically see, here's a key and an event, and I want you to store it.
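Since Stratocaster itself is unreleased, here's a hypothetical sketch of what timeline definitions like these might look like: each one turns an incoming event into the Redis keys whose lists should receive the event's ID. All the names and the event shape are assumptions, not the actual API:

```ruby
# Each timeline class knows how to map an event to timeline keys.
class RepoTimeline
  def keys_for(event)
    ["repo:#{event[:repo_id]}"]
  end
end

class UserTimeline
  # Assumes the event carries its follower IDs, as described above.
  def keys_for(event)
    event[:follower_ids].map { |id| "user:#{id}" }
  end
end

# Route one event through every timeline definition and collect the
# keys whose lists should receive it.
def route(event, timelines)
  timelines.flat_map { |t| t.keys_for(event) }
end
```

Adding a new feed (say, for a repo's network view) is then just one more timeline class, with no schema or index changes.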
And then the adapter (this is the Redis adapter) has the actual Redis code to push it onto a list and all that.

So why did I build it like this? When I defined my data model, it brought out some simple requirements that can fit multiple data stores. MySQL and Redis were really easy to pick just because we already had them in our infrastructure, and once we get the feed moved over to this full time, we can look at implementations for other, more experimental data stores at some point. One of the nice things about the data model on GitHub is that there are no real historical views. I don't have to worry about storing all of your events for all time and making them easily accessible. So Redis works well, because we can trim the lists to something like 300 items, which is more than enough for most feeds, and that keeps the memory footprint low. We're not storing all of the event data for all time, just the most recent events.

Also, it's helpful to limit the scope of Stratocaster. When we were building this, we were thinking, oh, it'd be cool if we could do all this other event timeline stuff, machine learning and things like that. Limiting the scope meant we could build out Stratocaster really fast and keep it simple. It has no dependencies; you can just get in there and know exactly what's going on. And that lets you iterate really fast. I've written Stratocaster four times now, in Ruby and Node.js. Every time I write it, it works out, and then after a couple weeks I don't like it anymore and I rewrite it. The most recent rewrite came after Toystore came out. I saw how Toystore had adapters for the key-value stores, and I really liked that idea.
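A hypothetical adapter along those lines, in-memory rather than Redis, storing only event IDs and trimming each list to 300 entries as described (the method names are invented, not Stratocaster's real adapter API):

```ruby
# Adapter contract: given a timeline key and an event, store only
# the event's ID, and keep just the most recent entries since the
# feed has no historical view.
class MemoryAdapter
  LIMIT = 300

  def initialize
    @lists = Hash.new { |h, k| h[k] = [] }
  end

  def insert(key, event)
    list = @lists[key]
    list.unshift(event[:id])          # like LPUSH key id
    list.pop while list.size > LIMIT  # like LTRIM key 0 299
  end

  def ids(key, count = 50)
    @lists[key].first(count)          # like LRANGE key 0 count-1
  end
end
```

Pairing this with a row store keyed by ID (MySQL, in GitHub's case) means the full event is written once, and every timeline holds only small integer lists.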
It was basically doing half of what Stratocaster was doing anyway, so I rewrote Stratocaster a third or fourth time, took that part out, and it cut the amount of code in Stratocaster down a lot.

So here are some references. I'll be posting these slides on Twitter soon, later today probably. These are some of the blog posts and papers. There's the Cassandra one from Digg, talking about how they moved to Cassandra, where they didn't call it a timeline. That's why it's hard to find these articles: everyone is solving similar problems, but they call them totally different things. The FriendFeed one has nothing to do with timelines necessarily, but it's a really interesting look at how they built their own secondary indexes in MySQL for their giant tables. And this last one, using Riak at Yammer, is a talk I saw just Tuesday by Coda Hale, and it kind of blew my mind. He's doing something similar, a lot more advanced than what we're doing on the event feed, using Riak, which is an awesome distributed database. It's just a totally different way of looking at data stores, so I highly recommend checking that out.

This could be the best animated GIF ever. I didn't make this one though, unfortunately. Questions?

Oh yeah, Erin asked how I got so good at making animated GIFs. Lots and lots of practice. The art of animated GIFs almost died; it was all about YouTube and whatever. And then Campfire brought it all back. You can just drop images into Campfire chat, and now we're all obsessed with animated GIFs. Sorry.

Are you planning to release Stratocaster? I think so. Sorry, yes, the question was whether I'm planning to release Stratocaster. I think so. I want to get the API where I like it, and I want to make sure it actually works well. I've been using it, but I want to put it through its paces. It's also not really much code.
I imagine people will look at it and be like, well, that's it? That was my question. Yeah, well, I can show you. So this is All of the Stars. This is a hidden project on GitHub, which I didn't think anyone would find, and it has Stratocaster, the current version as of about a week ago, in vendor. All of the Stars is a simple app that I wrote to basically catalog the stars from different apps: Twitter has favorites, Campfire has stars, I think Instagram has likes, and I just want to aggregate all of that. And I'm storing that in Riak, which really has nothing to do with this talk. But as I was finishing up this talk, I decided I wanted some example that I could show.

So here are a couple of Stratocaster feeds that operate on tweets, on favorites. This one is using the twitter-text Ruby extraction library, which extracts out user names, mentions, and hashtags. The keys are basically all scoped by cluster ID, so it builds a key with the cluster ID, "hashtag", and then the actual hashtag. And then we have a screen-name one that pulls out all the mentions. And then when I actually use that... okay, so I'll just create a new object. Here's my tweet. We're going to go ahead and store the object. Star is a Toystore object, so this long ID here is where it's stored in Riak, a key-value store. Then I build up the Stratocaster instance and call receive with that Star I just wrote, and off it goes. All right, so we need Redis running. And there we go. So we have all the timelines that it built up. All the stars have a timeline for each type, so I can look at all the stars by type, and then it created timelines for the hashtag (Ruby on Ales) and the screen name. And then I can query Stratocaster, passing in the values needed to build that key. So that's our feed, and then if I want to get the first page, I just call page on it.
And there's my star ID. So once I have a list of star IDs, of event IDs, then I can create instances like that and render them on the page. That's basically it. I'm not at all happy with the API for creating these timelines; it's a little rough, but we'll get there. Anything else? Can we go back to the angry guy? Yeah, he's... wrestling. I forget who he is. Alright, that's it. Thank you.

I have one question. Of these data stores, the ones that you've been talking about, are there any that stick out to you as particularly promising, or are they all kind of the same? They're all similar. Redis I really like; Redis is very different from all the others. It's very stable, and it's in memory, so it's super fast. Riak is a really good one too; it's a key-value store, though it doesn't really have any concept of lists and timelines and stuff.

Redis is an in-memory database; I don't know how you guys use it operationally with actually syncing to disk. Do you still use MySQL as the primary store for individual events and just Redis for the timelines? Yeah, so the question was about Redis. Redis is an in-memory database, so it's not the best store for the sole source of your data, and the question was whether we use MySQL as the main store for the events. Yeah, we store the events in MySQL and just the IDs in Redis. Redis is super stable, though, and you can tweak how often it flushes to disk. I think right now it flushes every minute, and we just tweak that. Do you do that instead of the append-only file? Yeah, I think so.