This is George Gilbert. We're at the San Jose Convention Center at the @Scale conference. I'm with Rodrigo Schmidt from Instagram, and Rodrigo has both the very big picture and the down-and-dirty architectural details about the design and implementation of the very, very scalable stuff that Instagram is using. So Rodrigo, start by telling us, for those who don't know the latest Instagram experience, what are some of the features that your systems have had to support as the user experience has evolved?

There are many parts of the product that are powered by our data systems. You can think of the whole Explore tab as one thing that is powered by Instagram's data systems. The newly launched trending hashtags and trending places are products that are again powered by data that we collect in real time; we process it, we identify the trends, we rank them, and then we present them to the users. That goes all the way into analytics itself: we have a number of analysts who are continuously looking at how Instagram is growing, the health of our growth, and presenting us with statistics on Instagram usage and so on. And all of that is powered by the systems that we build on Instagram data.

Okay, so it sounds like the site data is feeding in, as you described to me earlier, as a stream that different systems consume. So tell me, for the trending feature, where is the data that feeds it coming from, and then what type of analytics do you have to do, and in what timeframe, to make that work?

Sure. Imagine somebody is using Instagram on their phone right now. As they, say, upload a photo or like a photo, those requests go to our servers, and our servers see that there's something going on: this person just liked this photo. That generates an action, and that action goes into a stream of actions in real time. We're using a system called Scribe for that.
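The flow Rodrigo describes here, where a user action on a web server becomes an event appended to a shared stream, can be sketched roughly as follows. This is a minimal in-process sketch, assuming a Scribe-like (category, message) logging model; the `ActionStream` class and the event fields are hypothetical stand-ins, not Scribe's actual API:

```python
import json
import time
from collections import deque


class ActionStream:
    """Toy stand-in for a Scribe-like append-only stream of action events."""

    def __init__(self):
        # In a real deployment this log would be durable and distributed.
        self._log = deque()

    def append(self, category, event):
        # Scribe-style logging takes (category, message); here the message is JSON.
        self._log.append((category, json.dumps(event)))

    def read_all(self, category):
        # Downstream consumers tail the stream and decode each message.
        return [json.loads(msg) for cat, msg in self._log if cat == category]


# A web server handling a "like" request would emit one action event:
stream = ActionStream()
stream.append("instagram_actions", {
    "action": "like",
    "user_id": 42,
    "media_id": 1001,
    "hashtags": ["sunset"],
    "ts": time.time(),
})

events = stream.read_all("instagram_actions")
```

Any number of consumers (warehouse loaders, trend counters) can then read the same stream independently.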
So each web server connects to Scribe and says, this is the activity I'm receiving from multiple users around the world. And that Scribe stream is open for any other system within Instagram to plug into, read from, and start processing. For example, for our data warehousing, for offline processing, we have readers that read from the Scribe stream and write that data directly into Hive, so that our analysts can look into and process the data afterwards. It goes all the way to, for example, the trending system, which plugs into that stream in real time and starts taking that data, processing it, counting how much activity we are getting for each hashtag and each place, aggregating that data in real time, and serving it back to other systems that then rank it and surface it to the users.

Okay, so let's drill into that last point. If the audience is leading-edge IT practitioners and they want that sort of interactive, real-time analytic feedback, like trending, and to be able to surface it on their sites, what is some of the machinery they should consider for consuming the stream, performing the analytics in real time, and then surfacing the results?

Yeah, it depends a lot on the scale of the system that you're building. At Instagram we have a bigger challenge just because we have hundreds of millions of users on a monthly basis, and even on a daily basis, so we need a much more powerful kind of system. We have a distributed system that runs in real time, and then you have to think about how you want to distribute that data, how to make sure it's processed in real time in a distributed way. But if you are building something at a much smaller scale, even a simple system with a single machine that aggregates the data in real time, in memory, should suffice.
We actually tried to do something like that in the beginning, and it's amazing how much you can do with a single machine these days.

Okay, so tell us, what are some products that would work on a beefy single machine, and then what did you guys have to do that mainstream people might have to move to in the future, when their volumes are so much greater?

Yeah, so what I would say, and it's kind of what we did, is that we always tried the simplest solution first, and I am a strong advocate for that, because then you can iterate really fast on your first prototype, see where your problems and pain points are, and build on top of that. So you can start with that. If you don't have many users, if you have just a few million users or you're just starting, I think having a single-machine system that just reads from Scribe, processes it, and outputs the trends should suffice.

Would that be a stream processor, or would it be an in-memory SQL database or a key-value store?

Just a simple stream processor, something that reads from Scribe; it's very easy to implement. If you want to get into something more complicated, then you can start using, say, an in-memory key-value store, which might help with a few things. Then if you want to distribute, we actually ended up building our own system for trending, but that was because of our scale: we couldn't find anything in the market that would fit our needs. But these days there are so many open-source tools you can use, things like Storm are out there, and Scribe itself is open source and very useful. It's impressive how much you can do with something as simple as Scribe.

Okay, we just have two minutes. Tell us, let's switch gears from the real-time to the batch analytics. What does that pipeline look like?
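Rodrigo's simple single-machine processor has to turn raw counts into "trends" somehow. One common approach, which is an assumption here rather than a description of Instagram's actual ranking, is to score each tag by how much its recent activity exceeds its historical baseline:

```python
def trend_scores(recent_counts, baseline_counts, smoothing=5.0):
    """Score each tag by recent activity relative to its baseline.

    recent_counts and baseline_counts map tag -> count; `smoothing` damps
    tags with tiny baselines so a one-off spike on a rare tag doesn't
    automatically dominate the ranking.
    """
    scores = {}
    for tag, recent in recent_counts.items():
        baseline = baseline_counts.get(tag, 0)
        scores[tag] = recent / (baseline + smoothing)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


recent = {"#earthquake": 40, "#sunset": 100}
baseline = {"#earthquake": 0, "#sunset": 95}
ranked = trend_scores(recent, baseline)
# "#earthquake" outranks "#sunset": 40/(0+5) = 8.0 vs 100/(95+5) = 1.0,
# because a sudden burst beats steady high volume.
```

The point of the ratio is exactly the distinction Rodrigo draws: trending is about change in activity, not absolute popularity.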
What are you trying to solve? What analytic problems?

So there are many things that we do with batch processing these days. The same stream that generates these actions, which can be accessed in real time, also feeds directly into Hive. So that gets stored on a daily basis in our data warehouse systems and can be used for analysis. Our analysts get information about the activity: how many photos are being uploaded, how many users are accessing Instagram on a daily basis. They can generate all sorts of second-order statistics on top of that, all the way into other systems that can use that data too. You can think about the ranking for search, or the ranking for suggested accounts on Instagram, which are products we also build on Instagram data. Those come out of batch-processing pipelines that run on the data we accumulate on a daily basis.

So those would be fed back into whatever the operational database would be?

Exactly. So let's say on the...

What do you use for the operational database?

We use some integrated technology, some in-memory database systems that we have at Facebook. What it does is, on a daily basis, you take the data that comes from the stream and gets into Hive, you process the data in Hive, so there are many pipelines that run, and you generate some data that you then want to get back to the user. That gets exported into some distributed in-memory databases that we can access from our systems, with the data from yesterday. So that serves for a whole day.

So the analytics were done sort of in batch mode, but you want to serve them up in real time?

Yeah, you can think of it that way, exactly.

Thanks for watching. This is George Gilbert.
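The daily batch flow Rodrigo describes, aggregate yesterday's events, then export the results into a fast store that serves reads for the whole next day, can be sketched like this. Plain dicts stand in for the Hive tables and the distributed in-memory serving database; all names and fields are illustrative:

```python
def daily_aggregate(events):
    """Batch job: compute daily stats from one day's raw action events.

    Stands in for the Hive pipelines that run over the day's data.
    """
    uploads = sum(1 for e in events if e["action"] == "upload")
    active_likers = {e["user_id"] for e in events if e["action"] == "like"}
    return {
        "photos_uploaded": uploads,
        "daily_active_likers": len(active_likers),
    }


def export_to_serving_store(store, date, stats):
    """Export step: push the day's aggregates into the read-side store."""
    store[date] = stats


# One day's worth of raw events from the stream:
events = [
    {"action": "upload", "user_id": 1},
    {"action": "like",   "user_id": 2},
    {"action": "like",   "user_id": 2},
    {"action": "like",   "user_id": 3},
]

serving_store = {}
export_to_serving_store(serving_store, "day-1", daily_aggregate(events))
# Online systems then read serving_store["day-1"] for the whole next day,
# until the next batch run replaces it.
```

This separation, heavy aggregation offline and cheap key lookups online, is the batch-computed, real-time-served pattern the closing exchange describes.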