This is what I get for using f.lux. I don't know if anybody else uses that. I think it still thinks I'm in the United States, so it decided it was going to tone everything down a little bit. You should be sleeping right now. All right, so let's just jump into it.

So, a little bit about Tapjoy. Yes, these first two slides are recycled, I apologize, but real quick: what is Tapjoy? We're a global app tech company focused on providing different techniques and tools for developers to power their mobile applications, whether that's monetization, analytics, user acquisition, or user retention. We're basically an SDK and a library that people can integrate into their apps, and we help power and monetize them. We have over 450 million monthly users across 270,000 apps all over the world. We're pretty massive at this point, which is pretty great.

A little bit of technical detail. As I mentioned yesterday, we were an early AWS adopter and still use AWS to this day. We grew predominantly on AWS up until the last couple of years. We had over 1,100 active VMs in our AWS infrastructure as of last month, and that number continues to go up month over month as we add more things. We have active regions in Asia, Europe, and North America, and we currently service over one trillion requests annually, which is a pretty cool number, because that's more than 60% more than when I joined the company a little over two and a half years ago. So that's pretty exciting.

So, diving into a bit more of the technical side of it. Our tech philosophy is very, very compute driven. I remember my first meeting with MetaCloud when they came up to our offices last September: we were going over what our infrastructure was going to look like, and we basically just said, no, no, no, we just need compute nodes. That's it. We just want compute. We're very much driven by what we previously used, which was full-on EC2 basically everywhere. We tend to want to operate our own infrastructure. So we want to operate our own systems, but we don't necessarily always want to build them from scratch.

We have this general philosophy that we don't want anything to be scary if it goes away. We always plan for nodes to terminate at any time, and to always have redundancy and resiliency in the event that one of those nodes does terminate. That falls into this idea that all nodes are ephemeral. If a node goes away, the data on its local disks is lost; it's no longer available, but it's already replicated somewhere else. The data is always distributed. And so failure in that mode is always tolerated. A node disappearing basically means nothing. The system automatically handles the fact that a node is not available, the data is either repaired or replicated, and you're back to basically where you were before the node went away. So you can either terminate a node if something looks like it's misbehaving, or you can take it out of service; it really doesn't matter. In the middle of the night, if we have a misbehaving database node, we more or less just shoot it in the head and it goes away. That's just the way it works.
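To make that concrete, here is a rough sketch of the shape of that "shoot it in the head" idea. This is not our actual tooling: it uses boto3 purely for illustration, the health check is a stand-in, and the instance ID is made up.

```python
# Sketch only: if a node looks unhealthy, terminate it and let autoscaling
# (plus application-level replication) take care of the rest.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def looks_misbehaving(instance_id: str) -> bool:
    """Treat anything the EC2 status checks flag as not 'ok' as misbehaving.
    In practice this would be driven by our own application metrics."""
    resp = ec2.describe_instance_status(InstanceIds=[instance_id])
    statuses = resp.get("InstanceStatuses", [])
    return any(s["InstanceStatus"]["Status"] != "ok" for s in statuses)

def shoot_in_the_head(instance_id: str) -> None:
    # No drive swapping, no remounting volumes: the data is already replicated
    # at the application layer, so losing this node should be a non-event.
    ec2.terminate_instances(InstanceIds=[instance_id])

for iid in ["i-0123456789abcdef0"]:  # hypothetical instance ID
    if looks_misbehaving(iid):
        shoot_in_the_head(iid)
```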
So I want to walk you through a couple of the services that we use inside of AWS, because I think it's illustrative of our methodology and how we approach infrastructure.

So, we use SQS quite heavily. We find it to be simple, inexpensive, and durable. That being said, we're actually building our own new internal system which is influenced by SQS. I actually sent out a tweet last night, so if anybody in the room knows anybody from the Zaqar team that's working on the queuing stuff for OpenStack, I'd love to meet with them. I think we actually have a useful addition to the project and we'd like to contribute it back. We basically avoid lock-in. As I mentioned yesterday, we have a couple of open source projects, and Chore is one of them. It's a no-lock-in, pluggable-backend queuing system that we use, so we can swap SQS out for basically any other backend that we define at any time. So we're not locked into SQS; we just use it because of the durability and the simplicity it provides.

We use RDS to manage our MySQL instances. To be perfectly honest, it's just because we don't care to manage MySQL instances ourselves. We don't really use MySQL for all that much. We're not a heavily transactional database shop, so for the most part the majority of our workload is not against MySQL. It's a fairly simple system and RDS has served us well. There's no lock-in, it's simple, it's easy, so we use it. If we ever needed to move off it, we could; we just choose not to.

We use CloudWatch, but we also use a bunch of other tools as well. In addition to that, we're pretty heavy users of Librato, which is a great, sort of StatsD-compatible monitoring solution. But we actually also funnel all of our CloudWatch metrics into StatsD. Also, CloudWatch is free unless you pay for more of it.

We also use the ELB, but we use it in a very simplistic way, as we do with all of these things. We only use the ELB for SSL termination, just because they have really good tools and techniques around handling SSL. Behind the scenes we actually have our own routing and our own load balancing layers as well. The instances that are actually attached to the ELB are static; they never change. It's just a set of HAProxy instances and Nginx servers. So for the most part, we don't use ELB like most people do.

We do use autoscaling quite heavily, so we tie it into different metrics to spin up and down our instances inside of Amazon. Our traffic fluctuates pretty dramatically over the course of the day, about a 30% peak-to-valley flux, so we leverage autoscaling to basically save money.

We also use S3. We're a pretty heavy user of, well, I guess relatively speaking, I'm sure Dropbox is a heavy user of S3; we're a user of S3. We've got north of a petabyte of data sitting in S3. We basically store everything we can. We have every single request over the entire history of Tapjoy stored in S3, so we can always look back at any point in time and see a request that flowed through our system. We also store log data for a certain period of time as well. We do generate tens of terabytes of log data every day, so it's a little hard to hold on to that forever, but we do work through an S3-to-Glacier migration.
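That S3-to-Glacier piece is essentially just a lifecycle rule on a bucket. As a rough, hypothetical sketch of the kind of rule involved (the bucket name, prefix, and 90-day cutoff are made up, and this uses today's boto3 call just for illustration):

```python
# Hypothetical sketch: transition old log objects to Glacier via an S3
# lifecycle rule. Bucket name, prefix, and the 90-day cutoff are made up.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-archive",          # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "logs-to-glacier",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"}
                ],
            }
        ]
    },
)
```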
So, as I mentioned, we basically use compute everywhere, and when I say everywhere, I mean everywhere in our entire life cycle. Every dev has access to either AWS or Tapjoy One, which is our new OpenStack deployment. Any developer at any time can spin up a VM in either of our clouds and has access to a large number of instances to play around with. This is great for R&D. It's great for simulating changes. It's great for trying to learn how your application is going to perform under certain workloads. It's great for saying, well, I need to see how this behaves when I actually load in 10 terabytes of data. It's hard to do that on a MacBook Pro; I don't think there are any MacBook Pros with 10 terabytes of storage yet. So it's useful to be able to provide this sort of access to all of our developers. We rely really heavily on this concept that compute is elastic and you should have access to it anywhere.

We also do some pretty heavy testing with the algorithms that we're putting into our Hadoop cluster. We actually run two separate Hadoop clusters in our Tapjoy One environment. One is for R&D and development, and the other is our core production infrastructure. We keep those separate to make sure that we're not slowing down our normal production workload in Hadoop. We also practice for failure across the board. We simulate what happens if a node goes away, or if we terminate an entire cluster or an entire set of instances. It's useful to be able to do that in a large-scale environment where I can say, okay, I have 24 nodes in production; if I have 24 nodes and I lose three, what exactly is the behavior once those go away?

So we went hybrid. We're not all in AWS anymore. Despite the fact that we spend millions of dollars there, we decided that it just wasn't going to perpetually fit the use case that we were looking for, and it wasn't going to be cost effective over the long term. So what we decided to do is pick a particular workload to take over and port into Tapjoy One, which is what we were calling our deployment as we were working on it.

So why data science, or why the Hadoop workload? Well, we kind of had the convenience that the data science team had already moved a bunch of their infrastructure around twice over the last year. They were having a bunch of problems with it on AWS. They tried running it on bare metal and were still having a lot of trouble. We were using a particular provider, and they were complaining that it felt like every single time one of their nodes had an issue, somebody was yanking out a hard drive and throwing in another bad hard drive. So we were just getting really frustrated with the situation. We also felt it was a bit lower risk than our normal Tier One production services. If your Hadoop infrastructure goes down for 10 or 15 minutes, it's usually a lot more tolerable than an entire cluster of app servers disappearing or going away. So we really wanted to give ourselves a bit of a hedge, where we weren't going to shoot ourselves in the foot by moving our infrastructure all in one go.

We also decided that we really wanted a partner to maintain our OpenStack deployment for us. We're really good at operating on top of AWS, but we don't necessarily have that core expertise of how to build or run our own AWS. I do have basically the entire team from MetaCloud that's been working with me sitting in the front row, so, spoilers, they were the ones that we ended up partnering with to help maintain our OpenStack deployment. Our general philosophy is that we want to build our business and we want to operate our apps.
We don't necessarily want to have to go from ground zero to build our own OpenStack deployment. Over time that may change as the OpenStack community becomes more mature and there are more patterns for how you do that, but we just didn't feel like that was the responsible thing for us to do with the time and the resources we had available.

So what did our timeline look like? I alluded to this a little bit yesterday, that it took us a year to get up and running, and this is basically what it looked like. Right around April of 2013, we started this discovery period where we were looking into how we would move a bunch of our workloads. We didn't even know which workloads we would change; we were assessing everything. We basically documented what it would take if we wanted to move all of AWS. What if we just wanted to move data science? What if we moved half of AWS? So there was this entire discovery period where I have Google Doc after Google Doc after Google Doc of what we would have done in scenario A, B, C, D, E. We put together all these assessments and all these ROI estimates of how much money we could save or how much more capacity we could get. We worked on this for the better part of three months, and we talked to every vendor under the sun. It felt like I had more phone calls with people than I care to remember. I didn't feel like I actually got any work accomplished for months, but that's sometimes how these projects go.

Right around September of 2013, I flew out to San Francisco with my counterpart on this project, James Moore, who I mentioned yesterday, who helped design a bunch of the infrastructure. We made our first internal pitch to the CFO, the Chief Product Officer, and our CEO. It went surprisingly well. I think when you're presenting the options you have for moving out of AWS, or off of the bare metal servers we were moving off of, they can definitely see the value; they can see the return on their investment, where their money's going to go and why it matters. So from a pitch standpoint, I think it's actually a fairly easy pitch to make, especially to an executive team. Internally, in terms of an engineering pitch, I think it can sometimes be a bit harder. You can have a team that's more entrenched in AWS or in the technology you're already using, and convincing other engineers that no, no, no, we do need to make a move can be a lot harder.

We then locked down our hardware choice. We vetted it with MetaCloud; they basically said, yeah, we think this will work fine, kind of shrugged their shoulders. They were like, we'll be fine, it'll be good. And so we went to the board. We pitched everything to our entire seven-, eight-person board, we presented it, we got the thumbs up in the meeting, and we were a go. So we locked in all of our vendors and started all the contract processes, getting everything signed; lawyers talked to each other, lawyers complained at one another, finance people complained at one another. And two months later, yes, it took two months to get all of our contracts sorted out, we actually had hardware ordered and our 2014 budget outlined and confirmed. So we ordered in December. We were supposed to have our gear shipped sometime in early spring.
I think the original date was end of January, which then became end of February, which then became end of March, which then finally became end of April, when we actually got our gear delivered. That was pretty frustrating, and it's probably one of the biggest downsides to ordering your own hardware: it does take time. In our case, we ran into a part shortage. There was one part missing in our infrastructure, and one missing part is all it takes for the whole thing not to get delivered. So that is definitely one of the challenging parts about moving to something like an OpenStack deployment where you're running your own infrastructure and ordering custom infrastructure. You're not starting with something somebody else has already printed out that you just grab off the assembly line. There are companies that do that; there are ways to basically guarantee you're going to get infrastructure in a six-to-nine-week window pretty consistently. But 16 weeks, give or take, isn't wildly off from par for the course. We were probably on the wrong side of the mean, but these things happen.

So by the end of June, we were completely green. Our entire OpenStack deployment was up and running. It had been tested, beaten on, benchmarked up the wazoo. We started moving all of our data over into our Hadoop infrastructure, and two weeks after we went green, we were live and completely cut over. It took two weeks with zero downtime. Honestly, I find that to just be amazing. We moved 180, 190 terabytes of data in two weeks. It was really good. The guy who worked on that, Robin Lee, did an unbelievable job, and I hope he got a nice big bonus for the hard work he put in, because he was doing it from China at the time. So we were coordinating from the East Coast of the United States with China on a daily basis.

So, as I mentioned, the vendors really did matter for us. We had two really good vendors that worked with us throughout the project. MetaCloud, as I mentioned, my little cheerleading group in the lower right-hand corner: they helped verify our design, power our OpenStack deployment, and provision our network. They actually worked on the entire network design for us. That was a little bit of a one-off for them, and I know it's been a challenge, so I really appreciate the hard work that you guys did on that. I'm so sorry. And basically it allowed us to focus on what we were more experienced with, which was spinning up the rest of the infrastructure and our core applications.

We also partnered with Equinix on this. They were really great in helping with the cooling and power designs, because, as I mentioned yesterday, we run about 17 kVA per rack under full load. That's a fairly heavy workload, especially for the facility we're running in on the East Coast, which is roughly about a mile away from AWS on the East Coast. They also have some really good remote hands, and they really went above and beyond on a couple of occasions, especially with the hardware delays, and made it a lot easier on us.

So, the full list of the people that we used: MetaCloud, Equinix, and Quanta, who supplied all of our hardware and gear, which I saw a couple of people picked up on yesterday during the presentation. We use Cumulus as the operating system on top of the switches.
We have Level 3 as our IP transit provider, and, as anybody who's tried to build out a colo knows, you end up buying random cables from Newegg. We were no exception.

So, some of the challenges. As I mentioned, the delays killed us. They really did. It kind of hurt our budget; I'm still paying for it today, and I'm looking forward to a fresh budget in 2015. Setting up the IP transit can be very slow. It's a surprisingly long process to get a link set up, and I don't know why. Somebody's got to figure out how to do this more efficiently; I just don't understand it. Maybe companies have been watching how Comcast and Time Warner do cable setups for home delivery and they think that's okay for business too. I don't know. It doesn't really make sense to me.

We don't actually have a physical presence in DC, which is where our colo is. We have Boston, Atlanta, San Francisco, and Seoul, South Korea. So anytime we actually needed to go down to the facility, we had people either taking the train, driving down, or taking a flight. This is both a pro and a con. I actually like that we don't have anybody down there, because it means my guys aren't having to go down to the cage at three o'clock in the morning to fix an issue. But that also comes from how we design our infrastructure, in that we don't care if a node goes offline. We don't need to swap out a hard drive in the middle of the night. It just goes offline, and it's okay, and we'll take care of it tomorrow. Or, in our case, sometimes we just take care of it a couple of weeks later and kind of bundle it all up and deal with it whenever we need to. We have yet to have a reason to even file a remote hands ticket in the middle of the night. So I actually like the fact that we're separated from a distance perspective, but it can make things a little bit tricky when you want to get something done and you need to file a ticket for it.

The other challenge we had, as I mentioned before, is that we really didn't have an internal success story around OpenStack and doing this sort of migration, so there was a lot of skepticism inside the organization. Engineers tended to be, as I said, the biggest skeptics. I remember the day that we actually went live. I was mentioning it to an engineer, or no, I'm telling the story wrong: I was in the office and we had made some sort of press release, I think maybe it was with MetaCloud, that we had just gone live, and I had one of my coworkers ask me, why did you just lie about the fact that we're running OpenStack in our engineering organization? I was like, no, no, no, we actually are. The engineering team in general was just kind of blasé about it; it was nothing. The team that knew about it asked a lot of questions; for the rest of the engineering org, it's just another set of compute nodes, so they didn't even really care or know that it existed. So it's a little bit of a challenge to try to get people bought in.

So what parts were not so glamorous? A lot of this was the annoying long-tail part: negotiations can be long and exhausting. As I mentioned, I was on the phone all the time for the better part of six months, and as an engineer, that's not exactly what I want to be doing with my time. The turnaround time on getting these things processed and getting contracts signed and getting lawyers to review things can be very frustrating as an engineer. You just kind of want to move forward as quickly as possible.
So I think there's room for improvement there across the industry. You probably need a Gantt chart, and as an engineer, I don't want to have Gantt charts. Fortunately, I have an engineer who works for me now, who wasn't working for me at the time, who loves Gantt charts. So if, when, I shouldn't even say if, when we do our next deployment, he's already got a Gantt chart made for all the different things we need to do along the way. And to be perfectly honest, there's nothing agile about writing a big check for a bunch of infrastructure. That's part of the beauty of EC2 and any cloud service like that: it's nice to be able to spin VMs up and down as you want, because you're not writing a check when you do it. But at the same time, there's something smart about having to plan through all of this and actually know what you're doing with that money, why you're spending it, and how you're going to use it, which you don't necessarily get with a normal cloud provider.

So what did we build? Well, I showed this yesterday, and I'll show it again. We built 348 all-purpose nodes. These were meant primarily to serve Hadoop and HBase for us, but they were also meant to be recyclable in the event that we decided to move this workload off of them after a year or two. We're actually still assessing this right now. We may repurpose Tapjoy One for a bunch of app servers going forward, and we may build out a Tapjoy Two and move Hadoop and HBase over to the new infrastructure. We're keeping our options open. We haven't really decided how we want to do it yet, but we kind of view the Hadoop infrastructure as being somewhat in flux every 18 months or so, while the app servers are still relatively static. So we're viewing this as a rolling cycle where we'll actually roll our infrastructure forward to the next deployment. We'll see how we end up using it going forward. We also have those 12 management nodes that I mentioned before. These are just your larger instances: more RAM, more SSD, and a bit more network as well.

This is the glamour shot. I should have included this yesterday; I don't know why I didn't. Those are actually our three racks sitting in the facility on the East Coast. I actually haven't gotten to see them in person. I haven't had to. We originally had a goal of never having to go to the facility even once. We ended up having to go down, I think, twice, plus MetaCloud made a trip out as well, but for the most part we haven't been down there since June, which is quite nice. So that's the three racks.

This is a graph that shows where we were and where we went to. The blue line is basically the old infrastructure that we replaced, and the red line is our new infrastructure in Tapjoy One. You can see that, depending on the metric you're looking at, we went up in some cases over 10x in terms of our actual capacity gains, and we spent the exact same amount of money as we were spending before. We basically traded monthly OPEX for an upfront CAPEX investment. For those of you who don't know: that's operating expense versus an upfront capital investment, where you're outlaying a bunch of cash up front. And this is what we got. So for throwing a bit of upfront cash at the problem, we got huge gains in our infrastructure. I mean, just absolutely massive gains.
When we were trying to put together information about what all of this would look like in terms of cost if we used AWS, if we used SoftLayer, if we used Rackspace, or if we built it ourselves, nobody even came close to us building it ourselves. And by the way, when I say the same price, I mean the same price all in. That includes the head count that we're accounting for with MetaCloud. A lot of times when you're putting together this kind of deployment, you're not including the fact that you're going to hire two engineers, or basically have two and a half full-time engineers working on your problem. We took all of that into account when we were putting this information together. So I'm very proud of what we were able to do here. It wasn't the absolute best configuration we could have come up with, but it was the best configuration with a healthy hedge against it, with these recyclable nodes that we came up with. A lot of the time you'll see people who build their infrastructure down to every last bit of power and raw CPU they can eke out, just so they can get the most performance. We didn't quite go to that extreme, mostly because we wanted a hedge.

So, some diagrams, just because I want to show you all what our infrastructure looks like, how it behaves, how it works, and how it now works with our OpenStack deployment as well. Let me make sure I'm not, whoop. All right. So, as I mentioned, we have a fairly large AWS infrastructure still in place. It basically starts with the mobile networks. As I mentioned, we're a completely global company. We see users in almost every country in the world, probably except for North Korea. And I think we actually have seen North Korea once. We have, yeah, he's saying once, we've seen it once. So Kim Jong-un does like Tapjoy, would be my assumption.

So basically all the mobile networks flow into us and hit our regional endpoints. We actually set up SSL termination endpoints all over the world to bring SSL termination as close to the end user as possible. I don't know if any of you work on an application that's heavily used by mobile devices, but SSL is really painful over mobile because you eat the latency between you and the tower. So what you always want to do on a mobile device is reduce the number of hops and reduce the latency as much as you can, and bring yourself as close to the device as you possibly can. So we have edge nodes serving our SSL termination. Those regional endpoints then flow down over a persistent SSL connection to our primary load balancers, which are all housed in US East; we're primarily on the east coast of the United States right now. Those tiers then flow down to a routing layer. As I mentioned before, we use HAProxy and Nginx quite heavily internally to load balance and also to route traffic to different clusters inside of our application tiers. Underneath that routing tier is server cluster one through N; I don't know, we're probably up to like 20 now, 18, 20, something like that. It's a fairly large number, and that's all being handled at the routing tier. So we're doing matches based off of URL parameters, DNS, there's a bunch of different things that we're doing.
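Just to illustrate the routing-tier idea, here is a toy sketch of matching a request to a cluster. This is not our actual HAProxy or Nginx configuration; the cluster names, backends, and rules are all made up.

```python
# Toy sketch of the routing-tier idea: pick an app-server cluster based on
# things like the URL path and host. Cluster names and rules are made up.
from urllib.parse import urlparse

CLUSTERS = {
    "offerwall": ["10.0.1.10", "10.0.1.11"],   # hypothetical backends
    "video":     ["10.0.2.10", "10.0.2.11"],
    "default":   ["10.0.9.10", "10.0.9.11"],
}

def pick_cluster(url: str, host: str) -> str:
    path = urlparse(url).path
    if path.startswith("/offerwall"):
        return "offerwall"
    if path.startswith("/videos") or host.startswith("video."):
        return "video"
    return "default"

# e.g. this request would be sent to one of the offerwall backends
backends = CLUSTERS[pick_cluster("https://api.example.com/offerwall/v1", "api.example.com")]
```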
That traffic then, depending on the cluster, will flow down to this VPN service proxy that routes traffic down to Tapjoy One whenever necessary, and I'll show a little bit more of a detailed diagram of that. So what does that look like? Well, at the level of an app server, you're basically running a core service: in our case, let's say an offer wall, or you're showing a display ad, or you're showing a video. That service is serving up requests to the end user, but what we need to be able to do is optimize the response that we're sending back to that user.

Tapjoy is a rewarded advertising company. What that means is that we're not just showing you an ad that you're then going to look at, where we earn based off of the 1,000 impressions we get. We actually earn money based off of whether or not you choose to interact with our ad. Not even click on the ad; you actually have to interact with the ad. A lot of that is: you watched an entire video, or you completed a survey, or you actually played a quick little game on your screen. All of those things count as an engagement, and that's how we earn our money. In return for engaging with an ad, a user earns currency in their game, or earns some sort of credit, or unlocks a level, or whatever the case is. So we're always trying to choose the best possible ad, because the right way to get users to interact with the system is to have it work well and to show them things they actually want to interact with and care about.

So all of that flows through this optimization service, which is powered by Tapjoy One. What you can see is that we have a service that flows through this concept called a circuit breaker. We call it a circuit breaker because we never want to assume, whenever we're crossing service boundaries, that the other service is going to work properly. That is a great design pattern to follow wherever you possibly can in your code and in your infrastructure: never assume that the other side is always healthy. Once you start to assume that, that's when your services start to go down all the time, because when you start to have 20 services, you're increasing the surface area of failure, and as you increase that surface area, failures will just start to happen more frequently. That's just how it works.

So what we do is we have the circuit breaker watching the traffic as it flows through to this endpoint, which either hits a cache or hits our VPN tunnel and goes off to HBase to make some determinations about what the right ad to show to that user is. We actually do user-level targeting: we're looking up, on a per-user basis, what the appropriate ad to show is, and then we return that value back. Now, in the event that that endpoint is for some reason unavailable, because the internet, we flow to a fallback database which has a cached version of some lists that we'll work through, which are relatively well optimized, but not necessarily the best possible answer. That fallback database is constantly being populated and worked on by Tapjoy One behind the scenes. There's then another fallback that I didn't even draw on this diagram, where if that fallback database isn't available, we have yet another fallback in place as well. So you can see we just layer on these multiple levels of fallbacks, so that in the event that anything isn't behaving properly, everything just sort of keeps running.
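To make the circuit breaker idea concrete, here is a minimal sketch of the pattern. It is not our actual implementation: the optimization call, the thresholds, and the fallback lookup are all hypothetical stand-ins.

```python
# Minimal sketch of the circuit breaker + fallback pattern described above.
# The optimization call, thresholds, and fallback lookup are all hypothetical.
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback, *args, **kwargs):
        # If the circuit is open, skip the remote call entirely until it cools down.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return fallback(*args, **kwargs)
            self.opened_at = None          # half-open: try the real call again
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            return fallback(*args, **kwargs)

def best_ad_for_user(user_id):
    """Hypothetical call across the VPN to the HBase-backed optimization service."""
    raise ConnectionError("pretend the other side is down")

def fallback_ad_for_user(user_id):
    """Hypothetical lookup against the pre-computed fallback database."""
    return {"ad": "reasonably-well-optimized-default"}

breaker = CircuitBreaker()
ad = breaker.call(best_ad_for_user, fallback_ad_for_user, user_id="abc123")
```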
So it's just something that I think is very useful to keep in mind, and we try to preach it internally as best we can.

The other component that we're using Tapjoy One for, primarily, is this data pipeline. For us it's all about Hadoop and HBase, basically big data and all the algorithms that we run. All the mobile devices connect to our app servers, and we generate all this data; as I mentioned, we generate tens of terabytes of data every day. That flows through a data aggregation service that we operate internally, called our Reaper system. The Reaper system then dumps data into a batsd deployment, which we use for some internal monitoring and detection services. It also flows down into S3, where we do our bulk storage for all of that data. And then Tapjoy One actually consumes all of that data from S3: it's reading all of that information in, processing it, ETLing it, and loading it up into HBase in a more normalized format. It's also dumping it off to a Vertica cluster that we operate behind the scenes as well. You can see that I drew the box for Tapjoy One around Hadoop and HBase and not Vertica. Vertica is probably going to be in Tapjoy Two. We've been sort of putzing around on it for a while, but we're looking to move that over as well.

So, as I mentioned, you should plan for failure. I assume we have some Ceph users potentially in the audience, EBS users potentially. You'll notice that I haven't mentioned EBS or Ceph once in this entire discussion. We just don't use them. We just don't. That may be a little bit naive on my part and on my team's part, but as I mentioned, we try to think of everything as ephemeral, and whenever you start to rely on the fact that I've written something to disk, and that's being replicated somewhere else via that disk, and I need to be able to remount that disk somewhere else to recover the data I'm working on, I get really nervous. I don't know, I'm not sure why. I tend to trust application replication more than disk replication and having to go through the process of remounting disks. It also means that it takes more time to recover your information. Amazon also notoriously had a horrific EBS outage, I guess three years ago, three and a half years ago, and they periodically have blips on EBS to this day. It's never felt like a really reliable service to me. So when I talk about building services for our own infrastructure, we just don't use it. We tend to focus on replication up a layer; for us, that's the application layer.

So always think about service boundaries when you're defining this stuff. Have hardware contingencies, have software contingencies, run a backup link, which is a little funny for me to say. Use temporary caches in the event that something goes down. Always have a fallback plan, no matter what. If you're building your own new OpenStack infrastructure, make sure you have a clean fallback plan for when anything actually falls over. Because it will. Things break. That's just the nature of engineering. Nobody's ever built a 100% reliable service that's run for years and years and years. Or maybe they have and I just haven't heard about it, because Lord knows Google certainly hasn't, and they really do try.

So, my info, as it was yesterday: Dusty West on Twitter, and my email is west at Tapjoy. I figured I would leave the last 10 minutes for anybody to ask any questions they might have, or just generally chat with me afterwards. So thank you very much.
And let me know if you guys have any questions. Thanks. Yeah.

So Nicky asked the question of, why not run it, why not run big data on bare metal, basically, right? We actually got asked the same question when we were designing the infrastructure. To be perfectly honest, it really just doesn't matter that much. Maybe if we were trying to, as I mentioned earlier, eke every last bit of performance out of every single node, it would matter. But we scale everything horizontally; we don't scale things vertically. And when you're talking about bare metal versus virtual, you're talking about scaling vertically: you're trying to get that last little bit of performance out and you're trying not to take too big of a hit. Now, you do eat some extra latency with the virtualization layer, so things take more nanoseconds than they normally would. For us, it just hasn't been a problem at all. We moved from virtual to bare metal back to virtual, and the most stable deployment we've had to date has been Tapjoy One. It wasn't our bare metal infrastructure, and it wasn't AWS, in terms of how it's performed; Tapjoy One has been significantly more stable for that group. So from my perspective, I'm of the mindset that you virtualize everything until you get to the point where there's such a big compelling reason why you can't, because you're running some gigantic Oracle infrastructure and it's got to be using every last terabyte of RAM you have available to you. We just haven't run into a reason why it matters for us. Now, that doesn't mean it doesn't matter for somebody else, and that's totally fine. But for us, the manageability of a virtualized environment is worth so much more than that last bit of performance I'd get out of dedicated instances.

Any other questions? Yep. So the question was, what do we use to replicate between application layers? We actually do replication in the application, with whatever systems we're running. One of the big systems we operate is Riak, for example. Riak is a distributed NoSQL store, similar to Cassandra; it fits that Dynamo model. So whenever data is written, it's automatically written to multiple locations, and it doesn't respond with an affirmative that it has written that data until it has some sort of a quorum. That means it has to be written to K locations. That K could be one, that K could be three, that K could be 10; you basically define your replication factor. So for the applications that we use that store and retrieve data, we specify a replication factor, and that's actually happening at the moment the data is being written and/or read. In the case of Riak, if I lose a node, it can either actively repair that information behind the scenes, or the next time a request comes in for one of those objects or that piece of data, it repairs it automatically. So that's what I mean when I say we do replication at the application layer, rather than relying on an EBS volume under the hood or a Ceph volume under the hood, where if the instance fails but the drives behind the scenes haven't, you can bring up a new instance and remount the drive. We don't go through that process. We just terminate the node and move on. We just walk forward.
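To make that quorum and read-repair idea concrete, here is a toy sketch of the Dynamo-style behavior described above. It is a conceptual illustration only, not Riak's actual client API, and the data in it is made up.

```python
# Toy illustration of quorum writes and read repair (the Dynamo-style
# behavior described above). Not Riak's real API -- just the concept.

class TinyQuorumStore:
    def __init__(self, nodes=3, n_replicas=3, w=2, r=2):
        self.replicas = [dict() for _ in range(nodes)]  # one dict per "node"
        self.n, self.w, self.r = n_replicas, w, r

    def put(self, key, value, version):
        acks = 0
        for replica in self.replicas[: self.n]:
            replica[key] = (version, value)      # pretend this is a network write
            acks += 1
            if acks >= self.w:                   # ack the client once W replicas confirm
                # (a real system keeps writing the remaining replicas asynchronously;
                #  we just stop here for brevity)
                return True
        return False

    def get(self, key):
        reads = [rep.get(key) for rep in self.replicas[: self.r]]
        newest = max((v for v in reads if v is not None), default=None)
        if newest is not None:
            # read repair: push the newest version back to any stale or missing replica
            for rep in self.replicas[: self.n]:
                if rep.get(key) != newest:
                    rep[key] = newest
        return None if newest is None else newest[1]

store = TinyQuorumStore()
store.put("user:42", {"currency": 120}, version=1)
store.replicas[1].pop("user:42")      # simulate losing one node's copy
print(store.get("user:42"))           # quorum read still succeeds and repairs the gap
```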
Other questions? Yep. Two questions. Yep.

So I'll answer the second question first and then go back to the first one. For us, we did everything over a three-year cycle, because that was basically the warranty agreement that we have on the hardware. Our gear has next-business-day repair on all of our infrastructure through three years from the delivery date, so sometime in May of 2017, I think. And so we basically did a one-to-one comparison: assuming our infrastructure only lasted to the day our warranty expired, that is our comparison. So when I talk about ROI, I'm not even cheating by saying it's a five-year ROI on buying my hardware versus a three-year lease on cloud infrastructure, like you would with an Amazon heavy reservation. If this gear lasts for five years, it's just house money at that point, and that's great. And with the gear that we have there, you're going to have more hardware failure after three years, but in terms of performance, it'll be fine for a big chunk of the workload we'll be using it for. And since everything we build can already tolerate node failures, we'll just run it into the ground, basically.

The first question: I'm actually probably the last person you should ask about the networking stuff. A guy by the name of Adam Bell, James Moore, and a couple of the people sitting in the front row worked on a lot of the networking. So if you want to ask about the networking, Rafi might be a good person to ask after the talk. There's also another MetaCloud employee who did a ton of work on this stuff, Dave Pippinger. He's sort of been our dedicated network guru and general MetaCloud guru, and he did a bunch of work on that as well. We actually have a transition that we're working on for our networking infrastructure, so there's what's running today and what we're going to be running, hopefully soon. So the answer is a little bit in flux; I'm not trying to dodge it. But in general, I don't think we've had any issues specifically with Cumulus. Rafi could probably speak to that. He's shaking his head, no, I think we've been fine.

Other questions? Yep. So MetaCloud kind of chooses that for us. I believe when we deployed we were using Havana. Is that right? Sorry, Grizzly, Grizzly. So it's basically at the behest of MetaCloud and how they want to do the rollouts. For us, as you kind of saw, we're very vanilla, very plain, and so from that perspective a lot of the upcoming releases probably matter less to us than they do to the rest of the community. There are certain things that I'm excited about. I'm excited about some of the managed database stuff that's getting worked on. I'm excited about Zaqar, which is why we want to contribute back to it. There are a bunch of those peripheral services that we use inside of AWS that are now starting to hit that 1.0 cycle inside of OpenStack, so as we progress, I'm looking forward to that stuff happening. But for us, it's compute, and compute works quite well on Grizzly, and it works quite well on Havana. Our Grizzly deployment is a good one. And, yeah, I called it the MetaCloud variant yesterday and got teased a little bit on Twitter about it; I'll call it tailored by MetaCloud. I don't know, it doesn't matter. For us, the stability of OpenStack itself, the actual deployment of OpenStack, has been superb throughout the entire time it's been running.
So, you tend to look at upgrades when you're looking for a very specific feature or you're looking for more stability. Well, we don't have stability problems, and at the moment we're not looking for new features, so we tend to be probably more on the slow-to-upgrade side. We just aren't in that I-want-to-upgrade-everything-every-six-months sort of cycle. I want to keep up to date, but I don't necessarily need to be upgrading constantly, which I think are two different things. When we run our infrastructure, I keep my database versions as up to date on the minor releases as I can, and then we do a major release maybe about once a year, just because there's so much testing and development time that goes into it. And you don't want to be cavalier with a massive infrastructure; we just can't do it. If we were a six-person shop and I had 20 nodes in EC2 or something like that, yeah, I'd probably be more cavalier about it, because, okay, I go down for two hours, I just roll back and it's fine. I can't eat two hours of downtime. I just can't. I'm targeting three nines a quarter for my entire infrastructure, everywhere. So it's hard.

Other questions? Anybody? Yeah, so that's a good question. When I say we're heavily compute focused, I'm obviously talking about the fact that we operate a lot of our own systems on top of compute, as opposed to using a lot of the infrastructure that something like AWS provides, like Kinesis or DynamoDB, things that are very lock-in centric. We build our own systems on top of compute. So, your question about this data: yes, we absolutely could use a significantly larger amount of storage long-term, and it would change how we think about operating engineering going forward. I actually mentioned in my talk yesterday that I'm looking forward to seeing if anybody does anything with the Backblaze pods, the 180-terabyte pods that just crushed the pricing for S3. The problem is that it's harder to make the case for why adding a lot more storage is ROI positive. Storing more logs is not necessarily something you derive immediate benefit from. Something the engineering team wants is better search around our logs, as my engineering counterpart nodding his head can confirm. We want to be able to add things like full Elasticsearch over all of our logs across our entire system. When I priced it out earlier in the year, it was going to cost us something like $20,000 a month to operate Elasticsearch for our log data, and that got me a day of retention. That's just the volume we deal with on a day-to-day basis. So when you're talking about adding 10 terabytes of storage, or 100, or sorry, 100 petabytes of storage or 10 petabytes of storage, you've got to have an ROI on why you're doing it. Either my data science team needs it because they need more data, and at the moment they say they don't; we basically ETL that data down to a distilled value that can be stored in the hundreds of terabytes, as opposed to the petabyte range. And then we just use S3 as that long-term storage, and Glacier is great because it gets even cheaper, and we don't tend to need to get that data back out, but we can hold on to things forever. Long-term, yeah, I can't imagine we're going to be under 50 petabytes in more than a couple of years. We're going to be way past that, I'd have to imagine.
Did I answer your question? More or less? It's a good start. It's a good start. Any other questions? Yep?

Yeah, roughly, on a day-to-day basis we're transmitting that data over after it's been pre-processed a little bit, then stored in S3, then aggregated, and then ETLed again, basically. I think we burst up to a couple of gigabits a second, or sorry, we burst up to a full 10 gigabits a second during normal load, but we're probably averaging somewhere around 800 to 900 megabits a second through our primary link, and most of that is just consuming data out of S3 continuously over time. I'm not exactly sure how many terabytes that is a day, but it's somewhere in that ballpark.

Yep, we're working on it. Yeah, we haven't done it yet; we're going to do it in early 2015. We'll probably move a fairly big majority of it, because the pricing's fine, the pricing isn't painful, so it's really the latency that matters there. We do run into issues where, over the public internet flowing between our two facilities, despite how close together they are, there are choke points in the AWS infrastructure, and if you're not on a Direct Connect, you're going to hit them going in and out. So we hit these big 60, 80 millisecond lags between the two pieces of infrastructure, and that sets off alarms in our system. So we want the Direct Connect so we get a better SLA.

So, I think we've got a bunch of people waiting to come into the room, so I'm going to call it, but thank you, everybody.