Ladies and gentlemen, please welcome Vice President of Technical Operations, Shutterstock, Chris Fisher.

Hello everybody. I'm going to do a quick test flip on these slides. Cool. For everyone that was here during the keynote, I talked a little bit about Shutterstock and my role there, but I'll reiterate: I'm the Vice President of Technology Operations. Effectively, what that means is I'm responsible for managing our web ops, dev ops, data infrastructure, and databases, essentially all of our operations teams, as well as owning about 50% of our core services and development architecture.

I'm going to skip past this one. To give a little more context on the environment, and in this talk I'll go a little deeper: at Shutterstock we're deploying a lot. We're pushing code out to the Internet between four and six hundred times a month. Every single developer in our organization is able to ship to production, and it was literally only two weeks ago that we moved into a fully built CI environment. It's pretty exceptional that we ran for ten years with a lot of really talented devs pushing code live without a full CI suite of Jenkins jobs or other unit tests running before every deploy. We've got dozens of services, we're moving 10 to 15 gigabits a second of internal service communication between all the nodes that talk on our network, and it's written in a lot of different languages.

We won't drill into this too much, but a lot of the need for OpenStack at our organization is because we needed a lot of metal to do all the types of processing that we do. We're working with images and video, tons of transcoding, lots and lots of storage, and we needed to be able to do it in a highly competitive way from a cost perspective. A lot of the things that now exist, S3 and other cloud services, frankly just weren't going to be viable from a cost perspective in the long term. So we invested pretty heavily in being able to store tons and tons of data, work with that data, and do so in a way that we're not having to make trade-offs about which data we can keep.

One of the biggest things in running a fleet this size, when you get to 1,000, 2,000, 3,000 nodes, is that you really have to commit to full-on automation. A lot of people talk about this. I'm not talking about a script. I'm not talking about one thing that you roll out. I really do mean building a platform that you can automate against: a framework that developers work with and contribute to, with a standardized way to automate processes, jobs, and things that are scheduled to happen at particular times, knowing that once you've built that logic in, you're never going to have to approach it again.

This is my favorite slide, just because these are the goofiest images ever. Ultimately, every systems guy on our team insists: yeah, I love automation, it's great, I love to write scripts. But they're all still internally a little bit afraid of: what happens if I really automate this now? What am I going to do next?
It's been a lot of work to build a culture inside Shutterstock that's focused on the idea that the more things we can automate, the more things we remove from our lives, the better we're going to be able to approach a new set of challenges or a new set of ideas. This is the cornerstone of how we approach our operations groups: we want everyone embracing the idea that if we can take the things we usually would have to do manually, or even run as a one-off script, and get away from that type of world, everyone can build real software. And we've got some pretty highly functional pieces of software that our operations teams have written, doing everything from one-click provisioning across cloud, metal, and OpenStack to big analytics transactions against our big data structures. We collect so much data just from a monitoring perspective that we persist all of it into Hadoop and HDFS, something like 20 million data points per minute. So you can imagine the structures you're creating.

From the compute side, and I'm going to go on here into a little more of how we're actually building out our platform (since this is a larger hall than I had imagined, I guess I'll do Q&A outside if anyone has questions later): OpenStack does all the real heavy lifting. What I'm defining as compute is things that run as a virtual machine or virtual instance, where we don't require physical hardware. We essentially have a thin provisioning API that we wrote called Optopus. It's actually open source. I think the open source version is lacking a lot of the features we've recently developed, but we will be pushing some of those things back into it, so feel free to Google it and check it out. What it does is talk to either Foreman or something else that can provision physical hardware, talk to our storage nodes to set up more storage components, or talk to OpenStack to provision more VMs via its API. This is how all of our developers, as well as our operations groups, get to work with an API layer: not only can we go into this and click on things, we can use the API we've created to orchestrate our platform.

One other unique component: all of the storage for the actual instances that run in OpenStack is stored directly on the disks of the hypervisors. We don't have any centralization; we don't have, say, a SAN or NAS doing that work today. This is along the idea that we really wanted to challenge ourselves to work with things in an ephemeral way, where we don't care about the data that exists on a particular node. When you spin it up, it's there for a particular point in time, and when we delete it, we shouldn't be losing anything. This is a key component when you're doing a single-tenant, multi-thousand-node type of deployment. We're probably going to have to start playing with a little more block storage and get to the point where we can have some persistent disk, because now we've got things like Solr indexes and large components of our data structure that we want to make more cloud-like, and we're definitely going to need some distributed storage for that. I'll talk about that in a little more depth in a minute.
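To make that pattern concrete, here's a minimal sketch of what calling a thin provisioning layer like this can look like from the developer's side. The endpoint path, payload fields, and backend names are hypothetical stand-ins, not the actual Optopus interface; the point is just that metal, storage, and VMs all sit behind one API.

```python
# Hypothetical client for an Optopus-style thin provisioning API.
# Endpoint paths, parameters, and backend names are illustrative,
# not the actual Optopus interface.
import requests

OPTOPUS = "https://optopus.internal.example.com/api/v1"

def provision(name, backend, **params):
    """Ask the provisioning layer for a node; it decides whether that
    means an OpenStack VM, a Foreman-built physical box, or storage."""
    resp = requests.post(
        f"{OPTOPUS}/nodes",
        json={"name": name, "backend": backend, **params},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. {"id": ..., "ip": ..., "state": "building"}

# A developer asks for capacity the same way regardless of what's underneath:
vm = provision("search-01", backend="openstack", flavor="m1.large", image="centos7")
db = provision("db-01", backend="foreman", hostgroup="high-io-db")
```

The value of the abstraction is that the caller just asks for capacity, and the layer decides whether that request becomes an OpenStack API call or a hardware build job.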
This is just a picture of Optopus. As you can see, we are not UI designers. Ultimately, what you're looking at is the number of nodes created, and I think this goes over a span of just a few weeks. The blue line going up is literally the number of running instances; at the end there it says 1,358, and that's for a single data center. You can see fun spikes, like the big blue spike on top: the number of events per hour as we went through a storm where we broke our messaging system, and effectively, when everything came back online, every one of our nodes started talking and creating events at the same time. This type of instrumentation, and these are just discrete events, things like code pushes or bringing up a new node or shutting down a node, this type of visibility has been really critical to our ability to run our environment, and it's the type of development we want everyone to be working towards.

We have this idea of high-IO nodes. These are still dedicated pieces of hardware that are provisioned, again, via an API, but it's kicking off a job to Foreman. I'm sure a lot of people here have worked with Foreman or Cobbler or some other PXE boot environment. We use this for high-IO databases as well as some of our dedicated search instances. In environments where we're working with a large amount of data, or the data needs to be truly persistent for a long period of time, we have this capability, and again, from the development perspective, when we go to set up a multi-master MariaDB cluster in circular replication, people literally just do an API call and a hardware cluster of those nodes is spun up for them. We've got a lot of hope that in the future we can take some of the code we've written and integrate it back into working more directly as either part of OpenStack or something very tightly coupled to it.

The next thing we want to get into, and we'll talk about a couple of pieces here because we're so storage heavy, is the couple of different ways we're deploying storage in the environment. We have a system called MogileFS, which is effectively, if you look at this little diagram (not the actual chassis), a system with a MariaDB database that stores metadata, literally a server location and an actual disk that sits on a Linux server and stores a number of files. Literally, that's it. It's not super complex, but this system right here is six petabytes and stores over a billion file IDs. With this, we can do super dense, super cheap storage; this is a Backblaze pod that we pulled from their site. Or we can put in SSD tiers, storage with higher IO performance. So we've got a mix of long-tail stuff that sits in this, plus a higher performance tier, things that allow us to put in different types of faster spinning disk, using a combination of Coraid ZXs or SRXs or other things that give us the ability to interact with a more robust storage appliance. One other thing to call out here, where the diagram says downloading photos or managing MogileFS: we contribute to this code a lot. It was originally written by Brad Fitzpatrick, the creator of memcached. Now it's maintained by some guys on that project and, frankly, us. We're writing a lot of code that contributes to this project.
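The two-tier design is easy to picture in code. Here's a conceptual sketch, assuming a metadata store that maps a file key to the URLs of its replicas on plain HTTP storage nodes; the key format and paths are made up for illustration, and this isn't the real MogileFS tracker protocol or schema.

```python
# Conceptual sketch of the MogileFS idea: a metadata store mapping
# file keys to replica locations on plain HTTP storage nodes.
# Keys, paths, and ports are illustrative, not the real schema.
import random
import requests

# In MogileFS this lives in MariaDB; a dict stands in for it here.
metadata = {
    "photo:123456.jpg": [
        "http://stor-node-01.example.com:7500/dev3/0/000/123/456.fid",
        "http://stor-node-07.example.com:7500/dev9/0/000/123/456.fid",
    ],
}

def fetch(key):
    """Look up the replica locations for a key, then fetch from any live one."""
    paths = list(metadata[key])
    random.shuffle(paths)           # spread read load across replicas
    for url in paths:
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException:
            continue                # that replica is down; try the next
    raise IOError(f"no live replica for {key}")
```

Because the metadata tier and the storage tier are separate, each can be scaled and checked independently, which is exactly what the next point is about.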
With it, we're able to scale out the ability to query our data sources independently from the ability to check and make sure all these files are in the right place the right number of times.

The next piece of our storage interface, and again, all of this is something we can drive via OpenStack or via our API, is that we store all of our large-volume storage, Solr indexes, backups, and snapshots of other types of instances on the Coraid ZXs, and we've actually written a bunch of Puppet manifests and modules that let us provision this stuff on the fly. In the ideal world, we'll create a better abstraction tier so that, again, we're not having to interact with config management and can drive this entirely via the API. But it is still pretty great that, from a developer perspective, if someone wants to spin up storage, whether they need one terabyte or two terabytes or ten, they don't have to know, and they don't really care, where it came from; it's just going to get provisioned on this type of cluster. Effectively, to OpenStack, that's just ZFS and NFS, which we feel really excited about working with: even though this is an appliance, those are open standards and things we understand very, very well. This next slide is probably too small for most people to see, but we plan on seeing if anyone's interested in looking at some of these modules, and we'll just give away the code, both for Chef and for Puppet, as well as some of the things we're writing as a more full-fledged API that installs and manages these appliances directly from within OpenStack or your config management suite.

The last piece we're using on this Coraid platform is the NFS integration, but we're trying to get to a point where we can natively provision individual disk sets or volumes. Whether you want a RAID 5 group, a RAID 1 group, or even just a collection of random disks, we want to provision those up to OpenStack instances in a way where you control the block underneath, so you're not sharing tenancy. A lot of people get a long way with just saying: hey, I'm going to create a SAN, everyone's going to connect to the SAN, it's one path up to OpenStack, and all the VMs can share it. We really care about taking a different approach, where for every single instance we create, we're creating an individual block device and individually selecting the disks that participate in that block. That guarantees we're not going to have things like saturation or any kind of noisy-neighbor effects that happen when you're deploying and destroying tons and tons of VMs. We don't want to get to a point where any one system gets really stressed and penalizes the others. This is a really serious thing for us as we distribute out: we just will not embrace anything that's going to force us to heavily centralize. As I mentioned, we want to open source some of these things, so if anyone's interested in collaborating or talking to me a little more about some of the storage components, I'd love to chat with you outside. We're also going to be talking about integrating with other types of block storage and just DAS models.
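Here's a rough sketch of what that per-instance block provisioning could look like on the ZFS side, assuming shell access on the storage node. The pool naming and disk-selection conventions are hypothetical, but the zpool and zfs commands are standard ZFS: a mirror for the RAID 1 case, raidz for the RAID 5 case, and sharenfs to export the result.

```python
# Minimal sketch of per-instance volume provisioning on a ZFS appliance.
# Pool naming and disk selection are hypothetical; zpool/zfs commands are
# standard ZFS. Assumes this runs on (or SSHes into) the storage node.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def provision_volume(instance, disks, layout="mirror"):
    """Create a dedicated pool from individually chosen disks and export
    it over NFS, so the instance never shares spindles with a neighbor."""
    pool = f"inst_{instance}"
    if layout == "mirror":                    # RAID 1 equivalent
        run(["zpool", "create", pool, "mirror", *disks])
    elif layout == "raidz":                   # RAID 5 equivalent
        run(["zpool", "create", pool, "raidz", *disks])
    else:                                     # just a stripe of chosen disks
        run(["zpool", "create", pool, *disks])
    run(["zfs", "set", "sharenfs=on", pool])  # export to the instance via NFS
    return pool

# e.g. provision_volume("search-01", ["sdb", "sdc"], layout="mirror")
```

Because every instance gets its own pool built from explicitly chosen disks, there is no shared spindle for one tenant's IO to saturate, which is the whole point of avoiding the one-big-SAN model.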
The next piece is the idea of zoning and how we handle our networking. In our environment we have this notion of zoning, where every single zone is independent from every other zone. We do this not by using some kind of fancy tech; we just use things that are natively built into layer 3 routing. We run OSPF and iBGP in our environments, and we effectively run anycast on all of our load balancers; in the future, we may run anycast just on our core routers. With that, you're naturally able to leverage layer 3 networking to split requests across multiple nodes. This is a whole two-and-a-half-hour talk in itself, so again, anyone that wants to get real deep into this, grab me afterwards.

Effectively, what we end up with is, within the data center, a provisioning layer that sits above the entire application, so that when a developer interfaces with it, they're automatically getting nodes that exist in both zones. In each one of these zones you have a full set of firewalls, core routers, load balancers, layer 2, and a full copy of the entire data set: everything you need at the persistence tier and database level, as well as everything you need in an OpenStack cluster. Each one of these zones has a dedicated OpenStack cluster, if not two. And if anything happens within an entire zone, I don't care if one firewall fails, or, because there's redundancy, two firewalls fail, or something odd happens in the routing layer, we just shut it down. Shut the whole zone down and let layer 3 routing handle the failover between the zones. This is really nice because we have an older application, and it's not naturally built into the app to handle this type of failover. Doing it this way means developers don't have to be super aware of, or ingrained in, how we're creating this level of high availability.

The limitation of this structure is that I can do it within a data center, or anywhere that's low latency, where I've got very fast metro links, something around 10 milliseconds. But if I wanted to, say, fail over from Boston to Texas, it's a little more challenging, because I can't replicate all of my database data out to that other data center in time to manage the persistence tier. Still, this model can get you pretty far if you're just looking at a singular metro area, a couple of data centers within the same city or neighboring cities, with a high degree of availability.

We've talked through this a bit, but a big part of this zoning effort is that we have, and are constantly revamping, a software-defined idea of our network. We're not using a vendor-specific technology; we literally just wrote an API that uses Trigger, which is a Python library, SSH, and some of the XML and SOAP APIs that exist on our Juniper and Brocade stack. From this, we just roll everything out and integrate it directly within our provisioning layer. So again, developers get load balancing for free, they get the ability to interact with the network for free, they don't really know how it works, they don't need to know how it works, and we haven't put in a specialized solution to do this; it's just some code that we've written.
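Mechanically, zone failover can be as simple as: each zone advertises the same service address, and an unhealthy zone stops advertising it. Here's a rough sketch under that assumption, with the anycast VIP bound to a loopback interface and the routing daemon redistributing connected routes; the VIP, gateway address, and health check are hypothetical placeholders, not Shutterstock's actual tooling.

```python
# Rough sketch of zone-level anycast failover. Assumes the anycast VIP
# is bound to a loopback interface and the routing daemon (OSPF/iBGP)
# redistributes connected routes, so dropping the address withdraws the
# route and layer 3 shifts traffic to the surviving zone.
import subprocess
import time

VIP = "10.10.10.10/32"   # hypothetical anycast service address
IFACE = "lo"

def zone_healthy():
    """Placeholder check. In reality this would verify firewalls, routers,
    and the local data set; here we just ping a hypothetical zone gateway."""
    gw = "10.10.0.1"
    return subprocess.run(
        ["ping", "-c1", "-W1", gw], capture_output=True
    ).returncode == 0

def announce(up):
    action = "add" if up else "del"
    # check=False: re-adding or re-deleting an existing address is harmless here.
    subprocess.run(["ip", "addr", action, VIP, "dev", IFACE], check=False)

while True:
    announce(zone_healthy())   # withdraw the VIP the moment the zone looks bad
    time.sleep(1)
```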
The next steps in this area are pretty interesting, and Facebook has publicized this a bit more recently: you can get rid of the idea of load balancing altogether, or at least your load balancers, and just use equal-cost multipath (ECMP) routing. You can do this with anycast in iBGP or anycast in OSPF: you just have a bunch of routes to a bunch of servers that exist within a pool, and let your core routers shard it out internally. The cool thing about this setup is that you're really buying into the idea that you're not using things like sticky sessions or any kind of really dense network technology; you're focusing on how you shard traffic and do everything in a round-robin sense. When a node fails, you just drop it out of the pool, either with application logic or because iBGP quit exporting it, in which case it falls out of the routing pool.

The last piece is that we also have a VPC into AWS that gives us some flexible, on-demand scaling ability, and again, this is one of the huge things we love about OpenStack and the way we've approached building our provisioning layer: for a developer, they don't know if it's AWS, they don't know if it's OpenStack, they don't know if it's hardware; they get it all via the same interface, the same set of APIs. Once you get into going down this road, there are a lot of different options, and a message I have for anyone here that's thinking about how to start sharding out your data centers is that it's really specific to your environment. There are a bunch of different approaches you could take; this is one that we like, and there are good publications out there from Facebook and Google on how they're approaching their internal network segmentation. Moving in this direction really starts to benefit you if you're a single-tenant website that's scaling up. I say single tenant because it's just easier to manage availability when you're a singular company. But when you get to this point, anything that happens within a data center, within a series of racks, you can just treat like a server: shut it off, let it sit there, let the capacity go.

The last part that's really necessary is that you've got to write a lot of tools to do this, and this is why we're so dev heavy: everybody we've got has to be able to write tools, to work with this environment in a way that they're effectively driving it via software. The next slide is an example of another tool we wrote. On the left-hand side you see some services with green lights; those are checks that happen every single second against every single service we have internally, and those services are just the ones required to run shutterstock.com. On the right-hand side we've got graphs showing today as well as the day before, to look at deviations in the number of requests we're doing, as well as latency, derived in real time from the access logs of all of our customers. So we're putting a tremendous amount of faith in instrumentation and in our ability to leverage these tools to manage our fleet.
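As a flavor of that tooling, here's a minimal sketch of a per-second check loop that also does the day-over-day comparison just described. The service URL, the three-times-yesterday threshold, and the in-memory history are all hypothetical stand-ins for the real internal system.

```python
# Minimal sketch of a per-second service check with a day-over-day
# comparison. Service URLs, thresholds, and the metrics store are
# hypothetical stand-ins for Shutterstock's internal tooling.
import time
import requests

SERVICES = {"search": "http://search.internal.example.com/healthz"}
history = {}   # (service, second-of-day) -> latency, a toy metrics store

def check(name, url, now):
    t0 = time.monotonic()
    try:
        ok = requests.get(url, timeout=1).status_code == 200
    except requests.RequestException:
        ok = False
    latency = time.monotonic() - t0
    second_of_day = int(now) % 86400
    yesterday = history.get((name, second_of_day))  # same second, previous day
    history[(name, second_of_day)] = latency
    # Flag both hard failures and big deviations from the same time yesterday.
    if not ok:
        print(f"{name}: DOWN")
    elif yesterday and latency > 3 * yesterday:
        print(f"{name}: latency {latency:.3f}s vs {yesterday:.3f}s yesterday")

while True:
    now = time.time()
    for name, url in SERVICES.items():
        check(name, url, now)
    time.sleep(1)
```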
So, a quick summary: we use OpenStack for all VMs; we thin provision via a piece of software we've written; we've got a highly software-defined network; and we zone within data centers, and in the future across data centers, for HA. Again, big hall, so we won't take questions here, but I'll sit outside for a few minutes if anyone wants to ask me anything. Thank you very much.