Okay, I think we're good to start. Hello, everyone. My name is Michael Benedict. I'm a product manager on the Cloud and Data Infrastructure team. I work on the Container Orchestration team at Pinterest, and I also manage the Infrastructure Governance and Capacity Planning teams. I have with me my colleague, Lida, who is the tech lead on the container orchestration project.

The agenda for today: we'll start with a quick overview of Pinterest's history, some general stats about our infrastructure, and the current state of that world. Then we'll talk about the larger compute platform we're building at Pinterest, starting from the high-level vision and scope to set the context, and then get into the meat of the presentation, which is the orchestration system, what we are doing, and how we got there. Finally, I'll close out with the future and the next steps going into 2018.

All right. With that, how many of you use Pinterest? Okay, that is great. So here's a quick story. Pinterest was born back in 2010. It started off as a simple Django application running on the cloud from day one; this is what it looked like. The goal for Pinterest at that point was to let people collect and organize ideas they found while searching the Internet. Fast forward seven years to 2017, and Pinterest has evolved into a discovery engine. Today we focus more on discovering pins that are relevant to users based on their interest graph, rather than just generic pins. As you can see, we have also advanced a lot in image detection, detecting objects or items within an image and recommending relevant pins and categories for that particular object.

Today we have about 200 million monthly active users. What's more interesting is that these users, or pinners as we call them, have so far pinned about 100 billion objects; to put that into context, that's a huge number. And the interesting part is that each pin carries a rich set of metadata: who posted it, where it was extracted from, what site, image, or video it points to, and the specific tags users themselves have applied to it. A single pin can be pinned by multiple users, so a lot of context, an interest graph, is generated from a single pin.

To power all of that, we have tons of services behind the scenes. This bubble visualization shows all the serving services, sized by the number of vCPUs they use. Like I said, Pinterest runs on the public cloud. We have been on AWS since day one, and we have grown to become one of their larger customers. To give you a perspective on how diverse our compute workloads are: we started off with our Django monolith, which has now been broken down into a large set of microservices. We have on the order of 1,000 microservices running on our infrastructure, across tens of thousands of hosts. But a large portion of Pinterest's infrastructure is actually on the data processing side. We use Hadoop and Spark Streaming, as well as TensorFlow, largely as part of our machine learning platform. To provide some context, we have on the order of 10^5 jobs running at any given point in time, spanning model training, analytics pipelines, and general compute work as well.
And our footprint on the block storage side is also pretty immense, in the tens to hundreds of petabytes. We not only use it as a storage mechanism, we also leverage it as more of a transactional and analytical backend, so we really push the limits of some of these systems.

The current state of the world looks something like this. Over the years, Pinterest has added multiple layers of infrastructure to suit particular use cases and needs, and what has happened is that we have ended up with this diversity of tools and platforms to support those growing needs. As of today, we already have multiple mechanisms for provisioning resources on top of AWS, different compute platforms to help users run certain sets of workloads, and use cases that are diverse and keep growing over time. Unfortunately, this also brings a whole new set of challenges around tech debt and general end-to-end developer experience. Meaning, if a person wants to launch a data job, they have to follow ten different steps to get to their end goal, while launching a long-running service requires a whole different set of steps. You can imagine the permutations and combinations that compound this complexity for a developer, and it really leads to lower velocity in going from an idea to production. Even worse, it adds a lot of support and operations burden on the infrastructure teams managing these systems. And unfortunately, it is also hard to migrate these jobs or workloads off of the existing systems, because over time we've lost the context and knowledge about how they work; for what it's worth, they've just been running there for the longest time. So these are a lot of the challenges we had to think through in coming up with our strategy for the next few years. We have to support Pinterest's growing needs, in terms of its investments in the machine learning platform and on the data processing side, so we had to think through a more holistic solution.

One more key point to add: it's also very difficult to implement governance around these platforms. With such a diverse set of platforms, truly knowing who's using what, and what resources are being used, is extremely difficult. You would just see big chunks in your bill saying this was the number of EC2 instances used under certain tags, which makes it really hard to have meaningful ROI conversations with the actual users of those systems.

All right, so with that, I want to quickly run through our vision for the compute platform. This helped us ideate a bit more, and we came to the conclusion that our goal, regardless of orchestrator and regardless of which cloud we are running on, should be to give users the fastest path from idea to production without having to worry about anything in the underlying infrastructure. Our key focus areas were divided into three things. First, we wanted to simplify the end-to-end developer experience of building, deploying, launching, and operating not only a service, but also a data job or any other type of job.
Second, we wanted to provide an integrated infrastructure platform. Whether you're running a batch job or a long-running service, metrics, logs, health checks, and even job statuses all need to be provided to the user through a single experience; they shouldn't have to go through different tools just to get that data. And finally, given Pinterest's size and scale, it was extremely important for us to invest heavily in a proper governance platform, meaning we truly wanted to understand what controls and policies we could build into this compute platform to ensure we know who's using what on the system and who owns what, and can subsequently charge it back or apply relevant TTL policies around data storage, and so on. I'll give some more examples as I go through.

So the scope of the problem kind of exploded, from orchestration alone to all of these different pieces: setup, test and build, resource management, deploy, release, and operations. To give you an overview, this is our interpretation of what a developer usually goes through to build and launch a job, and these are also the individual problem areas that need to be solved. For what it's worth, over the last seven years Pinterest has built multiple solutions across many of these boxes. Like I said, they were built for specific needs at particular points in time and have just accumulated, to the point where integrating across all of them is extremely difficult. So from our perspective, this is really the problem domain for the compute platform, and our goal is to stitch a narrative around it and provide a solution. That said, we have to start from the bottom up, so it comes down to the cloud resources, the orchestration systems, and the primitives around containers or VMs, which need to be defined first before we can go anywhere higher.

Another piece of context: Pinterest has been running on virtual machines for the longest time. We run on bare EC2 instances, at least on the serving side, and each service runs on its own instances. With that in mind, one of the first things we had to figure out was how to containerize this. So this is a timeline from the initiation of the project to where we are at the moment. I'm going to quickly give you an overview and touch on one particular topic, and then Lida will cover the implementation details across this timeline.

Way before I even started at Pinterest, conversations about moving to containers kicked off in 2016. Between the end of 2016 and early 2017, the containerization project was scoped and implemented, in terms of deciding on Docker as our primary container image format and runtime, and then working through a host of solutions to integrate with AWS and our legacy implementations on the security side, the networking side, and so on. We then had to productionize Docker and adopt it across the company. I'm going to talk particularly about how we went from that phase to an orchestration system. Many companies are facing this today: how do you actually evaluate? I know many of you may already be using Kubernetes, so you're really not here to learn about this.
But I do want to give you an overview of the pain involved in going through this process and coming to a decision, because this is not something easily done at a company of this scale. The TL;DR is that we built an evaluation framework where we decided what the criteria were, what options we had, what POCs we needed to do, and subsequently what the outcome was. This was done in a span of eight weeks from an implementation perspective, but the discussions had been happening for more than two years.

So, the criteria for how we wanted to choose an orchestration system. The most important thing for Pinterest was literally the integration cost: what does it actually mean to move from the VM world into this new world? Specifically on the Docker support side, how do you support sidecars? Pinterest has a lot of sidecars running alongside services, both at the host level and at the application level; how can we support that? Then runtime extensibility and networking integration: Pinterest has already invested heavily on the networking side, using ENIs with security groups assigned to them, so we have a pretty robust system in place. How do we integrate with that in a way that is simple, easy to debug, and easy to reason about? The other aspects that were really important for Pinterest were resource and task scheduling: the flexibility around how we can place and co-locate tasks for some of our services, and the multi-tenancy aspect. This comes from the infrastructure governance piece I spoke about, because we know for a fact that many of our services are not effectively utilizing their instances, so we really want to push the limit on utilizing those resources effectively. The remaining criteria are the usual ones: scalability and performance of the orchestration system; stateful service support, which is extremely important for us given that a large portion of our infrastructure is about data processing; the ecosystem and the community; and finally cluster operations and support.

Our choices were... we actually had a lot more than this and had to condense it down to the ones we did a detailed POC on. Obviously Kubernetes was there. We also looked at Mesos with Aurora and Mesos with Marathon, and even considered building a custom scheduler. Mesos is a pretty battle-tested system, and we also have experience operating Mesos within the company for our real-time streaming platform. And finally, we looked at AWS's ECS offering, just to get a sense of the difference between a hosted solution and a self-managed one, and what each would mean for the teams operating it. I'm not going to get into the depths of the POCs; Lida will cover more details on that. But the TL;DR is that we found commonalities across many of these systems, and we did detailed POCs on each of them, especially on the networking and IAM side; again, Lida will cover that later in the presentation. And finally, the outcome was a conscious choice. We did see some shortcomings as part of making this choice, but we consciously decided we could work through many of them. For example, data processing, like I said, is a big thing for us, so big data support, in terms of running Spark and operating batch jobs, was one of those requirements.
We had to really think through where the community would be by the time we could productionize Kubernetes, and what our investments would be. So we are announcing that Pinterest is adopting Kubernetes for its container orchestration, as well as our commitment to work with the Big Data SIG and the AWS SIG to further the cause of enabling data processing on Kubernetes. With that, I'll hand it over to Lida to talk more about the implementation.

Hello. Okay, that's great. Thank you, Michael, and hello, everyone. I'm Lida, a software engineer on the Cloud Management Platform team at Pinterest. Michael has given you a high-level view of our goal and what we are doing, and I will drill down into the technical details of what we have done for our container platform and what we are currently doing for our container orchestration effort with Kubernetes.

This slide shows that before we did any containerization, we were running our stateless services directly on EC2 instances. We had layered AMIs, with Puppet managing the machine configuration and Teletraan, our in-house deployment system, managing service deployment. Beyond that, you had various process managers running the service. This setup had several pain points. Engineers had to be involved in AMI building; they had to learn how to write Puppet and, even harder, how to test a Puppet change; and Puppet runs have some unpredictability, which caused reliability issues. Engineers also had to learn how to write configuration for different process managers like monit, upstart, and supervisor.

This is the end result after we finished our Docker project. We still have a dedicated VM per service, but all the bits run in Docker containers. In this model we have a single AMI shared by all services; we got rid of the service-specific AMIs. Engineers don't have to learn how to write Puppet, and we have only one process manager, the Docker engine. This gives us immutable infrastructure and deterministic behavior in deployment and operations, which improved our service reliability a lot.

Next I'll talk in detail about what we are doing under the hood for this Docker platform. First, we introduced a Pinterest-specific service description language, which you can see at the bottom of this slide. There are several motivations behind it. First, we want engineers to specify only the configuration they care about; the underlying system can then expand that configuration and append things like the required mounted volumes or environment variables needed by the infrastructure, so engineers don't have to set them or understand how to set them. The other thing we wanted was an abstraction layer over the underlying technology: whether we chose Mesos or Kubernetes later on, engineers would not have to rewrite any of their service configuration. Also, on the developer workflow side, we did some work optimizing image builds on top of our large shared base image to reduce build time and image size.
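To make the idea of that spec expansion concrete, here is a minimal sketch in Python. The field names, default values, and merge behavior are all invented for illustration; Pinterest's actual service description language and the settings it appends are not shown in the talk.

```python
# Hypothetical sketch: expand a minimal, user-authored service spec by
# appending infrastructure-required settings. Field names (volumes, env,
# sidecars, etc.) are illustrative, not Pinterest's actual schema.

INFRA_DEFAULTS = {
    "volumes": ["/var/log/service:/var/log/service"],        # assumed log mount
    "env": {"SERVICE_DISCOVERY": "zookeeper", "METRICS_PORT": "9090"},
    "sidecars": ["metrics-agent", "log-forwarder"],           # assumed sidecars
}

def expand_service_spec(user_spec: dict) -> dict:
    """Merge the user's minimal spec with infrastructure defaults.

    User-provided values win; infrastructure-only settings are appended so
    engineers never have to know about them.
    """
    expanded = dict(INFRA_DEFAULTS)
    expanded.update({k: v for k, v in user_spec.items() if k != "env"})
    # Merge environment variables instead of overwriting them wholesale.
    expanded["env"] = {**INFRA_DEFAULTS["env"], **user_spec.get("env", {})}
    return expanded

if __name__ == "__main__":
    minimal = {"name": "pinboard-api", "image": "pinboard:1.2.3",
               "cpu": 4, "memory": "8Gi", "env": {"LOG_LEVEL": "info"}}
    print(expand_service_spec(minimal))
```

The point of the layer is exactly what Lida describes: the same small spec can later be translated to Mesos or Kubernetes objects without the service owner touching it.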
Lastly, for the Docker registry, we use ECR as our main registry, but we also run a replicated self-hosted registry alongside it; combined, they give us high availability in the production environment at our scale.

This is what the application runtime looks like under the hood. There are a bunch of sidecars around the application providing all kinds of services, and all of them run in containers. We provide a tool called TaliFig, which manages the dependencies among these containers. Your application itself might be composed of multiple containers, so TaliFig manages not only the dependencies between the sidecars and the application, but also the dependencies inside the application and even between different sidecars (a small sketch of this ordering idea follows below). We run in the host network mode, which gives us the best performance. For the Docker engine, we use live restore so that a Docker engine restart doesn't cause all the containers to be killed and restarted, and we run on the overlay2 file system. We also built garbage collection for local images: we keep a configurable number of images for every service on the host so that we can roll back and roll forward very fast. And lastly, we built parallel image prefetching. This is mainly because our deployment system takes a mutually exclusive lock on the host, so at any given time only one deployment can happen; for this Docker environment we want to pull all the images in parallel so deployments are faster.

After we built this platform, we started the migration sometime in June, and as of today, 75% of our stateless services are already on this Docker container platform. We spent a lot of time on the migration, and for the last few months our team has basically been working like this. There are several learnings I want to share about this migration process, about how to make sure there are no outages while migrating hundreds of services. The first is that you should invest in your tooling. You need a deployment tool that can run your containerized and non-containerized workloads at the same time and adjust the ratio between them, and you need tools to compare the metrics coming from the two workloads. Most importantly, these tools must be very easy to use, so that engineers outside the platform team can use them when they start migrating their services. You should also automate all the deployment, migration, and runtime configuration validation when you are doing this kind of big migration: make sure the IAM roles, security groups, and all the service discovery settings are correct. Another learning is that containers are a very fast-changing area, so you need to understand how your company and your team dynamics work so that you can build the team that can do this work most effectively. At Pinterest, we have an embedded SRE model, where the SREs sit and work together with the development team. This helped a lot in this project: whenever we hit a bug in the Docker engine that required upgrading the OS kernel and rebuilding the AMI, we got a very fast turnaround. And the last thing I want to call out is that you should try to migrate a complex service early.
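The talk doesn't show how TaliFig resolves container dependencies, but conceptually it is a startup-ordering problem. Here is a minimal sketch, assuming each container simply declares which other containers it depends on; the container names and graph are invented for illustration, and real concerns like health checks and restarts are omitted.

```python
# Minimal sketch of startup ordering for an application plus its sidecars.
# The dependency graph and container names are illustrative only.
from graphlib import TopologicalSorter

def startup_order(dependencies: dict) -> list:
    """Return a container start order in which every container starts only
    after all of its declared dependencies have started."""
    return list(TopologicalSorter(dependencies).static_order())

if __name__ == "__main__":
    # container -> set of containers it depends on
    deps = {
        "metrics-agent": set(),
        "log-forwarder": set(),
        "app": {"metrics-agent", "log-forwarder"},
        "service-registration": {"app"},   # register the app only once it is up
    }
    print(startup_order(deps))
    # e.g. ['metrics-agent', 'log-forwarder', 'app', 'service-registration']
```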
So we decided to migrate our most important service, our API fleet, which is one of our largest fleets and serves our mobile and web clients, at a very early stage. It takes a while, but it helps you cover every corner of the platform and make sure it's really production-ready. Meanwhile, it gives other engineers confidence to move their services onto the new platform: once we had moved this huge fleet, it was very easy to persuade them.

Next, I will talk about our current container orchestration effort: the orchestration evaluation and what we have built for Kubernetes so far. First, the evaluation. It's very hard work and very time consuming. We built POCs, meaning we took a demo service with all of our environment integrations and ran it on both Mesos plus Aurora and on Kubernetes. For Mesos, we evaluated the different containerizers and schedulers and what our executor strategy would be; we tried writing a customized executor in this POC. For Kubernetes, there is a long list of out-of-the-box features, and we actually tried and tested them in the POC. More importantly, there is a common set of things I think are very important when you try to build a container orchestration platform: networking, security, how you do service discovery, and how you do logs, metrics, and security management. At Pinterest, since we have been around for several years and have a large fleet, we already have solutions for some of these. Some solutions are easy to carry over to the container world, like service discovery: we use ZooKeeper as our service discovery platform, and as long as you give the container a routable IP address, it's a minimal code change. But other things require more work. The things we have built have been battle-tested and proven to support our current scale, and even if there are some glitches in them, we still want to keep the number of moving parts as small as possible. So today I will focus on two things, networking and IAM, and what we have done for them in our Kubernetes cluster, because I think those are the most relevant and useful to share.

Networking. Networking is the hardest part, and we spent a lot of time on it; even today I wouldn't say we have a fully satisfying option. Especially if you're running on AWS, I don't think there is a standard solution you should use today. So we built our networking layer to be flexible: we use AWS ENIs, but we also support different kinds of CNI plugins. Essentially, we use a dedicated ENI for our long-running services, the services whose pods have to receive incoming traffic. For those we use the ECS ENI plugin, which AWS has open sourced, and it gives a dedicated ENI to a running pod. The major motivation is that it's simple, and everything stays very close to what the VM world looks like. But it has a major limitation: the number of ENIs you can attach to a single EC2 instance is limited. At the moment, looking at our daily services, most of them run on instances with 8 or 32 cores.
We think it can serve us for a while, but meanwhile we are collaborating with AWS and looking at their recently released CNI plugin, and we are also tracking progress from Netflix and Lyft, who have shared-ENI strategies. In the future we may change this, but our infrastructure already supports different CNI plugins. Today we run our batch jobs and workers in bridge mode, which is very much like the Docker bridge, because they don't need to serve incoming traffic; as long as they can make outbound calls, that works. And we support IAM roles and security groups on the ENI.

This diagram shows in detail how the networking is set up when a pod starts. When a Kubernetes pod starts, Kubernetes creates the pod container and calls CNI. We wrote a proxy CNI plugin, which invokes a daemon running on the host. This daemon queries Kubernetes to get the pod spec; the user specifies the network mode in the pod spec. When the daemon gets the pod spec, it does the ENI management. Even though we use the ECS ENI plugin, we don't have the ECS control plane, so we had to build our own control plane to do the ENI allocation, and that lives in this daemon. The daemon looks at the current ENI state on the host and the network mode of the pod, and if the pod asks for a dedicated ENI, it allocates and assigns one if necessary. It then returns which CNI plugin should be invoked and with what input parameters, and the proxy CNI plugin invokes that plugin. Meanwhile, the daemon keeps reconciling the host state and updating the status in Kubernetes: we implemented a custom resource for the elastic network interfaces, and the daemon also keeps updating node labels so that new pods will not be scheduled onto a node once all of its ENIs have been used up.

The next thing is the IAM setup. We do something very similar to many of the existing solutions for Docker and other container orchestrators. We use a tool called drum, developed by the Pinterest security team, which acts as a metadata proxy. Drum listens on the host network interface, and whenever a pod starts, we configure iptables rules to redirect the pod's metadata calls to the local interface. Drum gets the request, checks the IP address, and talks to the local Kubernetes endpoint to get the pod statuses and configurations. One thing that's a little different here is that drum also calls out to what we call a role-assume service. The main reason is security: we don't want to give the base role all the assume-role permissions in the EC2 setup. This way, whenever a pod makes a metadata request, we redirect it to the local drum and it gets the right IAM credentials. This works for both the ENI and bridge modes. (Two simplified sketches of these mechanisms follow below.)

That's all the technical detail. Now I will hand it over to Michael to talk about our future roadmap. Thank you.

Thank you, Lida. We have more details on all of this, and we will be at KubeCon for the next two days; our team is here as well, so we'd love to talk with people working on similar problems, especially on AWS. With that plug, I want to quickly go back to the larger scope we are working towards.
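To make the flow Lida described a little more concrete, here is a highly simplified sketch of the decision the host daemon makes when the proxy CNI plugin calls it. All names (network modes, plugin names, fields, the per-host ENI limit) are invented for illustration; the real daemon also reconciles host state, updates the custom resource, and labels the node, which is omitted here.

```python
# Hypothetical sketch of the host daemon's "which CNI plugin, which args"
# decision for a newly started pod. Not Pinterest's actual implementation.

MAX_ENIS_PER_HOST = 8  # assumed per-instance ENI limit

class EniPool:
    def __init__(self, limit=MAX_ENIS_PER_HOST):
        self.limit = limit
        self.allocated = {}                      # pod name -> ENI id

    def allocate(self, pod_name):
        if len(self.allocated) >= self.limit:
            raise RuntimeError("no free ENIs on this host")
        eni_id = f"eni-{len(self.allocated):04d}"  # placeholder allocation
        self.allocated[pod_name] = eni_id
        return eni_id

def choose_cni(pod_spec, pool):
    """Return (plugin_name, plugin_args) for the proxy CNI plugin to invoke."""
    mode = pod_spec.get("network-mode", "bridge")
    if mode == "dedicated-eni":
        eni_id = pool.allocate(pod_spec["name"])
        return "ecs-eni", {"eni": eni_id,
                           "security-groups": pod_spec.get("security-groups", [])}
    # Batch jobs and workers that only need outbound connectivity.
    return "bridge", {}

if __name__ == "__main__":
    pool = EniPool()
    web = {"name": "api-pod", "network-mode": "dedicated-eni",
           "security-groups": ["sg-frontend"]}
    batch = {"name": "worker-pod"}
    print(choose_cni(web, pool))    # ('ecs-eni', {...})
    print(choose_cni(batch, pool))  # ('bridge', {})
```

Similarly, the metadata-proxy redirect that drum relies on can be sketched as an iptables NAT rule sending traffic bound for the EC2 metadata address to a local listener. The local port and the exact rule are assumptions for illustration, not the rules Pinterest actually installs.

```python
# Hypothetical sketch: redirect pod calls to the EC2 metadata address
# (169.254.169.254:80) to a local metadata proxy on an assumed port 9911.
# Requires root; shown only to illustrate the mechanism.
import subprocess

METADATA_IP = "169.254.169.254"
PROXY_PORT = "9911"   # assumed local drum port

def redirect_metadata_traffic():
    subprocess.run([
        "iptables", "-t", "nat", "-A", "PREROUTING",
        "-d", METADATA_IP, "-p", "tcp", "--dport", "80",
        "-j", "REDIRECT", "--to-ports", PROXY_PORT,
    ], check=True)
```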
Again, this is not a three-month or six-month project; we expect a lot more to come out of it. As we build out our orchestration platform, we also intend to add an abstraction for facilitating and orchestrating job deploys. Like I said earlier, we have multiple systems today that do that for batch jobs and long-running services, so one of our priorities is to consolidate them and have a single place for facilitating job deployments.

Some of the things we are planning for 2018: obviously, productionizing our Kubernetes cluster and onboarding some initial use cases. Our team has been working really hard on getting Jenkins and those kinds of workflows running on it, and we are also experimenting with running some non-critical long-running services. We are experimenting with running Spark on Kubernetes as well; I think that's going to be the biggest thing for us, and we intend to see progress there. At the same time, like I mentioned earlier, our job definition abstraction works pretty well. We want to normalize it across all of Pinterest and have a single job submission service that facilitates deployments, whether to Kubernetes today, or to Hadoop, YARN, or something else in the future. That way the spec stays the same; there is a translation happening underneath, and we also intend to do ownership checks and quota management at the job submission layer. And finally, to close out, the service identity portion: like I said, we want to make sure the identity of a service is validated and checked, that there is clear ownership of it, and we will also track resource metering both on Kubernetes and on all our other systems.

With that, I think we're out of time, so I'd like to close by saying thank you for being a wonderful audience. We really appreciate you coming.