Hello, my name is Kevin Lynch, and I'm going to be telling you about Kubernetes in the data center: Squarespace's journey toward self-service infrastructure. I've been working at Squarespace for about three years, so I've seen our entire transition from a monolithic application to microservices and on to Kubernetes.

Some background for those of you who may not be familiar: Squarespace was founded in 2003, and we provide a platform for people to host their websites and their commerce, and we host domains for them as well. We have over a million customers. Going back to that founding in 2003: it means we run out of our own data centers; we predate AWS's public launch. We have three data centers with over 5,000 hosts, and for a sense of scale, we generate about 40 million metrics every minute. There's a lot going on in these data centers all the time.

Back in 2013 we were fewer than 50 engineers, and we had a traditional stack: a monolithic application, some background jobs, and a database they communicated with. A lot of the world looked like this back then. These engineers' goal was to introduce new features, build the product, and grow fast, and what ended up happening was that whatever worked was what we went with.

We grew a little, and in 2014 we had about 75 engineers and realized that "whatever works" wasn't really working for us anymore. Engineers wanted to introduce new features, but there was too much firefighting; people were unsure of the reliability and stability of the code they were introducing, because there was so much going on in the monolith. So, like a lot of companies at that time, we started down the microservices route, and it was great. We introduced a few services and formed a very stable definition of a service at Squarespace.

By 2016 we had a hundred-plus engineers. The platform was scalable and reliable, developers could move faster, and Squarespace could move faster because of that. We had all these services communicating with each other and taking traffic from the internet, and all of these services weren't really working for operations.

Here's what our typical workflow for provisioning new machines in our data centers looked like, whether bare metal or virtualized. It started with a very manual process of finding resources (CPU, RAM, disk) wherever we could create the virtual machine. We had to assign networking, and make sure the networking worked. We had to add all of these hosts to our Ansible inventory, and only then could we kick off an automated process where we would update DNS, PXE-boot the VM, install the OS, and configure the OS with all of the dependencies we'd built up. We had to configure monitoring for the machines and install all the application dependencies, and finally we could install the application. Because of all the manual steps and the PXE booting, this process took about 30 minutes at best, and often a lot longer, because sometimes it was painful to find these resources, or we'd run into firewall issues and whatnot.
So there had to be a better way. The big takeaway from this was that static infrastructure and microservices do not mix. It was always difficult to find all these resources, and it was very slow to provision at scale. Because we run our own data centers, we do have pets in this environment: every machine is a special case, and we were trying to shoehorn a microservice "cattle" mentality onto all of these physical pets that we love. This system was also way too complex for any new engineer who came on board, whether they were development-focused or operations-focused. So we had to come up with a better solution.

In 2017 we had a little over 200 engineers, and we started down this path of self-service infrastructure. With the help of self-service compute, networking, metrics, and storage, operations is finally able to move as fast as our developers, and Squarespace can actually move as fast as we want it to.

For self-service compute we use Kubernetes. We all love containers; that's why we're here. It's very simple: you just run kubectl apply with a description of the service you want, and it just works. However, there are a lot of pain points that we ran into that we weren't aware of at the time.

To take a step back: the service definitions we deploy into Kubernetes closely match what we deploy as a service on a VM. We have a Java Spring Boot based service model that relies on a lot of the Netflix utilities for service discovery and request routing; we use Consul for service discovery and key-value storage; and we use Fluentd to ship all of our logs to a centralized ELK stack. We also assign resources to these Java services very similarly to how we assign them to VMs: each Java service is typically given two cores and four gigs of RAM. We do adjust this for certain services, but that's the default we provision for all of them.

However, we were running into problems with the Java microservices, and to really understand what was going on, we needed to understand how these pods are deployed to Kubernetes. In Kubernetes, each container maps to a cgroup, and each cgroup is given resources based on the Kubernetes requests and limits. We set requests equal to limits for all of our services, just to make things easier for us. Even so, we weren't seeing the same performance from the pods that we saw from the VMs, and to understand why, we needed to understand how the scheduler and the kernel enforce these resources. Kubernetes on Linux maps the CPU request to CPU shares and throttles the service based on the CPU limit via the CFS quota. The number of shares assigned to a container is its CPU request times 1024, and Kubernetes knows the total number of shares available from the total number of cores on the node.
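To make the kubectl apply step concrete, here's a minimal sketch of the kind of service description involved. The names, image, and labels are illustrative, not our actual manifests; the resource stanza matches the two-cores, four-gigs default described above:

```yaml
# Illustrative Deployment; names and image are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-java-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-java-service
  template:
    metadata:
      labels:
        app: example-java-service
    spec:
      containers:
      - name: app
        image: registry.example.com/example-java-service:1.0.0
        resources:
          # We set requests equal to limits for all services.
          requests:
            cpu: "2"      # cpu.shares = 2 * 1024 = 2048
            memory: 4Gi
          limits:
            cpu: "2"      # CFS quota: 200ms of CPU time per 100ms period
            memory: 4Gi
```

Applying it is then just `kubectl apply -f service.yaml`.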
We run 64-core boxes, so a whole node comes out to roughly 65,500 shares (64 × 1024 = 65,536). Each service is then throttled based on its limit: it's given an allotment of CPU time equal to the CPU limit times 100 milliseconds, measured over a 100-millisecond period. So if you assign a container roughly two cores, that turns into a CPU quota of 200 milliseconds per 100-millisecond period, meaning you can have two concurrent threads operating at the same time. As an example, on our 64-core machines a two-core Java process is given 2,048 shares out of 65,536, so it's guaranteed at least that proportion of the machine's resources, and it can run up to two threads' worth of CPU time in each period.

However, we were running into painful scenarios during stress testing where the world would just stop, and the reason was that the JVM garbage collector threads were using up all of the CPU quota. So we had to figure out why this was happening. When we analyzed the JVM, we saw that there were 64 garbage collector threads, 128 Jetty threads for handling HTTP traffic, and 64 JVM fork-join threads for parallel operations. These numbers seemed strange: they seemed to match the number of cores. So we had to figure out what was going on.

It turns out a lot of Java libraries configure themselves based on the number of available processors. Jetty does this; the JVM does this to size the fork-join pools and the GC thread pools; and various other libraries use it to automatically tune and scale the number of threads for a given operation. The JVM detects this by making a sysconf call for the number of online processors, and we found that this value is not restricted by cgroups: inside the container, the JVM still sees all 64 cores. So we had to figure out what to do there.

We came up with a solution of providing a base Java container that calculates the resources actually assigned to it. Thankfully, the cgroup filesystem is mounted inside every container and exposes these values, so we're able to essentially reverse-engineer the number of cores assigned to us just by taking the number of quota microseconds available and dividing it by the period. We can then automatically tune the JVM by passing in JVM flags that set these values explicitly. That didn't solve all the problems, though: we still needed a way to override the available-processors call itself. We did this with a small C shared library that overrides JVM_ActiveProcessorCount, which is ultimately what the JVM calls, and we shim it in with an LD_PRELOAD hook inside the container. So when the JVM asks for the available processors, we return the number of cores we've assigned to it, via an environment variable that the base container calculated. Now we have actual self-service compute, where the developer doesn't need to know what the underlying mechanism is. All they need to say is "I want a Java service with two cores," and they get exactly that.
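Here's a sketch of how that base-container trick can surface in a pod spec. The image, shim path, and variable names are hypothetical; the point is that the entrypoint derives the core count from the cgroup files and the preloaded library returns it from JVM_ActiveProcessorCount:

```yaml
# Illustrative pod; image, shim path, and env names are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: jvm-core-count-sketch
spec:
  containers:
  - name: app
    image: registry.example.com/base-java-service:1.0.0
    # The base image's entrypoint reads the cgroup files mounted in the
    # container, e.g.
    #   /sys/fs/cgroup/cpu/cpu.cfs_quota_us   -> 200000
    #   /sys/fs/cgroup/cpu/cpu.cfs_period_us  -> 100000
    # computes cores = quota / period = 2, exports it (e.g. CONTAINER_CORES=2),
    # and passes matching JVM flags such as -XX:ParallelGCThreads=2 and
    # -Djava.util.concurrent.ForkJoinPool.common.parallelism=2.
    env:
    - name: LD_PRELOAD
      value: /usr/local/lib/libnumcpus.so  # shim overriding JVM_ActiveProcessorCount
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 4Gi
```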
Now, going back to what our Java services look like: as I said, we use Spring Boot for the core of the service, and we use all the Netflix utilities around it. We use Netflix Ribbon for automatic retries and client-side load balancing directly in the application, so every Java service ships with all this logic, and we do circuit breaking with Hystrix and service discovery with Consul.

However, we were in a world where we wanted our VM infrastructure to still communicate with the pods we deploy into Kubernetes. Each of the Consul agents needs to communicate with the others for discovery, and the Java services ultimately need to communicate with each other directly. So we leveraged Calico for this. Kubernetes provides a pluggable Container Network Interface, and there are many options: Flannel, Calico, Weave, kubenet, and probably a lot more than I'm aware of. We ended up going with Calico because it gave us a few benefits for running in our data centers. Calico provides software-defined networking; we can use it to configure network policy (iptables access rules); and it spares us the need for network overlays in our data centers. That means no performance impact from encapsulation, no MTU overhead, and, because of our network setup, seamless ingress and egress.

In our data centers we've deployed what's referred to as a layer-3 Clos topology. This is a spine-and-leaf architecture where the leaves are all layer 3, so each rack communicates with every other rack over layer 3. This makes it very simple to reason about, because the spines aren't doing any processing, and each leaf is its own layer-2 domain: MAC addresses in one rack are never seen by any other rack. That makes it really easy to scale out, and network communication is predictable and consistent, because anything in one rack is at most two hops away from anything in another rack. This also gives us anycast support. All the work is performed at the leaf switches, and each rack is assigned its own BGP domain, which Calico relies upon. We get the added benefit of not having to worry about issues like spanning tree: there's no convergence time, and there are no loops that could be accidentally introduced.
Like I said, each rack has its own BGP domain, which means it has its own layer-3 network slice: we can assign a /24 subnet to each rack, and we can also anycast the same IP across all of them. That makes it a lot easier for services to talk to a single service IP and have it routed to any healthy instance that can respond.

So we use Calico to do BGP peering directly with the leaf switches, and we peer it with all of our Kubernetes nodes as well. This lets anything communicate seamlessly: a VM running in a rack under a different leaf switch can talk to any pod running in Kubernetes. In the typical Kubernetes architecture you have some masters and some nodes running pods; using Calico, we announce the pod IPs directly. These are represented as /26 subnets that Calico assigns to each node, and the Calico agent picks them up and announces them to the top-of-rack switch. Likewise, we can announce the service IP range directly: all we have to do is add those addresses to the loopback interface and tell Calico to announce them, and all of a sudden anything, whether it's inside Kubernetes or outside Kubernetes, can communicate directly with the service IP range. That makes it a lot easier for outside services or developers to get access. And likewise, we can assign an anycast address for the API server and bind the API server to it, so we don't need any middleman for communicating with the API server: we just bind each API server to this IP address and talk to that IP. That makes things a lot easier if one of the master nodes goes down; we don't lose any traffic to the API servers. So a developer on their laptop can communicate directly with a pod IP for any of the pods running in Kubernetes, with the master IP (which is anycast), or with a service IP.
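As a sketch of what that peering configuration can look like with Calico's BGPPeer resource (the node name, peer address, and AS number here are made up; each node would peer with its own rack's top-of-rack switch):

```yaml
# Illustrative Calico BGPPeer; all values are hypothetical.
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: node01-rack1-tor
spec:
  node: kube-node-01   # peer only from this node
  peerIP: 10.1.0.1     # the rack's top-of-rack leaf switch
  asNumber: 64512      # the rack's private BGP AS
```

A resource like this is typically applied with calicoctl rather than kubectl.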
So this gives our developers a really powerful form of self-service networking, without any interaction with us. The other benefit is federation: because we have multiple data centers, the pod network in one data center can communicate seamlessly with the pod network in another, and using the Federation API we can control both of them at the same time. And we can leverage network policies as well, to automatically assign firewall rules for these services if we want them.
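A minimal sketch of such a policy might look like this (the namespace, labels, and port are hypothetical; Calico enforces the policy as iptables rules on each node):

```yaml
# Illustrative NetworkPolicy; names and port are hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-example-client
  namespace: example-team
spec:
  podSelector:
    matchLabels:
      app: example-java-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: example-client
    ports:
    - protocol: TCP
      port: 8080
```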
The other self-service tool we decided to use was Prometheus. Historically we used Graphite for all of our VM-based metrics, and that had some problems. Metrics are sent either from the application or from collectd on the system side and forwarded to Graphite. Unfortunately, we could easily run into scenarios where an application developer added a new endpoint that generates metrics, and we'd get an explosion of metrics as it was deployed across all of our services. All of these metrics are aggregated in Graphite, where each metric ends up becoming its own file, which is slow to create across the cluster and also really expensive to clean up. So with all of these ephemeral pods, we can't really rely on it: we'd end up overloading the Graphite cluster and bringing it down. Graphite has some other problems as well: there's loss of precision when it does aggregated roll-ups of metrics, and it's inefficient at doing aggregated calculations across all of them.

We then used Sensu to do alerting, based on the Sensu client that runs on each host, and we could also have it query Graphite. However, we were running into problems with this too. The application and the system alerts are very tightly coupled, and it becomes really difficult to route them. What happens when the application goes down: who do we route the alert to, the application developer? What happens when the whole host goes down: do we route that to the application developer, or to the team in charge of running that host? And what happens if the whole hypervisor goes down: who do we alert? We really wanted our application-level alerts to be service-level based. On top of that, pairing Sensu with Graphite made checks really confusing to create, because there was this separate, centralized repository that devs had to go into to add all of these checks, and the checks were actually really expensive, because doing aggregated checks on Graphite is painful.

Enter Prometheus. Prometheus ties in really well with our Kubernetes infrastructure. It gives us automatic discovery of all of our containers, whether by using the Kubernetes API to find pods or by talking to Consul to discover any VM-based application. There's no loss of precision, because it just keeps appending samples, and it's really great at storing tagged data: we can tag metrics with the service, the pod it's running on, the endpoint we're collecting from, and so forth. That also makes it really great for ephemeral instances: when a new pod comes up, all that changes are the tags being generated for that metric, so everything is ultimately aggregated in the same Prometheus series.

We use the Prometheus Operator for this. Each team that wants to deploy services to Kubernetes gets its own namespace, and we give them their own Prometheus instance as well. This gives us a few benefits, because it separates out each team's metric collection: when one of these Prometheus instances goes down, we don't lose metrics for everything; we just temporarily lose the metrics for that one team. The Prometheus Operator controls all of these Prometheus instances, each of which collects metrics from that team's services, and we provide a centralized Alertmanager, also configured through the Prometheus Operator, that routes alerts to PagerDuty. At that point, all the service owners have to do is define their own alerts for their services, as Prometheus alerting rules. These are service-level checks: for instance, to watch the error rate of a given service, all you have to do is check for 5xx response codes, and if the rate is high for five minutes, we send a page to whatever team owns it saying this is critical, please look at it.
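A rule along those lines, expressed as a Prometheus Operator PrometheusRule, might look like the following sketch (the metric name, labels, and threshold are illustrative, not our production rules):

```yaml
# Illustrative alerting rule; metric name, labels, and threshold are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-java-service-alerts
  namespace: example-team
spec:
  groups:
  - name: example-java-service
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{service="example-java-service",code=~"5.."}[5m]))
          /
        sum(rate(http_requests_total{service="example-java-service"}[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
        team: example-team   # Alertmanager routes on labels like this to the team's PagerDuty
      annotations:
        summary: "example-java-service 5xx rate above 5% for 5 minutes"
```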
The final tool we provide for our developers is self-service storage. Historically we had a centralized NFS cluster for all of our files, and we were running into a lot of problems with it. It was very difficult to spin up a new VM-based service with access to it; all of the access controls were manually configured on the NFS cluster. Depending on the application, we could also often run into things like file-locking issues, where some service would lock a file, die, and then be unable to come up again because the file was still locked on NFS. We also leveraged ESX local storage, but that had some problems too: slow migrations (if we needed to move the VM elsewhere, we had to transfer all the data from one host to another) and no replication. So in order to really provide self-service storage infrastructure, we needed something else.

That's where Ceph comes in. Ceph gives us the ability to deploy on commodity hardware, so we don't need these expensive NFS machines, and all we have to do is scale out the system whenever we need more storage. Ceph provides automatic replication, and it provides multiple access patterns as well: we can access block storage directly, we can use CephFS, which is very similar to NFS, and we can access an object store through an S3-compatible API.

To rely on this in Kubernetes, we have automatic provisioners based on the storage class concept. All we have to do is define a default storage class for the block device and provide a provisioner that can talk to Ceph and create these block storage devices as they're requested. Then all a developer has to say is: I need a StatefulSet, and here's the persistent volume claim that goes along with it, and Kubernetes and the automatic provisioner take care of the rest. The RBD provisioner detects the creation of the persistent volume claim, creates the block device, and allows the StatefulSet to mount the persistent volume, which is mapped directly to that block device. It manages the lifecycle as well: when you get rid of the StatefulSet, the block storage is cleaned up too.
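Here's a sketch of that flow: a StatefulSet whose volumeClaimTemplates reference a storage class backed by the RBD provisioner. The names, image, storage class, and sizes are illustrative:

```yaml
# Illustrative StatefulSet; names, image, and sizes are hypothetical.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-db
spec:
  serviceName: example-db
  replicas: 1
  selector:
    matchLabels:
      app: example-db
  template:
    metadata:
      labels:
        app: example-db
    spec:
      containers:
      - name: db
        image: registry.example.com/example-db:1.0.0
        volumeMounts:
        - name: data
          mountPath: /var/lib/example-db
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: ceph-rbd   # storage class served by the RBD provisioner
      resources:
        requests:
          storage: 50Gi
```

The provisioner creates the backing RBD image when the claim appears and, with a delete reclaim policy, reclaims it when the claim is removed.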
We rely on this for several tools. All of the Prometheus instances use these persistent volume claims to store their metric data for some amount of time, and we also have deployments of MongoDB and Postgres relying on this; they've been running great for us so far. Likewise, we can rely on CephFS for shared storage: it gives multiple services the ability to access the same NFS-like pool by creating a single PVC that any service instance can use, with the CephFS provisioner creating a shared mount that all of them can access to share data. The final access pattern, of course, is for a service to communicate directly with the S3-like API: all it has to do is create a bucket and access whatever is in it, completely separate from the Kubernetes automatic provisioners.

So, with the help of Kubernetes, Calico, Prometheus, and Ceph, we're able to provide self-service infrastructure for all of our developers, and we've seen a lot of benefits from this. Our existing services migrate very quickly: all it takes is a couple of hours of work to migrate a service, verify that it's healthy, and create the appropriate Prometheus dashboards for it. We're also seeing a lot more adoption from developers: there are about 20 new services that I'm aware of being planned for Q1, which is a lot more than we've seen before in one quarter. And we're seeing true microservice adoption: a lot of developers want to spin up small experiments that they otherwise wouldn't have been able to run in our infrastructure, because of the long process for creating all of those VMs. So, finally, Squarespace is able to move as quickly as we want it to. Thank you for listening. Are there any questions?

Yes, we run the databases in Kubernetes. We have StatefulSets for Mongo, and we've got StatefulSets for Postgres, and like I said earlier, all they have to do is define that PVC and they have their storage. So when the pod goes down, it's just migrated to another host and comes back online with the same data.

We run all of the Prometheus instances in the same cluster, and we federate all of the metric data up to an external system that has a lot more capacity for storage.

We don't abstract that away. Right now the service developers are writing the Kubernetes YAML descriptions themselves, but we do have a generator tool for new services that will generate an entire Spring Boot application, and along with that we generate the prescribed deployment and alerts for those services.

I'm sorry, I didn't catch that. With the network policies, you mean? Yes, it's connected that way: there's redundancy at the network-connection layer, and there's redundancy in each of the leaves communicating with two spines.

On Spring Boot: we're not relying on Consul for the configuration itself. When services are deployed as VMs, we just use a regular YAML config file deployed to each of those hosts along with the code. We can also use environment variables, which is what we use in the Kubernetes environment. If there's any information the service wants itself, it communicates directly with the Consul key-value store.

We haven't had the need yet to really squeeze performance out of it, so we haven't gone down that path. We just wanted to make sure there was no performance loss moving to Kubernetes. I think that's something we're going to explore later on: how small can we get these services, and what's the trade-off between instance size and number of instances? Because we're dealing with the JVM, which isn't the friendliest of beasts.

No, the memory size seemed fine. We pass it in very similarly: in that base container we calculate the amount of RAM assigned and scale the heap appropriately. It's tunable, but I think by default we give the heap 50% and allow 50% for non-heap, and that's tunable per service as well.

No, we didn't go down that path. We did use Mesos for a little bit; our data team was using it as kind of an internal test, and that was right around the time Kubernetes was really becoming popular, so we saw a lot of benefit in switching to it early on. No, we didn't. Thank you.