Hello. Thank you for coming. It seems like I'm closing KubeCon, last session of the last day. My name is Layla. I work at Shopify as an infrastructure engineer. I'm on a team of eight engineers called the Search Platform team, who develop and maintain the infrastructure that powers search at Shopify. Today I'll be talking about search at Shopify, which is a highly available platform for data resilience and compliance. In the next 20 minutes or so, we will learn how the Search Platform team at Shopify hosts the search infrastructure on top of Kubernetes. We will talk about the entire data pipeline that writes data from SQL to Elasticsearch through Kafka. I will talk about the system requirements for search, such as high availability, scalability, and globalization, and how we designed and implemented a platform to achieve them. And I will conclude my talk with a Q&A. Introducing Shopify, for those who don't know about us: Shopify is a cloud-based commerce platform that lets you start and manage a business by allowing you to create and customize an online store and manage inventory, payments, customers, and so on. Currently, we have over three million businesses using our platform. We have merchants like Gymshark and Fashion Nova selling their products with us. We are present in over 170 countries around the world, handle about 10% of total U.S. e-commerce, and have seen over $400 billion in global commerce activity. One of the largest events in commerce, especially in North America, is Black Friday Cyber Monday week. We call it BFCM. To share more stats about the scale of Shopify: during the BFCM of 2023, which was a few months ago, Shopify processed 145 billion requests in a day, peaking around 60 million requests per minute, which led to a total of $4 billion in GMV. Search is a fundamental part of any commerce platform, allowing buyers to search and filter products, and also merchants to fulfill orders and manage their customers.
When you go to any online store and, for example, search for a product, your request goes to a search engine backed by a secondary data store, which is different from traditional databases. We call this secondary data store the search infrastructure. The next couple of slides are a quick refresher on the two key technologies we'll be talking about throughout this talk. The first one is Elasticsearch. It is a distributed text search and analytics engine built on top of Lucene that has full-text search capabilities well suited to the e-commerce domain. It's also a scalable and fault-tolerant system, and it handles a lot of the hard parts of running a distributed system very well. The other one is Kafka. It is an open source distributed streaming platform that is used for building real-time data pipelines and streaming applications. Kafka is designed to handle large volumes of data in real time and provides a scalable and fault-tolerant architecture for processing and storing data. It is commonly used for use cases like log aggregation, real-time analytics, and event-driven architectures. Kafka is built around the concept of topics, which are streams of data that can be partitioned and replicated across multiple nodes in a Kafka cluster. Producers can write data to a topic, and consumers can read data from it in real time. At Shopify, Kafka is our main messaging service. So when we build applications that need to pass large volumes of data to each other, we use Kafka. In our use case, we rely on Kafka to receive Elasticsearch documents that are built from SQL records, and we consume them and store them in Elasticsearch. Shopify is a large and complex technical ecosystem made up of a variety of apps and services. The biggest service at Shopify is called Shopify Core. It is a big Rails monolith that powers all of our merchants and storefronts. We run Shopify in multiple GCP regions for resiliency.
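To make the producer/consumer flow concrete, here is a minimal pure-Python sketch of the idea, not real Kafka client code: a partitioned topic where records keyed by document id keep their order within a partition, and a consumer that drains the topic into an in-memory stand-in for an Elasticsearch index. All names here are illustrative.

```python
class Topic:
    """Toy stand-in for a partitioned Kafka topic (not the real client API)."""
    def __init__(self, partitions=3):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key, value):
        # Records with the same key land in the same partition,
        # so updates to one document keep their order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))

    def consume_all(self):
        # A real consumer polls continuously; here we just drain everything.
        for partition in self.partitions:
            yield from partition

# SQL change records become documents flowing through the topic.
topic = Topic()
topic.produce("product:1", {"title": "T-shirt", "price": 20})
topic.produce("product:2", {"title": "Mug", "price": 10})
topic.produce("product:1", {"title": "T-shirt", "price": 25})  # later update

# The consumer writes each document into an in-memory "index" keyed by doc id.
index = {}
for key, doc in topic.consume_all():
    index[key] = doc

print(index["product:1"])  # the latest update wins within a partition
```

In the real pipeline the consumer issues bulk writes to Elasticsearch, but the key property is the same: per-key ordering inside a partition means the index converges to the latest version of each record.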
What you're looking at here is a high-level structure of Shopify Core. The entire Shopify Core in a region is broken down into over a hundred logical groups of SQL databases and instances of other services that are required to run Shopify. These groups power all of our online store functionality for shops running in that region. We run one instance of Elasticsearch for Shopify Core that provides search functionality for the shops in the region. To provide more details, we have an ingest pipeline in each region that consists of Kafka topics and Kafka consumers. We produce records from all SQL instances in the region to our Kafka topics, and our consumers pick up those messages from the topics and write them to the right Elasticsearch index in real time. For simplicity, I will refer to all of those groups of services as Shopify Core from now on. To give more context, Elasticsearch is used as a secondary data store. It's utilized to provide additional functionality, such as fast search capabilities, aggregations, or full-text search, on top of the existing dataset that's stored in the primary data store. Indexing is a term we will use a lot in this talk. It is the act of writing documents from the primary data store to Elasticsearch. We saw the ingest pipeline in the previous slide that brought data from SQL, as the primary data store, to Elasticsearch through Kafka. On Search Platform, we have built two different indexing pipelines for two different write profiles. The first one is called the real-time pipeline, which, for example, indexes a product to Elasticsearch and makes it available for search for buyers when a merchant creates a new product. The other one is called the re-index pipeline. Well, we have a lot of developers at Shopify. They create indices. They modify index analyzers and add or remove fields from them.
When that happens, we need to migrate all that data to a new version of the index and promote that index to make the new features available to merchants and buyers. We call this re-indexing, which basically means building an entire Elasticsearch index from SQL. At Shopify, we run our fleet of Elasticsearch clusters on managed Kubernetes clusters, GKE on top of GCP, and we deploy and maintain them using a custom Kubernetes controller that we have built. A Kubernetes custom controller allows users to define and implement their own custom logic for managing and reconciling resources in a Kubernetes cluster. Custom controllers are usually designed to manage custom resources, and custom resources enable users to extend the functionality of Kubernetes by defining new types of resources that are not part of the core Kubernetes API. As mentioned before, we have designed and implemented a custom controller for search that manages a custom resource we have defined called Elasticsearch, which is basically the desired configuration for an Elasticsearch cluster. Our custom controller watches for new Elasticsearch resources or any updates made to existing ones, and it will create many Kubernetes native resources to build an Elasticsearch cluster. For example, it creates stateful sets to store the search data, and a few deployments to provide monitoring and observability features for the created Elasticsearch cluster. It also creates the right services, ingresses, and certificates to provide secure access to Elasticsearch. Here you can see an example of an Elasticsearch custom resource. This specific Elasticsearch definition requires two separate node pools: one for storing data, and the other for coordination. Once this custom resource is applied to the Kubernetes API server, the custom controller will create all the Kubernetes native resources you see here.
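The reconcile step described above can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not Shopify's actual controller or schema; the field names (`nodePools`, `replicas`, `storage`) are assumptions made up for the example.

```python
def reconcile(elasticsearch_cr):
    """Toy reconcile step: map a desired Elasticsearch custom resource to the
    native Kubernetes resources the controller would create. Field names are
    illustrative, not a real controller's schema."""
    resources = []
    name = elasticsearch_cr["metadata"]["name"]
    # One stateful set per requested node pool (data, coordination, ...).
    for pool in elasticsearch_cr["spec"]["nodePools"]:
        resources.append({
            "kind": "StatefulSet",
            "name": f"{name}-{pool['name']}",
            "replicas": pool["replicas"],
            "storage": pool.get("storage"),
        })
    # The controller also creates a Service so clients can reach the cluster.
    resources.append({"kind": "Service", "name": name})
    return resources

# A desired cluster with two node pools, as in the slide.
desired = {
    "metadata": {"name": "core-search"},
    "spec": {"nodePools": [
        {"name": "data", "replicas": 3, "storage": "4Ti"},
        {"name": "coordination", "replicas": 3, "storage": None},
    ]},
}

created = reconcile(desired)
statefulsets = [r for r in created if r["kind"] == "StatefulSet"]
print(len(statefulsets))  # 2 stateful sets, one per node pool
```

A real controller would diff this desired state against what already exists in the cluster and create, update, or delete resources to close the gap, looping on every change event.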
We also referred to them in the previous slide, and this whole thing will create our Elasticsearch cluster. You can see here that the controller created two stateful sets, each of them having three replicas, as was requested by the Elasticsearch custom resource. This slide is a refresher on the structure of Kubernetes stateful sets. A stateful set config includes a pod template as well as a volume claim template that defines the type of storage that should be provided to each pod. When a stateful set is deployed, according to the number of replicas defined in its config, it creates pods based on the pod template, requests storage based on the volume claim template for each of those pods, and assigns it to them. In this slide, you can see one of the data stateful sets that is built for Shopify Core Elasticsearch. These Elasticsearch clusters are our biggest ones, containing 156 replicas, each of them having 4 terabytes of data. Our custom controller manages one to many Elasticsearch clusters. They can be large ones that have 260 nodes, and they can be smaller ones with just three nodes. At the scale that Shopify operates, and with the impact that search has on merchants' revenue, its infrastructure should be designed to meet certain requirements. Search needs to be highly available to ensure that outages do not cost merchants revenue. It needs to be scalable to provide the service when the load increases, and to be available for merchants and buyers across the world. High availability, our first requirement, refers to the design and implementation of systems that are continuously operational and accessible to users over extended periods of time, typically with minimal downtime. In a highly available system, redundancy, fault tolerance, and failover mechanisms are employed to ensure that the system remains operational in the event of failures. Failures happen at different levels. One level is the system level.
A system failure can mean a VM or just a disk crashing, or even one of the Kubernetes pods that run Elasticsearch, in this case, crashing. Another level of failure is regional failure. Events such as natural disasters or a faulty deployment can bring down an entire region that runs Elasticsearch. The main step towards availability and fault tolerance is redundancy. One might think that since Kubernetes stateful sets manage multiple pods based on an identical container spec, running a stateful set with many pods will automatically provide high availability just by adding redundancy, but this is not really true. Consider the stateful set with three pods here. Fault tolerance means that the application is able to provide the same service even if one of those pods fails, but that's not true for stateful sets. Depending on the application the stateful set is running, each persistent disk has a different set of data. So if pod zero fails, for example, although pod one and pod two are still running, they will not have access to pod zero's data, and therefore they cannot provide the same service that pod zero was providing before the crash. So the service will not be available until pod zero is recovered by Kubernetes. In other words, although we have redundancy for the pods, we don't have any redundancy for the data stored on the disks, and the lesson learned here is that Kubernetes does not provide data redundancy out of the box; it's up to the application to replicate the data. Taking Elasticsearch as an example, the stored data is sharded by Elasticsearch, and each shard can have multiple replicas. Elasticsearch has a mechanism to distribute the primary and replica shards across the disks in a way that if one of them fails and we lose access to some primary shards, there is at least one disk that is able to provide the same service by promoting the replica shards to primary shards.
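The promotion mechanism described above can be modeled in a few lines. This is a deliberately simplified toy, assuming one primary and one replica per shard on hypothetical pods; real Elasticsearch allocation is far more involved.

```python
# Toy model of Elasticsearch's shard redundancy: each shard has a primary
# copy and a replica copy on different nodes. If a node holding a primary
# fails, the surviving replica is promoted, so the data stays available.
shards = {
    # shard id -> {pod name: role}
    0: {"pod-0": "primary", "pod-1": "replica"},
    1: {"pod-1": "primary", "pod-2": "replica"},
    2: {"pod-2": "primary", "pod-0": "replica"},
}

def fail_node(shards, dead_node):
    """Remove a dead node's shard copies; promote replicas where needed."""
    for copies in shards.values():
        role = copies.pop(dead_node, None)
        if role == "primary":
            # The surviving copy takes over as primary.
            survivor = next(iter(copies))
            copies[survivor] = "primary"

fail_node(shards, "pod-0")

# Every shard still has a primary on a live node, so search keeps working,
# unlike a plain stateful set where pod-0's data would simply be unavailable.
primaries = {sid: [n for n, r in copies.items() if r == "primary"]
             for sid, copies in shards.items()}
print(primaries)
```

The contrast with the bare stateful set is the point: the redundancy lives in the application's data layout, not in Kubernetes itself.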
Looking more closely at the infrastructure, we have zone-aware Elasticsearch clusters, meaning that we deploy our Kubernetes cluster across three availability zones, and when Elasticsearch distributes primary and replica shards, it distributes them across these zones in a way that the primary and replica for one shard do not end up in the same availability zone. We use node affinity rules to ensure that two pods from the same Elasticsearch cluster do not get scheduled on the same node. And for the Shopify Core Elasticsearch, we taint the GKE nodes so only the Elasticsearch pods are deployed on them, and the Elasticsearch pods are given the right tolerations so they can be scheduled on the right GKE nodes. With this zone-aware setting, we can also do maintenance more quickly, as we can bring down an entire availability zone for maintenance without being worried about data loss. Let's take another look at an example GCP region where Shopify Core and its Elasticsearch cluster run. We have the ingest pipeline updating Elasticsearch when changes are made to SQL records. We can see here that search queries are sent to Elasticsearch through a routing layer, meaning that queries made to shops in a certain region are routed to the Elasticsearch in the same region. We mentioned before that the first step towards high availability is redundancy, and to mitigate regional failures we should replicate our system to another region. The active SQL instances and Kafka topics in different regions hold different data sets, and we need to replicate their data between regions so both Elasticsearch clusters will have the same data set. With this inter-region data replication, if one of the Elasticsearch clusters goes down, we can fail over the query traffic to the Elasticsearch cluster that is functional until the other one has recovered. Another requirement for our system is scalability. We need to be able to adapt to changes in the load put on our infrastructure.
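The zone-aware placement rule can be illustrated with a small sketch: place each shard's primary and replica so the two copies never share an availability zone. Real Elasticsearch does this via shard allocation awareness; the pod and zone names below are made up for the example.

```python
# Hypothetical pods and their availability zones.
zones = {"pod-0": "zone-a", "pod-1": "zone-b", "pod-2": "zone-c"}

def place(shard_count, zones):
    """Assign (primary, replica) hosts per shard, never sharing a zone."""
    placements = []
    pods = sorted(zones)
    for shard in range(shard_count):
        # Spread primaries round-robin across pods.
        primary = pods[shard % len(pods)]
        # Pick a replica host from a different zone than the primary.
        replica = next(p for p in pods if zones[p] != zones[primary])
        placements.append((primary, replica))
    return placements

for primary, replica in place(3, zones):
    # Losing one whole zone leaves the other copy of every shard intact.
    assert zones[primary] != zones[replica]
    print(primary, "->", replica)
```

With this invariant, taking an entire zone down for maintenance is safe: every shard still has a live copy in a surviving zone.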
An example of a high-load event that happens regularly is running re-indexes, which often means writing a large number of documents to Elasticsearch to build an entire index. The indexing rate during a re-index can peak at 500,000 documents per second. Another example of increased load on our infrastructure is high-volume commerce events like flash sales or BFCM, as I mentioned before. These events can cause a high indexing rate of 100,000 documents per second. Scaling up compute resources and releasing them for stateful systems is not a straightforward task, and we often pre-provision CPU and memory before these events start. These events also impact storage utilization, and Elasticsearch can get full and block incoming writes if we don't allocate more storage to it. We handle storage scaling by adding the logic to our custom Kubernetes controller. Here, let's take a look at how our custom controller handles storage auto-scaling. What you see here is an example Elasticsearch stateful set that has 120 replicas, each of them having 4 terabytes of data. Part of the volume claim template in the stateful set spec defines the storage class that should be used for the volumes attached to the stateful set pods. Kubernetes provides a feature called allow volume expansion for storage classes; if it's set to true, the disks created based on that storage class will have the ability to be seamlessly expanded. As I mentioned before, our custom controller manages Elasticsearch custom resources and reacts accordingly if an Elasticsearch resource is updated. By modifying the Elasticsearch resource and changing the volume size, for example, from 4 terabytes to 8, we tell the controller that the disks for the Elasticsearch stateful set need to be resized. The controller updates the Elasticsearch stateful set with the new disk size and consequently updates the volumes attached to the Elasticsearch stateful set pods.
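The storage-scaling step can be sketched as a small function: compare the desired volume size on the custom resource against the current claim and emit the patch the controller would apply. The sizes and field shapes below are illustrative; the real constraints are that the storage class must set `allowVolumeExpansion: true` and that Kubernetes volumes can only grow, never shrink.

```python
# Size parsing for the two units used in the examples ("Gi", "Ti").
UNITS = {"Gi": 1, "Ti": 1024}

def to_gib(size):
    # "4Ti" -> 4096, "500Gi" -> 500
    return int(size[:-2]) * UNITS[size[-2:]]

def storage_patch(current, desired, allow_expansion):
    """Return the volume-claim patch needed to reach the desired size,
    or None if there is nothing to do. Illustrative field names only."""
    if not allow_expansion:
        raise ValueError("storage class does not allow volume expansion")
    if to_gib(desired) < to_gib(current):
        raise ValueError("volumes cannot be shrunk")
    if to_gib(desired) == to_gib(current):
        return None  # nothing to reconcile
    return {"spec": {"resources": {"requests": {"storage": desired}}}}

# Doubling the disks, as in the 4 TB -> 8 TB example from the talk.
patch = storage_patch(current="4Ti", desired="8Ti", allow_expansion=True)
print(patch)
```

The controller applies this kind of patch to each volume claim, and because expansion is enabled on the storage class, the cloud provider resizes the disks in place with no drain-and-replace cycle.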
Since these volumes have the allow volume expansion feature enabled, they seamlessly get resized to larger volumes by Kubernetes. Before using this feature, we used to replace disks one by one, draining every single disk, deleting it, and creating a new one. Using this feature allows us to scale up the Shopify Core Elasticsearch cluster in about 10 to 15 minutes, while it used to take about 6 to 7 hours. Our merchants and buyers are spread across the globe today, and we are present in over 170 countries around the world. Before COVID, Shopify ran in three GCP regions in North America only, one in Canada and two in the US, which served the online store functionality for all merchants and buyers. After COVID, many businesses went online, and a lot of sales moved online as well. This led to an increase in the number of merchants and buyers across the globe, and latency, data locality, and compliance with geographical jurisdictions became constraints. So Shopify decided to expand to other regions of the world. Today, in addition to North America, Shopify runs in Europe, Australia, and Singapore to reduce latency and provide a better quality of service to merchants and buyers. The Search Platform has followed the same pattern, and we have brought our infrastructure closer to our clients to provide a better search experience. Search Platform runs many Elasticsearch clusters across the globe. You see them here. Each small square you see in this diagram is an Elasticsearch stateful set pod, and as you can see, many instances of our custom controller are managing and deploying Elasticsearch stateful sets in different regions of the world. They are marked in red here. As of today, Search Platform manages over 100 distinct Elasticsearch clusters, some of which are as large as 216-node clusters, and many are as small as 3-node clusters that provide search for different apps at Shopify.
Together they store more than 3 petabytes of data. And just so you know, one instance of Shopify Core Elasticsearch has around 400 billion documents using around 400 terabytes of data. The indexing rate for the real-time pipeline can peak at 90,000 documents per second, and for the re-index pipeline, it peaks at 500,000 documents per second. So to summarize, in this talk we reviewed that redundancy is key for high availability, and we learned that we cannot rely on Kubernetes to provide high availability for stateful systems out of the box. We also looked at the search infrastructure at Shopify and how it provides redundancy by using a search engine like Elasticsearch that has built-in fault tolerance, and also by replicating data between GCP regions. Shopify's custom controller for search was also introduced, and we saw how it can help scale storage for stateful sets in production. Thank you all. And if you have any questions, I think there are two mics over there. If you have any questions, you can ask them.

Thank you for the talk. It was very interesting and very engaging. Also interesting to see how that is working within Shopify. A question: if I recall correctly, there is an official Elastic Cloud on Kubernetes operator which supports a lot of the functionality that you were describing. I'm curious what prompted you to implement your own operator. Was there a specific set of features you were missing, or something else?

We started building the controller, I believe, before Elastic started presenting that, and it's also about the licenses. We had already been working on it, and it was working really well, and we didn't find the need to go with the enterprise solution.

Thank you for the talk. I want to ask how often you have a catastrophic failure and need to recover data from the shards.

How often do we fail over traffic? Basically whenever we are paged.
It happens that, for example, a large flash sale is happening and a lot of orders are coming in, and indexing on a specific node goes high. That somehow stalls other writes coming into the Elasticsearch cluster and impacts other merchants. That's one example. We fail over traffic from one region to another to mitigate the impact for other merchants. You also had a question about shards?

No, but that was the answer. Thank you.

Thank you for the talk. One more question about the pipeline. Do all applications only write to MySQL and wait for the data to get into Elasticsearch? Or, if say the application needs the data right away, do they write to both and eventually reconcile in Elasticsearch?

Yeah, so the pipeline that I shared was specific to Shopify Core. For that, our only source of data is SQL. So nothing else writes to the Kafka topics that we consume from to write to the Shopify Core Elasticsearch. But as I showed, we do have a lot of other teams at Shopify, other than Core, that use search and use Elasticsearch. They are free to use other sources, and they are a bit more low profile, so they are free to write in any way they want to their Elasticsearch cluster. They are isolated, so not everybody at Shopify uses Kafka to write to Elasticsearch. It depends on how fault tolerant they need their system to be, because Kafka helps: if things go down, let's say your Elasticsearch goes down, then depending on the retention for your Kafka topics, you will still have around 24 hours of data in Kafka. It really depends on the use case of the Elasticsearch.

Okay, thank you. I also had a question about the retention, but thank you.

Yeah, thank you. Hi, thank you for the presentation. Just one clarification: when you showed the overview of the clusters in Europe and the US, etc., you gave a lot of statistics, and one of them was, I think, 400 million documents per shard.
Could you repeat that one? Because I couldn't get it.

So, take a look at the one at the top left. That's one instance of Elasticsearch, the one for Shopify Core. In that entire Elasticsearch cluster, we are storing 400 billion documents.

Billion?

Yes. And in terms of storage, it takes around 400 terabytes. That's only for that large one at the top left corner.

Great. Thank you.

Yeah, thank you all so much. Have a good day.