Good afternoon. Can everybody hear me clearly? Good. Some housekeeping: there are seats up here, so if somebody comes in, please feel free to tell them there are seats up front.

Before we start, I wanted to ask the audience a few questions. How many of you use TikTok? That's it? It's difficult, yeah, that's what I was wondering. Can you hear me better now? Good. Again, how many of you have used TikTok? Now that's better. How many have downloaded and used it once, maybe? Okay, some hands are showing up.

So this talk is about accelerating the recommendation APIs. When you first open the TikTok app, you get a For You page; these are the recommended videos we suggest to our users, and it's important that users get that experience lightning fast when they first open the app. This technology, which we built around open source tools, accelerates those APIs. What you can learn today is how we use some of the open source software from this community, and how we plan to contribute back to the community with our learnings.

Just for background: I've been coming to KubeCon for the last few years trying to figure out a business case to invest in multi-cloud, and for the last two years I couldn't find one, because we build our own data centers, we build our own edge sites, and we can do it much cheaper than cloud. This is the first use case where we are investing with multiple cloud providers and see real value. Thanks to all the product managers who were hanging out with me at the board meetings and CDO summits for encouraging me to look at this. This presentation is our first attempt in this community to start seeding conversations around our learnings, how we use these tools, and how we want to contribute back. So please be kind to us, we don't know our way around here yet, and feel free to ask questions, because this is meant to be interactive and we don't have a rigid agenda
here. It'll be very lightweight, a mix of a little product and business (if you want to think about a multi-cloud journey and how to make the business case) and a little bit of technology. I have to tell you that we don't have a demo today, because this code is still closed source. We plan to open source the code and then demo it, because that makes more sense; otherwise it would just be showing off. Makes sense?

Okay, before we start I wanted to acknowledge the engineers here; I'm honored to present their work. I just make business cases and investments, but they are the ones who built it. We are not TikTok; we are the infrastructure team inside the company, so our customer is TikTok. Just so you understand what we do: I already explained that we build data centers around the world, and I'll explain how, and how we run our own sites, but in this particular case we are using some of the cloud providers to do it. Just to understand the audience: how many of you are from cloud providers or work with cloud providers? Okay, the majority. Good, then I think this will be very interesting.

Let's start. This is Global Service Accelerator (GSA). On the value side it does a few things: it improves our security, availability, and performance. We'll go into detail on what we mean by each, but let me dive into this first slide. The Internet itself, the public infrastructure we call the public Internet, is not that fast. Big companies have invested a lot of money building their own infrastructure, figuring out how to connect to the ISPs and get to their end users fast. We are one of them, but it is still very, very hard to build this infrastructure globally. One of the key things we saw is that the cloud providers have invested a lot of money in their backbones, and when we move traffic onto those backbones it becomes very fast. What we mean by very fast is that the
latency drops for our customers. We measure these latencies as round-trip times (RTT), and that's how we know. Initially, when we were working out how to use cloud, we ran early MVPs in Brazil, moving traffic to North America to see whether it's for real that riding somebody's backbone is faster, and the results were very promising; that made us invest more in the idea. But this slide shows a very simplistic picture. This is not how the Internet is actually laid out, and it is not possible everywhere, but in certain regions we do have users connected through Internet service providers to data centers owned by cloud providers, and those providers have very strong backbones to route that traffic on to our data centers.

Please understand that this is a business case that works for TikTok; it may or may not work for everybody, because at our scale we can find value in these things, save money, and also increase performance for our applications. So please do your own due diligence as we go through these slides; just a disclaimer. But in a sense, what we are doing is hacking the cloud's backbone for our own benefit; that's what it really is. When we move the proxy near our end users, a few things happen naturally, and we'll go into a little more detail on security posture: whenever we see traffic that is not ours, or there's an attack, we can drop those packets sooner instead of taking the cost on our own infrastructure, because cloud pricing is expensive once you move traffic inside the cloud.

So this is Global Service Accelerator. Just a show of hands: has anybody used a cloud provider's global accelerator? Three, four. Okay, good. You'll probably find similarities here, but once we open source this code you won't need theirs; you can use this. Basically, what it does for us is provide neutrality across cloud providers and regions.
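To make that comparison concrete, here is a minimal sketch, with made-up numbers rather than our real measurements, of how round-trip-time samples from a direct public-internet path and a cloud-backbone path might be summarized side by side:

```python
import statistics

def summarize_rtts(samples_ms):
    """Summarize round-trip-time samples (in milliseconds): average plus
    rough p50/p99, the kind of numbers compared in an MVP like Brazil."""
    s = sorted(samples_ms)
    return {
        "avg": statistics.fmean(s),
        "p50": s[len(s) // 2],
        "p99": s[min(len(s) - 1, int(len(s) * 0.99))],
    }

# Hypothetical samples: user -> origin over the public internet
# versus the same path riding a cloud provider's backbone.
direct   = summarize_rtts([180, 190, 240, 410, 185])
backbone = summarize_rtts([110, 115, 120, 130, 112])
```

The point of keeping percentiles alongside the average is the "stable latencies" theme later in the talk: a backbone path tends to tighten the tail, not just the mean.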
So we can go into any region or any cloud provider, wherever we see the best value in terms of price or performance, and put this out there. It's very simple in its configuration, and you'll see how we have evolved it; it's going to look very similar to what the cloud providers offer you.

Okay, this is the real picture of how we distribute traffic across the world. We have three big data centers, and they are located this way for user privacy: when we move traffic for the United States, it stays in the United States, and all the users are connected to data centers inside that same country. This is for compliance and regulation purposes, but it also saves us money. The three big data centers are one in the Americas, one in Europe, and one in Southeast Asia. Typically, when a user connects to TikTok, GeoDNS resolves the name to a specific address, and that address is directed based on where the user is coming from: if it's the Americas, it goes to the American data center. This is how our traffic is routed toward the origins. Some of the things we do for recommendation need a lot of compute and a lot of power, so they cannot easily be ported out to cloud providers, although we are trying for performance reasons; but the way we route this traffic to the origins still maintains the guarantee that data for users in North America stays within North America.

When we insert GSA, the picture changes a bit, because we are moving out to the edge. I still haven't fully understood the definition of edge, and it's been a few years, but for us edge means something closer to our end users; that's all it really means. It doesn't have to be a mile away; it could be a region somewhere.
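That GeoDNS-style pinning can be sketched as a simple lookup; the region names and hostnames below are purely illustrative, not TikTok's real topology:

```python
# Users are resolved to the origin of their compliance region, so
# e.g. US user data is served from, and stays inside, the US.
REGION_OF = {"US": "americas", "BR": "americas",
             "DE": "europe", "FR": "europe", "SG": "southeast-asia"}
ORIGIN_OF = {"americas": "origin-us.example.com",
             "europe": "origin-eu.example.com",
             "southeast-asia": "origin-sea.example.com"}

def resolve_origin(country_code):
    """Map a client's country to its regional origin endpoint."""
    region = REGION_OF.get(country_code)
    if region is None:
        raise ValueError(f"no compliance region configured for {country_code}")
    return ORIGIN_OF[region]
```

Note that inserting GSA changes where traffic first lands (the edge), but not this mapping: the origin a user's data lives in stays fixed by region.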
So in the experiments I was describing in Brazil, and now as we expand to other countries, we pick up the traffic in South America, use the cloud provider's backbone, and hit the origins in America. We see clear value when we do this, because our users' latency drops and the recommendation feed gets really fast; people are more engaged, and there's more growth for TikTok when people click on more videos and watch them. The same principle applies across all three regions: in Europe, the same way, we pick up traffic from the different countries and move it toward the zones where the compliance requirements sit.

We have just started on this project; we invested late last year, we are already in production, and we are expanding globally. We are starting with the US and Europe, and that's why the slide says 2023-24: it's not complete yet. Today we are accelerating the recommendation APIs, but other TikTok products some of you use, like LIVE, will also be using this service later.

This next diagram is complicated even for me, so I'll try to dissect it; hopefully I'll do a good job. Reading left to right, this is our actual deployment. It may or may not be your deployment; there are specific reasons we did it this way, and you can ask me more questions, and the engineers who built it are here, so they can answer well. There are three types of traffic streams we get from users: HTTPS, QUIC, and WebSocket. The picture shows two cloud providers, cloud provider one and cloud provider two; those are the two big gray boxes on your right. When the traffic comes in, we use anycast IPs, which we conveniently rely on the cloud providers to provide. Per cloud provider we get an anycast IP, and we know which region has the closest PoP for that cloud provider. The traffic hits the load balancer, and from our control plane (the boxes on the top) we can program the cloud controllers for those specific IPs, so we know exactly where a user is coming from and where the traffic is being directed. Once through the load
balancer (we mostly use layer 4 load balancers), the traffic goes into our proxies, what we call global service accelerators. For lack of a better description, think of these as high-performance NGINX or Envoy; at our scale, when we move millions of these queries, the stock technology stops working for us, so we made improvements there, and that's where our contribution back to the community will come from: how to push such a high volume of traffic. We use the cloud providers' managed Kubernetes; we think that layer is best managed by them, and a shout-out to the cloud providers here, they do a pretty good job managing it for us. The value for us is controlling our traffic, encryption and decryption, and user privacy; those are important to us, and GSA protects them well. We care about latency and performance; Kubernetes and managed services like the load balancers we treat as commodities.

Once traffic hits the GSA, there are a few things we can do. We can measure not only end-user performance (we have client-side code for that) but also the latency, the round-trip time, on each leg of the path. More importantly, we program security policies on it. The most generic functions you'll find are the web application firewall and DDoS protection, which are easy to run at the GSA. More interesting is advanced programmability: there's a threat here, how do you detect it, how do you program against it? We believe this is how edge will evolve, into function-as-a-service, where you write simple scripts, program the logic, and get the output you want. So GSA is a combination of all of this: not just a high-performance proxy, how we wire it into the network, and how we use the cloud's backbone, but also all the programmability we care about.
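A toy example of the kind of scripted edge policy this enables; the request fields and thresholds here are assumptions for illustration, not GSA's actual API:

```python
def edge_policy(request):
    """Decide at the edge whether to drop or forward a request,
    before it consumes backbone bandwidth or origin compute."""
    # Commodity WAF-style rule: block obvious path traversal.
    if "../" in request.get("path", ""):
        return "drop"
    # Volumetric signal (hypothetically attached by the load balancer):
    # drop sources that exceed a per-IP request-rate threshold.
    if request.get("rps_from_ip", 0) > 1000:
        return "drop"
    return "forward"
```

The economic argument from earlier applies directly: a "drop" returned here costs nothing downstream, whereas the same bad packet carried to the origin is paid for twice, once in bandwidth and once in compute.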
Again, as I said, this may or may not work for everybody, but this is how we want to run it, because we want a certain amount of control over our user experience and our users' privacy. And of course there is end-to-end encryption, so we do all kinds of certificate management and rotation and all that good stuff. We're still working on a lot of things, we haven't figured out everything, but this experiment is yielding a lot of positive results on both cost savings and performance. Again, this shows the specific deployment we have today; it exists for particular reasons, because our recommendation algorithms need a lot of compute and we still haven't figured out how to use cloud resources for that. But no other company has to do it this way: the origins can be endpoints within the cloud provider. Those origins on the right side could move inside the cloud provider; I'm just giving you a sense of how flexible this deployment pattern is. Depending on your needs, you can change and modify it.

Okay, to summarize: we built this because we want neutrality across all cloud providers. Cost and performance both matter to us, performance most of all; if we find value in lower latency with a certain cloud provider, we'll probably use that provider even if it is more expensive. On the value side, it lets you choose the best value per region based on endpoint cost or routing cost. If you have used cloud, you know the bandwidth charges for routing are very high, so you can be smart about which cloud provider to use where. It also gives you unified operations and maintenance, so you don't have to go to AWS or Google or Microsoft Azure separately (I'm naming names, but that doesn't mean we are their clients). It just means you don't need a DevOps team maintaining each of these cloud providers separately. You can own your
destiny through this unified operational maintenance. The key thing we measure is the average RTT per service: how well are our users doing? One thing we do use from cloud, by the way, is the auto-scaling features, which have been developed robustly over the years; they help us scale out, because our traffic patterns are spiky by region, time of day, and events, so we want to scale up and down rapidly, and we do that there. The other vector we care about is security. As I mentioned, there's full-path encryption here, automatic certificate management, and we integrated commodity functions like the web application firewall and DDoS protection, because it's best to drop a bad packet sooner rather than carry it across your infrastructure and pay a cloud provider for the privilege.

The piece we are actively working on is the function-as-a-service component. I mentioned the programmability; I gave just a few examples of these programmable functions, but it's up to your creativity what you do with these proxies. We use them for our security management inside the company. We use them for what is called header enrichment, when you want to put a route label on any of these request or response headers. We also use them for privacy, and we'll go through that use case: when certain governments or regulations require it, trusted third parties can program where the data should be served from, or where it needs to be stored, and that is easy to do with a technology like this. And as I mentioned, security policies are simple: you write a script, deploy it, and it works.

So this summarizes the key benefits of GSA. I'm a product guy, so pardon me, I had to put this slide in, but if some of you want to make this kind of investment, this is the slide you probably want to make for your management.
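Header enrichment itself is a small transformation; a hedged sketch, with header names invented for illustration rather than taken from GSA:

```python
def enrich_headers(headers, pop_region, edge_rtt_ms):
    """Attach a route label and an edge-leg RTT so downstream services
    can see which path a request traversed. Returns a new dict and
    leaves the original headers untouched."""
    enriched = dict(headers)
    enriched["X-GSA-Route"] = pop_region
    enriched["X-GSA-Edge-RTT-ms"] = str(edge_rtt_ms)
    return enriched
```

In a FaaS-style proxy this is exactly the kind of one-screen script the talk describes: deploy it at the edge and every request carries its own routing breadcrumb to the origin.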
One thing that was clear in the early discussions with TikTok is that they care about not just improved but stable latencies, and it's very important to have advanced telemetry that can pinpoint which leg is suffering and why. So we have measurements on the user device side, telemetry around the load balancers and the proxy, and on the backbone leg to the origin; we monitor for consistent performance, and that matters to us more than anything else, more than savings. The savings side came later, when we realized that pushing the traffic through here also saves money. We measure two things there: queries per second, how many queries we are serving across the recommendation APIs, and the throughput of the compute we are using from the cloud providers, meaning transactions per unit of compute. In the abstract, that number tells you whether the health of the product is good or not. For your cost calculations: some companies care about this, some don't, but it was one of the metrics we cared about.

One of the most important metrics was time to deploy. When we expand across regions, especially in the countries where our growth has been phenomenal, we want to expand rapidly, so time to deploy matters. That's why we use managed services: we are a very small team, we don't want to invest on the commodity side, and that part becomes easy if we hand that liability to a cloud provider. So that's another metric behind our choices.

Okay, now this is what we want to do for the community, and I'm looking for feedback; we want to engage in a way that helps us understand user needs better. For TikTok we do the hand-holding, because it's a big customer with needs beyond what, say, a small website or an app would need.
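The two cost metrics above reduce to simple arithmetic; a sketch with hypothetical numbers and an invented value formula (transactions per core per dollar), not the exact calculation used internally:

```python
def transactions_per_core(qps, cores):
    """Throughput per unit of compute: queries served per second
    divided by the cores serving them."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return qps / cores

def best_value_provider(offers):
    """Pick the provider whose offer yields the most transactions per
    core per dollar. `offers` maps name -> {qps, cores, cost_per_hour}."""
    def score(name):
        o = offers[name]
        return transactions_per_core(o["qps"], o["cores"]) / o["cost_per_hour"]
    return max(offers, key=score)
```

This is the "best value per region" decision in miniature: with identical throughput per core, the cheaper endpoint wins; with identical price, the more efficient one does.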
But we think the configuration workflow should be very similar to what the cloud providers offer, except it should be free. So if we can give you a workflow as simple as configuring accelerated IPs, global or per cloud provider, your choice, and enabling commodity features like DDoS protection, that makes your life simpler. The only thing you have to provision on the back-end side is the listeners where your traffic arrives: layer 4 UDP or TCP ports, whatever filter you want to configure. Some companies use layer 7 application signatures, which are more advanced; you can insert a token from the user client and check whether that client is hitting this particular proxy, and that is possible here too. Eventually you configure an endpoint group where your traffic lands, whether within the cloud provider or at your own origin site. You do want those endpoints to auto-scale, because when the stress develops, those are the pressure points (in our case the recommendation engines), so they should dial up and down with the traffic; otherwise your costs will be exorbitant if you keep capacity reserved. Laid out this way, the configuration workflow probably makes sense for somebody to use. We're still working on this; it's not ready for the open-sourcing effort yet, but this is how we think it should be structured.

On the visibility side, there are a few things we want to monitor (this is not a complete list by any means): packets in and out, bytes, layer 4 connections, and layer 7 connections. All of these are cost-related metrics as well if you deal with any cloud provider.
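Put together, the workflow might look like the sketch below: an accelerated IP, listeners, and an auto-scaling endpoint group, roughly the three things the workflow asks you to declare. This is a possible shape, not the real schema; all field names are illustrative.

```python
config = {
    "accelerator": {
        "anycast_ip_scope": "per-cloud-provider",  # or "global"
        "ddos_protection": True,                   # commodity feature, just toggled on
    },
    "listeners": [
        {"protocol": "TCP", "port": 443},  # HTTPS / WebSocket
        {"protocol": "UDP", "port": 443},  # QUIC
    ],
    "endpoint_group": {
        "region": "us-east",
        "endpoints": ["origin-us.example.com"],
        "auto_scale": {"min_replicas": 2, "max_replicas": 50},
    },
}

def validate(cfg):
    """Minimal sanity checks on the sketch above."""
    assert cfg["listeners"], "at least one listener required"
    for listener in cfg["listeners"]:
        assert listener["protocol"] in {"TCP", "UDP"}
        assert 0 < listener["port"] < 65536
    scale = cfg["endpoint_group"]["auto_scale"]
    assert scale["min_replicas"] <= scale["max_replicas"]
    return True
```

The deliberate omission is anything provider-specific: the same declaration should be deployable against any cloud, which is the neutrality argument from earlier.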
And that brings me to the second point: in aggregate, we also want to give you a picture, when you work across cloud providers, of how much the endpoint costs you when you serve that much traffic out of it, and how much the routing costs, which includes both the load balancer and the traffic served, because cloud providers have different policies around egress and ingress bandwidth. And the most important thing we care about is RTT, because it's fundamental to our business: the average RTT on both legs, from the user endpoint and to the server endpoint.

Okay, this is what I was talking about before, function as a service, and why it makes sense for the edge. It looks like a simple picture, but it has complications around compliance. We have a team managing user safety and data; they are the ones authorized to program a policy. We also have partnerships with trusted cloud providers, and they monitor what's happening. In this picture we are simply showing a setup where you deploy this piece of software at the periphery of the cloud provider, program a policy, and hand over the reins to somebody you trust. We think these FaaS functions will be phenomenal for the global privacy concerns people face today, where people can use this open source code and solve these problems.

Coming to the last stage: what is our plan? You are in the talk today, so we're presenting our first attempt at using multiple cloud providers to solve a problem for us, and we want to contribute a case study so people understand in greater depth what this was all about and how we solved it. In 2024 we plan to open source our GSA, the high-performance proxies I talked about, and these proxies will work across multiple cloud providers.
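Going back to the per-leg RTT telemetry mentioned above, the bookkeeping reduces to something like this sketch (the leg names are illustrative):

```python
def rtt_breakdown(leg_rtts_ms):
    """Given per-leg RTTs, e.g. client->edge and edge->origin, return
    the end-to-end total and the dominant (slowest) leg, which is the
    leg worth investigating first when latency regresses."""
    total = sum(leg_rtts_ms.values())
    worst = max(leg_rtts_ms, key=leg_rtts_ms.get)
    return total, worst
```

Measuring each leg separately is what makes "pinpoint which leg is suffering" possible: an end-to-end number alone cannot tell a slow last mile from a slow backbone hop.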
The final stage is to open source our FaaS, which lets you program all these policies; you can be as creative as you like, because they are very simple to write. So that brings me to the conclusion, but I wanted this session to be interactive, so if you have any questions: the engineers who worked on this technology are here and can handle the technical side, and I can answer the business and product side. Thank you. Yeah, go ahead.

Q: How does this approach enable faster throughput or lower latency on the backbones?

A: Throughput in terms of compute or traffic? Both matter to us. On the connection side, it's about how much traffic we are pushing, and it varies by the type of application: for certain applications latency matters more than throughput, so we program for that; other applications don't care, it's "I'm uploading images," so we program for that instead. It's a choice.

Q: I have two or three questions. First: because this is such a big application, used by people all over the world, do you see load fluctuations, surges within seconds or tens of seconds, that can cause high latency in your application?

A: I don't know, James, do you want to take that?

A (James): Yes, that's a good question. When we deploy our system we keep buffers, so we can absorb a certain amount of spike; a short-term spike won't cause a big delay. It would only be a problem if the provisioned capacity couldn't meet the requirement, and in our case we always have a buffer on top.

Q: Second question: you have data centers across multiple clouds and multiple regions. Are you applying any optimization across these regions using global knowledge?

A: We do. We have multiple optimizations for the individual cloud providers' platforms. For example, when we
deploy our proxy on top of Kubernetes, for example GKE or AKS or whatever, we have to tune parameters to get better performance, so the parameters and network optimizations depend on which platform you deploy on. In some areas the load balancers are pure pass-through: TCP and UDP go directly to our proxy, and our proxy handles layer 4 and layer 7 itself. But in some areas or regions, layer 7 load balancers get better performance, because they can terminate the layer 7 traffic at the edge, closer to the users, so in those cases we pay for layer 7 load balancers. The way we look at it, quality comes first, because quality is the most important factor, and then we weigh the cost. It depends on the region: in North America, layer 7 versus layer 4 doesn't matter, the quality is comparable, we didn't see any difference; but in some locations, for example Southeast Asia, the layer 7 load balancers make a big difference.

Q: Thank you for the session. I have two questions. First, one of GSA's key goals is performance, so I'm curious which open source technologies you use in your custom GSA to boost it. Specifically: what is the size of your biggest Kubernetes cluster for GSA in one region, say the continental US, and are you doing a lot of caching, and with what open source technology? Second: in the architecture, the end servers sit outside the cloud. I know there can be business reasons not to move them, but I'm curious whether there is a technical constraint you see with
the cloud, a driver for still not moving there.

A: Let me take the second question first. It's a business calculation: running these recommendation systems on cloud, unless we figure out how to optimize it, becomes expensive. We are working on it; it's not that we don't see the value, because we closely monitor everything, even the latency coming out of a GPU. The network latency is the hardest part; if we solve that, the overall experience goes up. So we know the value, but yes, it's cost prohibitive.

Q: So cost is the main driver?

A: Correct, it's the cost.

A (James): I'll answer your first question. We have just started in production, phase one, so we have maybe four to six clusters, around 40,000 to 50,000 CPUs in total, but we are growing very fast. In this architecture the number of CPUs is not a big issue, because you can spin up as many Kubernetes clusters inside a region as you need.

Q: Sorry, is this currently serving all of US TikTok, for example? Is it live for all US users?

A: No, no, this is only for the APIs. For caching we have another, similar system. You can think of this as dynamic content acceleration; it doesn't handle files. The caching system is separate, similar but different.

Q: And what's the size of the clusters you mentioned? How many nodes per cluster?

A: We keep our clusters right-sized: normally one Kubernetes cluster is around 200 to 500 nodes. We don't want huge clusters, because our load balancer can hook up to multiple Kubernetes clusters, so we can easily maintain, say, ten clusters of that size;
it's easier for us to maintain multiple clusters than one bigger cluster. All right, thank you so much.

Q: Can you talk a little more about what it's like operationally to run this? Do you have service level objectives or SLAs? Have you experienced a major outage you can talk about?

A: That's a good question. On SLAs: we have multiple similar systems providing a similar service, and for this solution, built on public cloud providers, we also have SLAs, treated the way you would treat a commercial service, with end-to-end monitoring against them. I'm sorry, I don't remember the exact numbers, because the SLAs are handled by another team, the SRE team, but basically we provide an SLA comparable to or even better than the commercial products.

Thank you. And yes, TikTok is one of the biggest and fastest-growing apps, which is why the cost of maintaining this makes sense for us, but it applies to other apps too: basically, any API-heavy application can use this solution.

Well, thank you very much for hanging out with us today. Okay, thank you.