Hi, hello. Welcome to the LBaaS presentation. I am Prabhakar; I work for PayPal. This is Anand. I joined the LBaaS team recently, so I wanted to share all the things we have done putting LBaaS into production at PayPal. The agenda: I will first explain how LBaaS is deployed within PayPal, then the customizations we have done to make it work for our use cases, and the integration with DNS. Some numbers to show the scale of OpenStack in production at PayPal: we have about 8,000 hypervisors, which translates to roughly 400k cores running 82,000 VMs. And we have hundreds of load balancers, which has resulted in thousands of VIPs being created. It is very active; most of the time in QA you can see VIPs being created and deleted often. This is the architecture we have at PayPal. We use more than one vendor for the actual load balancers, and each provider has its own way of integrating with LBaaS. For example, for the provider with LB device A: whenever there is an API call to create a VIP, add a member, or create a pool, the corresponding plugin that the provider has implemented puts a message on the bus saying to create the corresponding VIP, add the member, or whatever the action is. The provider has written agents that poll that bus, read the message, and make the corresponding call to the backend LB so that the actual action happens. By design of this particular provider, one LB device is managed by one agent, so if the agent goes down, nothing works. To get HA, for the same LB we run the agent in HA mode using Pacemaker: if one agent goes down, Pacemaker detects it, another agent comes up, reads the message bus, and the work gets done properly. Once the LB does the actual work, it puts a message back on the bus, and the result is written to the LBaaS DB.
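The plugin-to-agent flow described above can be sketched roughly as follows. This is an illustrative model only: an in-process queue stands in for the real message bus (RabbitMQ in a typical deployment), and all function and field names are hypothetical, not the vendor's actual API.

```python
import json
import queue

# Stand-in for the message bus between the LBaaS plugin and the vendor agent.
bus = queue.Queue()

def publish(action, payload):
    """LBaaS plugin side: enqueue an action such as 'create_vip'."""
    bus.put(json.dumps({"action": action, "payload": payload}))

def agent_poll_once(backend):
    """Vendor agent side: read one message and apply it to the LB device."""
    msg = json.loads(bus.get())
    result = backend(msg["action"], msg["payload"])
    # On success the agent reports status back on the bus,
    # which is then written to the LBaaS DB.
    bus.put(json.dumps({"action": "status", "payload": result}))
    return result

def fake_backend(action, payload):
    # Stand-in for the real call to the LB device.
    return {"action": action, "vip": payload.get("name"), "status": "ACTIVE"}

publish("create_vip", {"name": "web-vip", "address": "10.0.0.10", "port": 443})
result = agent_poll_once(fake_backend)
```

In the real deployment the one-agent-per-device constraint is what forces the Pacemaker arrangement: if this polling loop dies, nothing drains the queue for that device until a standby agent takes over.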
For the other provider, they have provided a controller that takes care of all of that for us, which is simpler. But the controller does not run in HA mode out of the box, because there is no asynchronous message passing involved, so we have our own implementation to run the controller in HA mode. That is the overall architecture. As for the enhancements and customizations we did on LBaaS to make it work for PayPal: we introduced IP reusability for VIPs (I will go through what that means); we added SSL support, for creating SSL certs and actually attaching the certs to the LB; we added customizations to member health monitoring for LB members; we integrated with Nova so that whenever an instance is deleted, the corresponding member is removed from the LB; and we introduced some scheduler changes. I will go through each of them in detail. Basically, in the normal case you cannot reuse the same IP within a tenant: the same IP cannot be used for two VIPs. We introduced a feature in which a port can map more than one VIP per IP. These are the SSL things we did: we have all the CRUD operations for certs, private keys, and cert chains, with the corresponding UI changes in Horizon where you can upload your certs and keys, and likewise associate and dissociate an SSL cert with a VIP. We have both HTTP APIs and a corresponding Python CLI client, and, as I mentioned, the Horizon UI customizations so you can do all of this in the UI too. When it comes to member health monitoring, currently you have to create a monitor for each LB. But we added a "shared" flag, so that at onboarding time itself you can say a monitor is shared across VIPs; whenever you create a VIP, you can select the shared monitor to be used.
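The "shared" monitor flag just described might behave along these lines. This is a minimal sketch with hypothetical names, not PayPal's actual data model: a monitor marked shared at onboarding can be attached to many pools, while an ordinary monitor stays bound to one.

```python
class Monitor:
    """A health monitor; 'shared' is the onboarding-time flag described above."""
    def __init__(self, name, shared=False):
        self.name = name
        self.shared = shared
        self.pools = []

def attach_monitor(monitor, pool):
    # A non-shared monitor may only ever be bound to a single pool.
    if monitor.pools and not monitor.shared:
        raise ValueError("monitor %s is not shared" % monitor.name)
    monitor.pools.append(pool)

shared_mon = Monitor("https-check", shared=True)
attach_monitor(shared_mon, "pool-a")
attach_monitor(shared_mon, "pool-b")      # allowed: monitor is shared

private_mon = Monitor("tcp-check")
attach_monitor(private_mon, "pool-c")
try:
    attach_monitor(private_mon, "pool-d")  # rejected: not shared
    reused = True
except ValueError:
    reused = False
```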
Also, the V1 LBaaS API has no customization for the receive string: whenever a health check runs against the members, what string it should expect back. We now have that in the UI, and an API to specify what the receive string should look like. There is a lot left to do in the scheduler, but we have done some enhancements that solve the problem for us currently. Basically, we can do scheduling based on the VPC, the virtual private cloud. We run more than one VPC, for example for Dev, QA, or external users, and each VPC is going to have a different LB device to be managed, so you first have to select which LB to use for a particular VPC. It works exactly like the normal filter scheduler: based on the VPC it selects a set of LBs, and then, based on current capacity, the LB device with the lowest utilization is allocated and the corresponding actions happen on it; the addition of VIPs and so on happens on that particular LB. This is exactly like selecting a hypervisor for an instance creation. So this is how it looks: whenever a VIP or pool needs to be created, the request goes through the list of schedulers. As I said, we have the VPC filter scheduler, which gives you the corresponding provider; based on that provider, the plugin I showed two slides back is used, so the actual VIP is created on the corresponding LB device. It runs through the list of filter schedulers, and finally an LB device is selected and used. Now the DNS integration: whenever you create a VIP, the corresponding A record and PTR record are created in DNS.
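The two-stage scheduling just described (VPC filter, then capacity) can be sketched as below. Device records and field names here are made up for illustration; the point is the Nova-style filter-then-weigh pattern, where the least-utilized surviving device wins.

```python
# Hypothetical inventory of LB devices, keyed by VPC and current utilization.
devices = [
    {"name": "lb-dev-1", "vpc": "qa",   "used": 70, "capacity": 100},
    {"name": "lb-dev-2", "vpc": "qa",   "used": 30, "capacity": 100},
    {"name": "lb-dev-3", "vpc": "prod", "used": 10, "capacity": 100},
]

def vpc_filter(devs, vpc):
    """Stage 1: keep only devices that serve the requested VPC."""
    return [d for d in devs if d["vpc"] == vpc]

def capacity_schedule(devs):
    """Stage 2: pick the least-utilized device, like Nova picking a hypervisor."""
    return min(devs, key=lambda d: d["used"] / d["capacity"])

chosen = capacity_schedule(vpc_filter(devices, "qa"))
```

A VIP created for the "qa" VPC would then land on `lb-dev-2`, the QA device with the most spare capacity; the prod device is never considered.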
For creating those DNS records, we use the same message bus. We have an internal product code-named Starch; in some AZs we have Starch running, and in some AZs we have Designate, which consumes this notification and creates the corresponding A and PTR records on the backend DNS servers. These are the things we are currently working on. There are no bulk APIs: you cannot submit a set of VIPs to be created at a single time, you have to loop over the corresponding Neutron API calls yourself. So we are working on bulk APIs. We have also already done composite APIs in our load balancer management system, where you can describe the whole setup — your pool name, your members, the ECV checks, and how the members are going to be checked — in a single JSON, and it goes through the whole thing and creates it; we are bringing that to LBaaS too. And we are doing a migration to V2; currently we are running on V1. As I said, currently we have only two filter schedulers, one for capacity and one for VPC, so we still have to do the other things I mentioned, like SLA-based or tenant-based scheduling. And quota support: right now there is no quota support for tenants — for example, saying this tenant can create only this many VIPs — so we are working on that too. The other major thing we see is sync issues. By sync issues I mean: you create a VIP using the corresponding plugin, and the actual LB is doing the health checks. Say one member goes down; that data is not reflected in the LBaaS DB. So if you query LBaaS, it will say these members are up and running, but in reality one member is down. We are not yet able to propagate that back to LBaaS so that whenever you query it, it reports those members as down. Those are the things we are working on.
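The composite API mentioned above might take a shape like the following. The field names and structure are a guess for illustration, not the actual LBMS schema: the idea is that one JSON document replaces the loop of individual pool, member, and monitor create calls.

```python
import json

# Hypothetical composite-API request: pool, members, and health check in one JSON.
composite_request = {
    "pool": {"name": "web-pool", "protocol": "HTTP", "lb_method": "ROUND_ROBIN"},
    "members": [
        {"address": "10.0.0.11", "port": 8080},
        {"address": "10.0.0.12", "port": 8080},
    ],
    "health_monitor": {"type": "HTTP", "url_path": "/health",
                       "expected_receive_string": "OK"},
}

def unroll(request):
    """Expand the composite document into the sequence of per-object calls
    that a client looping over the plain API would otherwise have to make."""
    calls = [("create_pool", request["pool"])]
    calls += [("create_member", m) for m in request["members"]]
    calls.append(("create_health_monitor", request["health_monitor"]))
    return calls

# Round-trip through JSON to mimic the request arriving over HTTP.
calls = unroll(json.loads(json.dumps(composite_request)))
```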
These are the developers from our team, along with our friends from eBay, and Mahan Bharath, who are currently working on LBaaS at PayPal. I think that's all I have, and we have time for questions, so we are here to answer whatever you want to ask. Before that, I have a question for you: how many of you are running LBaaS in production? So the rest of you have your own management solutions? Because we have our own custom solution too, and we are trying to migrate — we have migrated some of them — and there are challenges we faced in that migration as well. One of my colleagues from eBay gave a presentation on that at the Vancouver Summit. Questions? Anand is here, and our product manager is also here. [Question from the audience about upstreaming these changes.] Yes, we have plans; we do want to push it upstream. At least we thought we would put it on GitHub so that people can comment. Before that we have to go through the process — not yet, not yet — we can do it, but we have to go through the process: you have to have a blueprint, then specs and reviews and so on. Yes, we can help with that; you can meet us after this and we can work on it. A couple of my developers also talked at the Vancouver Summit about the UI customizations to include SSL certificate uploads and binding them to the VIPs and so on. But it's a matter of finding the time. You can imagine: back when eBay and PayPal were one company, we had a large footprint of load balancers. We wrote the load balancer service itself over two and a half years to make sure it can take both eBay and PayPal production traffic today. You can imagine how many VIPs and pools we have, and the platform as a service on top of that, which we also built over the years.
So if you want to migrate from the existing load balancer service — we call it LBMS, load balancer management service — to LBaaS, that migration is in itself a big project for us. Basically, it is about finding the time for the developers: you can imagine, there are only four developers now for each company, and you have to weigh how much time you have to satisfy your business needs against how much you can push upstream. But we definitely want to push it upstream, and we have some breathing room now, so we definitely want to do that. To answer your question: yes. We'll share our information with you and we can work on that. Cool. All right, anything else? We also looked at a bunch of other things. If you look at SSL certs, we didn't even have SSL APIs, and we would have had to wait for the community to make that happen — it took more than a year to come in. But we had already put together a project to move away from our homegrown solution to community LBaaS, and we couldn't wait for that, because otherwise your project is going to be in jeopardy. So we took a lot of effort to implement all of those, and of course we had to partner with our vendors to make it happen. Now it's part of Kilo and Liberty, and we can take this code and see how much of a difference there is between our API version and the community version. And people had started integrating other layers with all these APIs; now you have to ask them to move. They asked tons of questions: "Hey, you just told me six months back to use this API, and now you're asking me to move to another one — what am I going to get out of it?" So there are a lot of internal things you need to deal with when you work with communities, and I'm sure everybody is doing the same thing when the APIs aren't well defined. All right, anything else? [Question about health checks.] So actually we have a capability where you can customize the health check itself. There are health checks in LBaaS itself.
Where you can, for example, run something on port 80 and it tells you... TCP or HTTP, right? Yes, but that's the community version. Our eBay and PayPal VIPs just don't work with only HTTP and TCP checks. Exactly. So we have ECV monitoring, which the community didn't have, and we had to implement that; otherwise you can't migrate some of our VIPs and monitor those applications. That's one thing. Another big challenge we had: we have a clean API where you can query the load balancer and it will tell you exactly the member status. Neutron doesn't have that at all. Without it you are living in the dark about the health of your services, right? Things like that. Previously, the LBMS that he mentioned actually goes to the LB and queries the state. LBaaS has its own state now — that's what I was talking about with the sync issues: the LBaaS state does not reflect the real state of the LB. That's the problem: when a member on the LB goes down, there should be a way for the LB to publish a message that LBaaS is listening for, so it can update its member status. That's not there yet. Also, if you have a large infrastructure with hundreds and hundreds of load balancers, managed by some different system or maybe some kind of scripts — say you have complex layer 7 rules — and you are going to dedicate LBaaS to completely manage those LBs, you can't change anything on the back end directly, right? And how are you going to reconcile this data? Because your source of truth is going to be the device. You could say, "hey, I'm the source of truth."
But there will be differences between the actual infrastructure and what you have in the database. How are you going to make sure that when somebody changes things behind the scenes, your database doesn't drift out of sync? We built all those corner cases into our homegrown solution, but LBaaS doesn't have them. If you are on the path of migrating from your homegrown solution to the community one, you have to keep those users off the load balancers unless you have feature parity with whatever they are using. Typically, any operations team has complex rules to satisfy business needs, and you can't say, "hey, my LBaaS doesn't support it, so you can't have that on the load balancer," right? So how are you going to make sure that whatever you drive through the API is reflected, and at the same time, whatever is done directly on the load balancers is reflected on the other side of the fence as well, so that you keep consistency? These are all the challenges when you don't have a green field and you are taking existing infrastructure, with all those load balancers, and pushing it into the new system. There's a lot of money involved, too: saying "I'm going to completely replace all these load balancers with newer ones" means millions and millions of dollars, right? It's not a trivial task. [Question about auto-scaling.] So we don't have...
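One way the missing status-propagation path discussed above could work is sketched below. The event format and store are hypothetical: the LB device would publish a member-status event on the bus, and an LBaaS-side consumer would reconcile its database so that queries reflect the device's real state.

```python
# Stand-in for the LBaaS member table; in reality this is the LBaaS DB.
lbaas_db = {
    "member-1": {"address": "10.0.0.11", "status": "ACTIVE"},
    "member-2": {"address": "10.0.0.12", "status": "ACTIVE"},
}

def on_member_status_event(event):
    """Consume a health event published by the LB device and reconcile
    the LBaaS view with the device's real state."""
    member = lbaas_db.get(event["member_id"])
    if member and member["status"] != event["status"]:
        member["status"] = event["status"]

# The LB's health check sees member-2 fail and publishes an event;
# after consuming it, a query against LBaaS reports the true state.
on_member_status_event({"member_id": "member-2", "status": "DOWN"})
```

This only covers drift the device itself reports; catching out-of-band changes (someone editing the device directly) would still need a periodic full reconciliation pass, which is the harder part of the source-of-truth problem described above.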
I would say we don't have auto-scaling, but we have manual flex-up and flex-down based on your traffic pattern, or growth for your application — specifically in the eBay and PayPal case, you can imagine we grow double digits every year, and you expect your applications to grow as well. For that we have platform as a service: as a developer, you go into the platform, via the CLI or whatever it is, and say, "okay, I want to increase my application pool by some percent." Then it spins up the VMs, rolls out the code, adds them to the load balancers, marks them up, and makes sure you have the right ECV checks. So we have end-to-end automation all the way from the developer experience to production, with self-service for most of our applications. So we have manual flex-up and flex-down, but we absolutely want to be hands-off and flex up at one o'clock at night without anyone involved. That's still a long way off: there are multiple triggering points, and you don't just want to chase spikes. Suppose you run out of capacity at 80 percent utilization, but it's just one spike — are you going to create 100 VMs and add them to the pool? What's your cool-down period? You need to understand all these different spike patterns and act accordingly: what your cool-off period is, and then do the flex-up automatically. For that we need a lot of data to derive the pattern, and we are definitely working on it as part of the platform-as-a-service capability.
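The spike-versus-sustained-load decision just described can be sketched like this. All thresholds, window sizes, and names are illustrative, not PayPal's actual policy: flex up only when utilization stays above the threshold for a full sampling window, and never inside the cool-down period after a previous scaling action.

```python
def should_flex_up(samples, threshold=0.8, window=3,
                   last_action_ts=0, now=0, cooldown=600):
    """Decide whether to add capacity, given recent utilization samples (0..1)."""
    if now - last_action_ts < cooldown:
        return False                            # still in cool-down
    recent = samples[-window:]
    if len(recent) < window:
        return False                            # not enough data yet
    # Require sustained load, so a single spike does not trigger scaling.
    return all(u > threshold for u in recent)

spike = should_flex_up([0.5, 0.5, 0.95], now=1000)        # one-off spike
sustained = should_flex_up([0.85, 0.9, 0.92], now=1000)   # sustained load
cooling = should_flex_up([0.85, 0.9, 0.92],
                         last_action_ts=900, now=1000)    # recent action
```

Only the sustained case triggers a flex-up; the spike and the in-cool-down case do not, which is exactly the "don't create 100 VMs for one spike" concern above.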
Yes, the PaaS is the actual consumer of LBaaS: when the user says, "flex this app up to, say, 10 instances," the PaaS uses LBaaS to make that happen effectively. There's also the question of how long it takes to spin up the VM and roll out your code. It depends on your package size; by the time you roll out, two or three minutes may have passed, and the traffic spike itself might be winding down by then. That's one use case where, for flex-up and flex-down, we are definitely looking at Docker as the best fit, rather than completely relying on spinning up a VM and rolling out the code. There are a lot of things that need to happen in that scenario too: if you have half a terabyte of code base you want to push across the infrastructure, it's going to be impossible to do within one or two minutes, right? So the application also needs to get ready for containers: it has to be microservices rather than a very monolithic code base you roll out. So there are a bunch of things that need to happen for flex-up, flex-down, auto-scaling. It looks good in demos, but in the real world, if you are a company with 20 years in the industry, you know how much code base you have, and it won't change overnight. If you have a simple Node app, sure, why not? I'm sorry? [He's asking who our vendors are.] Oh, no, actually — let me put it this way. With load balancing, if you are a web company, it's about how you deployed your applications, how your ops architecture has evolved over time, and what best practices you built up managing all those VIPs and pools — because for every transaction coming into our infrastructure, there is money involved.
You could very well say, "hey, I'll take HAProxy and put it in," but making sure it runs at our scale, plus all the operational experience you have gained on the vendor devices, is much more than what you'd be comfortable with on HAProxy. Of course, where we want to go depends — I don't want to say we want to completely move off the vendors. It depends on the business problem you are trying to solve, rather than being religious about "I want to run everything on open source" versus partnering with a vendor, because we want the best of both worlds, right? We have best practices we figured out for the upgrade path of the devices, and we are very comfortable with that. And it's not that we have just one vendor: if you have two, you always have a choice, and not just from a price point of view. Vendors also have bugs in their firmware — under load, or other issues — and it takes some time to fix them. Instead of completely relying on one vendor for your whole business, it's better to have a choice, because every piece of software has bugs; there's no secret in that. And we don't want to take the really, really bleeding edge and risk our business just trying something out. So it's not just about money — it's the reliability and availability of your services from a business point of view. So we have two. Of course, for different use cases we want to try out other solutions as well, but whatever we do, we want to be very, very careful.
If you have a QA environment, then LnP, then production, you want unified infrastructure all the way from hardware to software, so that whatever you test is the same thing you deploy in production. You could very well say, "hey, I test my load-balancing solution on HAProxy," but if production is completely different, you don't know what is going to break when you roll it out. So use the same devices effectively across environments, so you don't introduce different variables, because in a larger infrastructure it's very hard to troubleshoot what is going wrong. If your transaction takes four seconds instead of three, it's a big deal for a company of our size, right? And for the business, it doesn't matter whether you use vendor X, vendor Y, or open source: you always want to drive down your transaction time. That's the business goal, and it matters a lot. So it depends on the business trade-off you want to take. But at least we can say we run both hardware and software. Yes, both — and virtual as well; we are trying virtuals too. To be very honest, we are mostly on physical devices; that's what we are very comfortable with, especially with the SSL offload cards. If it takes another 10 seconds to process your SSL, your transaction time is already gone — the PayPal customer has already left the terminal. So do you want to save that, or save this, right? Of course we want agility, and we want to move toward virtual appliances too, but we are working with our partners very, very collaboratively on that.
Because today, if you want to bring in a couple of load balancers, put them in an infrastructure rack, and operationalize them, think how much time that takes — versus, on a virtual appliance, spinning up that VM anywhere on your compute racks, bringing it online, and making sure it works for the routing, the layer 2, the SSL termination, and all those things. You can make it happen, but there's a path to get there, and we are definitely working with our vendors on that. Right? Any other question? Okay. So he wanted the... yes. It's agility versus performance — what is more important for you? Agility is of course very important, but not at the cost of performance. And mostly the vendor gives an abstract API, so the agent that is running works for both the hardware version and the software version, because the API is the same. So I'll tell you: we were on virtuals for one of our vendors, and we rolled out one of our availability zones that we'd newly built. We wanted to go virtual everywhere, and we were very near to the holidays, and we wanted that particular availability zone to take 10 or 20 percent of our traffic during the holidays. We do performance testing before we light up an entire data center built for that particular holiday season. Everything was the same everywhere, yet we started comparing why this particular availability zone was showing higher transaction times and more timeouts — in terms of the budget we have for every service: if a call takes more than a certain number of milliseconds, you just time out. You don't want to wait any longer; you want to go and try another availability zone.
And by the time you switch to the other one, the transaction for the customer sitting in the store, or online, is just spinning there, right? So we tested that, traced the slowness we had introduced, identified two or three variables, and went after them. Then we had only four weeks to replace all of it with physicals. You can imagine how difficult that is at our size. I'm sure we rushed on that; we should not have done it. We had to procure the load balancers, put operations teams on it, build all of them within two or three weeks, make sure we had the right interfaces, and then light them up. It was a nightmare, actually — that was not last holiday but the one before. Anything else? All right, thank you everyone. Thank you.