My name is Gurtheshwar, and I am a product engineer at Plivo. I have been here for almost a year now, and I have been in open source telephony for about five years; before this I was setting up hardware-based PBX systems, and my stint with cloud telephony began with Plivo. In a year we have learned a lot. I have seen the company grow in scale and in the technical tools we use, and I would like to share some of those learnings today. The most important one is how you keep your service from going down at all.

We provide a REST API to make calls and send SMS; that is the simplest use case. To give you an example, Netflix uses us to create conference rooms for their internal usage. So what does the scale look like? These numbers have doubled since I joined, about ten months ago. We have done 100 million minutes of voice this year, and we manage over 100 servers, both on the cloud and on hardware.

This is a snapshot from yesterday covering the past month. I don't know if it's very clear, but there are some orange blips there; that is when we had an interruption, but not a service downtime. We are pretty proud of the fact that we maintain such good uptime.

The product is broken into different services. From a consumer perspective, inbound calls are one service, outbound calls another, SMS another, and the dashboard UI another. This is what it looks like inside: the stack is separated into stateless services, stateful services, and data storage. This matters because, when you are thinking about failure, each of them has its own way of preventing it.

To give you an example of how this works, we use Flask as a front-end proxy. Whenever a customer makes an API request to place a call or send an SMS, it first hits Flask, our proxy layer. From there it goes to our business logic server, which is Django; that is what decides whether this user is eligible to make a call, whether they have credits, which route to take, and so on. From there it goes to our media servers, the stateful servers. The proxy machines are completely stateless; they just proxy requests in and out. Django is the one that talks to the database, and Django is also stateless. There is an argument that the UI servers should be stateful and store session data, but since we are an API company we don't have much traffic on the UI, so that has not been a priority.

We use Celery as our distributed task system. We have tasks like triggering an auto recharge: suppose a user says that if their balance drops below $50, we should trigger an auto recharge of $100. And whenever a call ends, we store the call detail record in the database. These kinds of things are handled by Celery, so Celery is another part of the stateless machines. OpenSIPS is a SIP proxy; it just proxies SIP packets, and SIP is the protocol that forms the signaling layer for all telephony applications. To be honest, it is not very hard to manage availability for stateless services: all you have to do is put a load balancer in front of them, and ideally automate it so the cluster scales up and down on its own.
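To make the Celery part concrete, here is a minimal sketch of the shape those tasks take. The task names, broker URL, and amounts are illustrative assumptions, not our actual code:

```python
from celery import Celery

# Illustrative broker URL; any Celery-supported broker works here.
app = Celery("billing_tasks", broker="redis://localhost:6379/0")

LOW_BALANCE_THRESHOLD = 50.0   # dollars; the trigger point the user configured
RECHARGE_AMOUNT = 100.0        # dollars; how much to top up

@app.task
def check_auto_recharge(account_id: str, balance: float) -> bool:
    """Queue a recharge whenever an account balance drops below the threshold."""
    if balance < LOW_BALANCE_THRESHOLD:
        trigger_recharge.delay(account_id, RECHARGE_AMOUNT)
        return True
    return False

@app.task
def trigger_recharge(account_id: str, amount: float) -> None:
    """Stand-in for the real billing call that charges the customer."""
    print(f"recharging account {account_id} with ${amount}")

@app.task
def store_cdr(cdr: dict) -> None:
    """Stand-in for persisting a call detail record once a call ends."""
    print(f"storing CDR for call {cdr.get('call_uuid')}")
```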
I will come back later to how we manage each part. To talk about the stateful services: we run FreeSWITCH as the media server. This is what handles all the calls; all calls are parked here, outbound calls originate from here, and all incoming calls land here, so this layer is stateful. To give you an example, suppose you are on a phone call and your credits are about to run out. A user might want to play a message saying you have 30 seconds of talk time left, or you have X amount left in your account. That request comes in through a proxy, but we need to know where the call is, so there is state there. Then there is our conference calling service. If you are calling from somewhere in Europe, your call lands on an EU server; if you are calling from Asia, it lands on a Hong Kong server. Now suppose both callers want to be in the same conference room: that is a bridge across two media servers. Things like that mean it is not as simple as putting a load balancer in front; that won't work, and you cannot just pull machines out and drop machines in. We also use hardware here: our media servers are not on the cloud, primarily because hardware gives better performance for audio calls.

For data storage we use Redis and Postgres. I'm sure most people know the problems and the best practices here, but I will come to how we tackle them. With Postgres, the usual things you want to take care of are replication lag between your master and slave, how you promote a slave if the master goes down, and how you ensure all clients then connect to the new master. With Redis it is similar, but we do not store any persistent data in Redis, only in-flight data, and we run slaves for it. I will come back to how we avoid failure there, but it is not as critical as your Postgres database going down.

So this is what it looks like on the inside. The web stack I talked about, the proxies, Flask, Django and Celery, is all here, and it is fairly simple: you put a load balancer in front to distribute requests, and since it is stateless, if a machine goes down you can have auto scaling parameters that bring a machine back up.

What we do for stateful services is make every source of input to the stateful machines intelligent. For instance, take the example of a call. Like I said earlier, when you make an HTTP request to create an outbound call to a phone number, it comes to the proxy, which is stateless, and then to the business logic server, which tells you which media server to park the call on and which route to take. But in the proxy we have baked in intelligence: if the call does not fire, the proxy retries the next media server. The appliance in the diagram is meant to signify a proxy, and that is where we build the intelligence. That is one end. The point is that every source of input to a stateful cluster should have this intelligence built in, so that if a request fails, it tries the next best alternative.
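A rough sketch of that retry from the proxy's side, assuming the business logic hands back an ordered list of media servers; the endpoint, payload, and error handling are illustrative rather than the actual proxy code:

```python
import requests

def originate_call(call_params: dict, media_servers: list, timeout: float = 3.0) -> dict:
    """Try each media server in order; fall through to the next one on failure."""
    last_error = None
    for server in media_servers:
        try:
            # /originate is an illustrative endpoint, not the real media-server API.
            resp = requests.post(f"http://{server}/originate",
                                 json=call_params, timeout=timeout)
            if resp.ok:
                return resp.json()   # the call was fired successfully
            last_error = RuntimeError(f"{server} answered {resp.status_code}")
        except requests.RequestException as exc:
            last_error = exc         # server down or unreachable; try the next one
    raise RuntimeError(f"no media server accepted the call: {last_error}")
```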
We do the same thing on the SIP side: our SIP proxies are constantly sending an OPTIONS packet, which, to put it in bare networking terms, is the SIP equivalent of a ping. They constantly check whether our media servers are responding, and if one is not, it is automatically taken out of the production cluster; by taken out, I mean the proxy stops sending requests there. So if somebody makes a phone call that reaches our SIP proxy and is meant for a media server that has not been responding, that server has already been removed from the cluster and no requests are sent to it. The point being: for stateful services, we try to make every node that talks to the stateful service intelligent and capable of retrying.

For data storage you have the problems I talked about: how you promote a slave, replication lag, cross-region issues. It is the same for both Redis and Postgres. We have explored various ways to handle this. Redis Cluster is still far away, but there is Redis Sentinel. There are other approaches too, like switching DNS if you are on AWS, or virtual IPs, where two machines share the same IP with a heartbeat between them, and if one goes down the other grabs the IP, so your clients never need to be reconfigured. Those are the approaches we have evaluated.

In practice, failure detection matters most. Whatever failover scheme you have, the important thing is that you detect the failure as soon as possible and trigger your failover response as soon as possible. For that we do proactive monitoring. Take the database as an example, which covers the other cases as well: we are constantly running a write query against our master database. Once the master is detected as down, you have a slave you want to promote. Postgres takes care of the promotion, but how do you reconfigure all your clients to point to the new machine? The approach we have taken is that you need a good cluster management solution: something that knows about all the nodes connecting to the DB, and from which you can trigger one command that reloads the configuration of every DB client to point to the new master. Something like SaltStack or Ansible; there are various tools for that. In the next slides I will come to what we are using and how we reached the decision to use a particular tool.

Triggering failures in a staging environment is also very important. You can design your architecture very well and think of all the failure cases you can, but unless you are constantly triggering failures, there is a good chance you will miss an edge case you had not thought of before. So these things become important: proactively monitor your cluster, have a single point that knows which machines are in which cluster and can send commands to the entire cluster or a specific one, and trigger failures in staging.
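To make the write-query probe concrete, here is a bare-bones sketch of that kind of check, with an illustrative DSN, table name, and threshold; in our setup this logic lives in the monitoring system rather than a standalone loop, and the failover hand-off is only stubbed out:

```python
import time
import psycopg2

MASTER_DSN = "host=db-master dbname=health user=monitor"  # illustrative DSN
CHECK_INTERVAL = 5   # seconds between write probes
MAX_FAILURES = 3     # consecutive failures before we declare the master dead

def master_accepts_writes(dsn: str) -> bool:
    """Run a cheap write query against the master; a read-only slave would reject it."""
    try:
        conn = psycopg2.connect(dsn, connect_timeout=2)
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute("CREATE TABLE IF NOT EXISTS heartbeat (ts timestamptz)")
            cur.execute("INSERT INTO heartbeat VALUES (now())")
        conn.close()
        return True
    except psycopg2.Error:
        return False

def trigger_failover() -> None:
    """Hand-off point: promote the slave (the Postgres trigger file does this)
    and have the cluster manager repoint every DB client to the new master."""
    print("master declared dead, starting failover")

def monitor() -> None:
    failures = 0
    while True:
        if master_accepts_writes(MASTER_DSN):
            failures = 0
        else:
            failures += 1
            if failures >= MAX_FAILURES:
                trigger_failover()
                return
        time.sleep(CHECK_INTERVAL)
```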
Like I said, over the past ten months I have seen a lot of change here at Plivo. Earlier we had Shinken, a fork of Nagios, as the monitoring solution for everything, and we had our own implementation of Fabric scripts which, given a specific role, say the Django cluster or the Flask cluster, could spawn a new machine in that cluster. We also used them to update code: when we had to push something to production, we would push it to git master and then somebody had to go and manually run a script that updated the code on every machine in the cluster.

The problems were that this is prone to human error. We work with remote teams, as I'm sure many people do, and when a failure alert goes out people react immediately, so you can end up with two people running the same script, which, depending on your infrastructure, can create problems. And there was no orchestration, which is also a fairly important point. Take the database problem, where you want to reload all your database clients: you want some kind of orchestration, and with Fabric scripts that was not possible, at least in our implementation. So that was a pain point.

Over time we have shifted to what it looks like now. We use Sensu for monitoring. One big advantage of Sensu over Shinken is that earlier, whenever new nodes came up in a cluster or we launched new machines, somebody had to go to the server side and tell it about those clients so they would be monitored. With Sensu you do not have to do that: as long as the clients carry the check scripts and know about the Sensu master, they come up and are managed automatically, so it scales very well. With Shinken, every time a new machine came up, somebody had to commit to the Shinken configuration saying this new client is up.

We also moved to SaltStack as our cluster and configuration management tool. This is what we use for things like the database example I gave you: Sensu detects that your master DB has gone down, and the Salt master, which is aware of the entire cluster of DB clients, goes and reconfigures and restarts everything in one go. Besides that, it gives you other advantages: your source of truth is in one place, so you do not have to go look up what is in your infrastructure at any given time, and together with auto scaling groups it becomes really powerful, because each time you scale up or down the master can tell you exactly what is in your cluster.

We have also integrated Jenkins, and more than anything this has helped us build a good development process. Earlier, like I said, we used to push code to git and run Fabric scripts to update everything. That process is gone. Now we have a much more automated way to do it with Jenkins: once a developer, after running unit tests locally, decides this is the release candidate, they just create a release tag. Jenkins kicks in, first runs the unit test cases, then runs QA, and after that builds Amazon images with the new code base.
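Coming back to the database example for a moment: the one-command step that repoints every DB client might look roughly like this through Salt's Python API; the grain, state name, and pillar key are assumptions for illustration, not our actual states:

```python
# Runs on the Salt master.
import salt.client

def repoint_db_clients(new_master_ip: str):
    """Re-apply the DB-client state on every Django box in one shot, passing the
    new master address as pillar data so they all reconnect to the new master."""
    local = salt.client.LocalClient()
    return local.cmd(
        "role:django",           # target by grain; the grain name is an assumption
        "state.apply",
        ["pgbouncer"],           # hypothetical state that rewrites the client config
        tgt_type="grain",
        kwarg={"pillar": {"db_master": new_master_ip}},
    )
```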
The image build is needed not for deployment itself, but because, if a new machine has to be launched after a code update, it has to come up with the new image; that is why we do this build. Jenkins also packages the code: our code base, say Django or the proxy, is just a package stored in a custom repository, so you can do an apt-get upgrade. Jenkins does that step as well once all the test cases have passed, so your deployment just has to run apt-get upgrade on those machines.

pinfra is an in-house tool we built, inspired by the Heroku toolbelt. What it does is give you shell-based access to the Salt master to run a specific set of commands. The reason for it is that not everybody on the team needs to know how the Salt master works, and not everybody needs to log into the Salt master to run commands. So pinfra is a client-side interface that gives role-based access to various teams to issue commands to the Salt master. It is very much tied to our infrastructure right now, but it is something we will definitely look at open sourcing as we go.

The main benefit of all this is that no human intervention is required now when a machine goes down. We use AWS auto scaling groups for all the stateless clusters I described earlier, so if a machine goes down we get a hook on HipChat or an email just informing us that this machine went down for some reason, and a new machine is coming up from the launch config, with absolutely no human intervention. That makes the process far less error-prone than it was with Fabric scripts. We have also configured our auto scaling groups so that whenever a machine comes up, the user data script, the first thing an Amazon machine runs, turns it into a Salt minion. That means our cluster management is immediately aware of every new machine that comes up in the cluster.

Automated deployment has also been a big help. Like I said, instead of using Fabric scripts and updating manually, all our developers need to do is create a release candidate, and Jenkins takes care of the rest, from building packages to Amazon images. I don't know if this slide is very visible, but it is meant to show what it was like earlier and what it is like now. The image on top is from when we were using Fabric scripts: a lot of painful process, where you had to manually answer at least ten questions that I can count there. Now, with SaltStack and everything we have built, once Jenkins has rebuilt your package with the new code, all you have to do is use the pinfra tool, the client-side interface to the Salt master. You give a command like pinfra deploy with the cluster name, that command goes to the Salt master, and the Salt master rolls it out in batches whose size you define depending on the size of the cluster. Say you have 10 or 20 machines in a particular cluster: you can tell the Salt master to take out one machine at a time sequentially, or give it an argument to take out two at a time; it takes those machines out, updates the code, restarts the service, puts them back in the cluster, and then repeats batch by batch.
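A sketch of what that batched rollout looks like when driven through Salt's Python API; pinfra is essentially a restricted wrapper around commands of this kind, and the grain and state name here are illustrative:

```python
# Runs on the Salt master.
import salt.client

def rolling_deploy(cluster_grain: str, batch: str = "2") -> None:
    """Upgrade a cluster a couple of machines at a time: Salt takes a batch out,
    applies the deploy state (apt-get upgrade plus a service restart in our case),
    and only moves on once the previous batch has reported back."""
    local = salt.client.LocalClient()
    for returns in local.cmd_batch(
        cluster_grain,        # e.g. "role:proxy"; the grain value is illustrative
        "state.apply",
        ["deploy"],           # hypothetical state that upgrades and restarts the service
        tgt_type="grain",
        batch=batch,          # "2" machines at a time, or a percentage like "25%"
    ):
        print(returns)        # per-minion results stream back batch by batch
```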
Besides code deployments, this tool also helps us push a config change: say you want to change something in rsyslog, or revoke somebody's keys, or revoke SSH access in the config files. We can do that with it, because the Salt master just pushes the new file out to the cluster of your choice. That's it; any questions?

Q: You are doing real-time communication; a voice call is obviously real-time, and calls can be across the globe. I understand you are using AWS infrastructure...
A: Not for the media servers. AWS is for the HTTP cluster, the proxies, the database, but the media servers are not on AWS.
Q: So how are you maintaining your media servers? I'm curious how you identify the nearest node for a global caller; it could be anywhere from India to the US.
A: Right. We have media servers across three continents; right now that is Europe, the US, and Asia. There can be various call flows, and we decide based on that. To give you an example: suppose you make an API request saying, call this number in India. Based on the destination number, the business logic server will route it to the Asia servers. Similarly, when an incoming call comes from a SIP phone (a SIP phone is just a client, something like Skype, that you can register against our service and make calls from), we look at the IP address the call is coming from, so if you are sitting in the US and calling, it lands on a US media server. So two things, basically: IP based and destination-number based.

Q: In your slides you talked a lot about cluster management. Have you thought of using something like Mesos and Marathon as your cluster manager?
A: No, those are not tools we evaluated. We looked at Ansible and SaltStack, as opposed to what we were doing earlier. We are a small team, so we look at the options and adopt whatever serves our needs fastest, and SaltStack was doing that well.

Q: I have two questions. First, why did you choose SaltStack? And second, how do you make Redis and Postgres highly available, and what problems did you face doing that?
A: For Postgres we have a master and a slave, and our monitoring system runs a write query against the master. We define a threshold, say after this many failed queries we declare the master dead, and then the slave gets promoted to master. After that you have to update all your clients, for instance the Django cluster that talks to the database.
Q: How do you handle the promotion? Is it automated or manual?
A: The promotion is automated; it is the trigger file in Postgres that does that. And central cluster management lets you immediately update all the DB clients with the new configuration, so they know who the new master is right away. As for Redis, I know there is Sentinel, but we are not using that.
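Back to the routing question for a moment, here is a toy version of that region selection; the prefix table and the GeoIP remark are illustrative, not our real routing data:

```python
# Outbound API calls are routed by the destination number's country prefix.
# This table is a tiny illustrative subset, not real routing data.
REGION_BY_PREFIX = {"+1": "us", "+44": "eu", "+49": "eu", "+91": "asia", "+65": "asia"}

def region_for_destination(number: str, default: str = "us") -> str:
    """Pick a media-server region from the dialled number's prefix."""
    for prefix, region in REGION_BY_PREFIX.items():
        if number.startswith(prefix):
            return region
    return default

# For registered SIP endpoints the same decision is made from the caller's
# IP address instead, for example by looking it up in a GeoIP database.
```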
For Redis we have a master and a slave, and we do the same thing we do with Postgres: we point all the clients connecting to Redis at the new master. And we do not store very critical data in Redis, so it is not as high a priority as Postgres would be. Do you have another question?

Q: What was the reason behind choosing SaltStack over any other technology?
A: Basically this was a move away from Fabric scripts and from doing things manually, and most of the team is very fluent in Python, so SaltStack was an easy winner. It did the things we needed it to do.

Q: I'm not quite clear how you are managing high availability on the media servers. Can you explain that?
A: Yeah. To abstract it out, the input to a media server comes in only two ways: one is SIP and one is HTTP. HTTP is when you fire an outbound call: it comes to a proxy over HTTP, and from the proxy it goes to the media server. There, what we have done is make the proxy intelligent, because our business logic returns a set of media servers. The proxy tries the first media server, and if the call does not work there, or it does not get a response saying the call was successfully fired, it tries the next media server. That is the HTTP side. On the SIP side, like I said, we have SIP proxies, and we do not expose the media servers directly to our carriers. Carriers are the people we provision phone numbers from, so when somebody calls a phone number you bought from Plivo, the call comes into our infrastructure from a carrier. It first hits a proxy, and depending on, say, whether it is a US phone number that was called, the inbound SIP proxy knows about the media servers in the US cluster. That SIP proxy is continuously sending an OPTIONS packet, the equivalent of a ping, to the US media servers, and if it does not get a response it removes that server; by remove, I mean it removes it from the set of media servers it considers healthy and active. So if a call comes in, it will never be sent to that media server, because it has already failed: the OPTIONS packets got no response, so the proxy considers the FreeSWITCH process running there dead and will not send a call there.
Q: Is that a hardware proxy?
A: No, the proxies are also on Amazon Web Services. We use OpenSIPS, which is a SIP proxy, for both the inbound and the outbound SIP proxies. That is not on hardware, that is on Amazon.

Q: You mentioned that whenever the primary Postgres server fails over to the secondary, you ensure that all the nodes, however many you have, switch over to the secondary. Instead of doing that, did you consider using DNS, so that you just switch the record?
A: Yes, this is something we have evaluated, but you cannot trust AWS DNS propagation all the time; it is slower than this. We have evaluated both options. The communication between the Salt master and the minions is over ZeroMQ; it is parallel and it is really fast. So it is way faster to serve a new PgBouncer file (or the equivalent, if you are using some other database) with the new configuration to all your DB clients and just reload the service.
We have evaluated this, and it turns out to be way faster than hoping for DNS propagation to happen within the advertised time, which it never does.
Q: Okay, thank you.
A: Thank you.