Hi everyone, I'm Vishal Parveer and I run a company called Cloud Cover. Actually, we have two companies: Act Development is a software development company, and we've been building stuff for a while now. I started Act Development in 2002, so we've been around for 12 years. And Cloud Cover is a new thing that I've started, which deals with cloud consultancy. We're cloud agnostic, so we help companies achieve scale on all the various cloud offerings.

Could I have a quick show of hands? How many folks are working exclusively with cloud hardware? And is anybody here who doesn't work on any cloud stuff at all? Maybe you just need a bit of help. Sure, as loud as you like. Anyone who's not working on any cloud stuff? What's wrong with you guys? Come on, get on to the cloud.

Alright, so I'm going to jump straight in here, then back it up and come into each of the individual elements, as to why we chose to do the things we did, and specifically talk about the problems that we had. But before we get to that, let me just tell you what this product is. This is a customer of ours, and I'm kind of their acting CTO right now because they don't have anybody who fits the bill. They've been our customer for five years now, so we've scaled with them, and we've got them to the point where they're capable of doing the work that they do.

PaySmart is an electronic pre-paid recharge system. What that means is that they do top-ups for your dish TV, your Tata Sky, your Airtel, and also all the mobile carriers, so Vodafone and Tata DoCoMo as well. They basically serve everybody: any of the major satellite operators, any of the major telephone operators in India, they offer those services. This is a B2B model. They have literally 50,000 different retailers all over India, they have 2,500 distributors, and they're in every single town. The way it works is you send an SMS to PaySmart, or you use the Android application, or you use the website, and all that stuff ends up at PaySmart, and then we talk to each of the different operators. So there are separate integrations with every single one of the carriers, and every single one of the carriers has a different API, and every single one of them is using a different backend, and effectively it's a nightmare. Every single one of them is horrible.

So, things you have to keep in mind about the complexity of the system, the reasons why it was hard for us to scale: there are lots of concurrent connections. A typical top-up would take at least three separate HTTPS connections in which we are polling some service and waiting for a response, and in some cases the SMS is coming in from an external aggregator. In some cases, obviously, it's coming in directly from the website or from an Android application, in which case it's a little easier for us to handle. When we started out, all this stuff was synchronous, and one of the learnings that we had was: let's switch everything to being as async as possible. I'll get into some of the ways we achieved that; there's a small sketch of the idea below. Some of the vendors were absolutely horrible. Some of them took 60 seconds to respond. I'm not naming any names, but there are certain TV companies down south whose systems are absolutely horrible. I don't even know how they managed to get this far. It goes down every day.
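Just to make the async point concrete, here's a minimal sketch of polling several slow vendor APIs concurrently instead of blocking a thread per request. This isn't PaySmart's actual code; the vendor URLs and the asyncio/aiohttp approach are just my illustration of the idea.

```python
import asyncio
import aiohttp

# Hypothetical vendor endpoints; the real integrations each have their own API.
VENDOR_URLS = {
    "dishtv": "https://api.example-dishtv.in/topup/status",
    "tatasky": "https://api.example-tatasky.in/recharge/poll",
}

async def poll_vendor(session: aiohttp.ClientSession, name: str, url: str):
    # A slow vendor (60-second responses) only ties up this coroutine,
    # not a whole OS thread and its stack of RAM.
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=90)) as resp:
            return name, resp.status, await resp.text()
    except asyncio.TimeoutError:
        return name, None, "timed out"

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(poll_vendor(session, n, u) for n, u in VENDOR_URLS.items())
        )
    for name, status, body in results:
        print(name, status, body[:80])

if __name__ == "__main__":
    asyncio.run(main())
```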
And the thing is, when we first built the system, we had multiple services sitting on the same instances, so you could have cases where one vendor goes down and takes down half your system. It was absolutely horrible. But we've grown past that.

Okay, so I really hate cricket, and not because of the sport. The sport, you know, we had Azharuddin and all of that, and that kind of spoilt the topic for me. But leaving that aside, as far as PaySmart is concerned, we have massive traffic spikes every time there's an IPL starting up, or a Champions League starting up, or an India-Pakistan match. The number of transactions we see on a daily basis can double or quadruple compared to a normal day. Month-end closing is always bad, and the absolute worst is the time we're entering right now, which is year-end closing, where everyone is trying to close their books at the same time. And since this is a transactional B2B system, obviously there's finance involved, and obviously you're going to have lots of reports that need to be generated around this time of year. So these are the difficulties we faced, and all of that led to massive amounts of failure.

Okay, we screwed up big time. At one point in time we couldn't even log on to the instances, there was so much traffic coming through. How many times have you had an instance that you can't even SSH into because it's just absolutely destroyed by your application? We faced that.

So why exactly did it start failing? As a totally straightforward thing: when I say fail, I mean the application crashed or the service tanked out. What it basically amounts to is that you can't respond to the user in time, the dreaded 500 server error. When that starts happening, you have to look a little further to figure out why, and these are the typical reasons something like this happens: you're constrained at a resource level. You have only a finite amount of compute, a finite amount of memory, a finite amount of storage, database and network bandwidth, and these things in conjunction cause issues. In most cases it's simply that they can't deal with the traffic you're putting on them. It might be because your application's not optimized enough, or, as is also often the case, you just haven't budgeted for that much traffic.

Given that, and given that we've now looked at these problems from various angles (we've been doing this not just for PaySmart but also for other companies with similar problems of scale), 95% of the time the problem comes down to the database. It's almost the guaranteed failure point, because they haven't got the appropriate level of caching, or their queries are too slow, or they're using too many inner joins. There are a lot of issues in the way the application itself is built that there is very little you can do at a DevOps level to fix; you have to actually fix the code. So it basically comes down, again, to disk and compute and memory, and the various aspects of how much and how little you should have in order to have a system that is reliable.

So the first solution that everyone obviously comes up with is: let's just throw money at the problem. What does that mean?
That means let's add more instances, let's increase the size of the instances, let's basically throw more money at the problem and hope to God it doesn't fail. This, obviously, is very, very expensive. At one point we were spending upwards of $25,000 a month just to keep the application up and running. And to give you an idea of how far we were from where we should have been: today, with even more capacity, we're running at a run rate of about $6,000 a month. That was the saving we achieved just by fixing the problems behind the scenes. Throwing money at it is only useful as a band-aid; it is not sustainable. At some point your CEO is going to come knocking and say, fix the problem, because I cannot afford to run this business if you're charging me so much just to serve the hardware. There's a very real business problem underlying this. This is not just "hey, it's a technical issue"; it's actually going to matter in terms of profitability at some point.

So the second solution, unfortunately, is to fix the problem. I can't think of anything else you can do short term besides throwing money at it, but that will buy you enough time to hopefully start fixing these things, and there's a lot of stuff to fix. What I'm going to do is take you through the steps that we took. We didn't necessarily do them in this order, but I've organized them in, firstly, the most logical order I could think of, and secondly, hopefully the order in which it is easiest to start, so you start gaining results right away.

The first thing we started doing was caching static content. This seems like an absolute no-brainer. I'm sure a lot of you are already doing it; if you aren't, you really should, because this is something that costs you very, very little (in fact it's probably going to drop your costs), it is very, very easy to achieve, and it genuinely reduces the amount of work your web server has to do. A content distribution network is basically a place where you can upload static content and have it served from edge locations all over the world. You've got multiple options in terms of which cloud or which provider to use; I've just listed four of them, and the Wikipedia page literally has like 50, so you won't have any trouble finding one that fits your bill and achieves what you're looking for. You don't need a web server for a lot of this stuff: CSS files, JavaScript files, images, God forbid Flash files, all these things are static content. Once you've built them, you save them, they're not changing, there's no execution occurring at the server. So these are very, very good candidates to just push onto a content distribution network. Change the link that you're using to access that resource, and there you go: 20, 30, 40 fewer requests going to your web server every single time somebody loads the page. How many of you are building web applications? How many of you are administering web applications? Every single one of you: look at your web page and tell me how many of those requests are CSS files, how many are JavaScript files, how many are image files. Every single one of those shouldn't be going to the web server, shouldn't even be asking the question of your web server.
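As a minimal sketch of what that migration looks like in practice, assuming an S3 bucket fronted by a CDN (one of many possible setups; the bucket and CDN hostname here are hypothetical), you push the static files up with long cache lifetimes and rewrite the links:

```python
import mimetypes
import boto3

s3 = boto3.client("s3")
BUCKET = "paysmart-static-assets"  # hypothetical bucket name

def publish_asset(local_path: str, key: str) -> str:
    # Guess the content type so browsers treat CSS/JS/images correctly.
    content_type = mimetypes.guess_type(local_path)[0] or "application/octet-stream"
    s3.upload_file(
        local_path,
        BUCKET,
        key,
        ExtraArgs={
            "ContentType": content_type,
            # Static assets never change once built, so cache them aggressively.
            "CacheControl": "public, max-age=31536000",
        },
    )
    # The page's <link>/<script>/<img> tags then point at the CDN,
    # not at your web server.
    return f"https://cdn.example.com/{key}"

print(publish_asset("build/css/app.css", "css/app.css"))
```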
Number two: managed services. If you're in the cloud and you're running your own database server on a bog-standard instance, my recommendation is get the hell off it, because there are managed services available from pretty much every one of the cloud providers, and a whole bunch of other people besides. People like Engine Yard, people like Heroku are offering you specific services that take care of your database workload. They are very, very good at administering them, very good at monitoring them, very good at scaling them. Don't try to reinvent the wheel; use stuff that other people have built, that is already operating at scale.

In our particular case we were using MySQL, so we used Amazon's RDS, the Relational Database Service. We used that to create read slaves, we used it to create read replicas: you can create a copy, and then create a copy of that copy, which is fantastic for reporting, because you don't need that data coming from your live server. And it takes care of creating the instances easily and effectively, at the push of a button or through a script, so it gave us the freedom to concentrate on other stuff. RDS is basically a managed service specifically for databases. You have the option of using MySQL, Postgres, Oracle or SQL Server, you choose the instance type you want, and it basically fires up an instance that's fully configured and ready to go. They give you an endpoint, you have a username and password, and your application starts consuming the database. You can change the size whenever you feel like it: you can change the size of the instance itself, the compute and the memory, or you can change the size of the hard drive dynamically. It takes backups automatically, and those backups are kept for a duration of time, so you can use them for point-in-time restores. If, God forbid, something horrible happens to your instance, you can get another one in about 10 minutes, restored from the latest backup data, so in terms of downtime it's insane. You can basically say: this is my master server, I need a read slave that I can start using for just read queries. Right-click, create read replica, wait 10 minutes, and it's there. You take away that whole problem of "oh my God, I have to worry about backing up my database, how do I transfer the data". These problems bog us down, and really, we have no business doing these things, because other people have already solved this problem. Why are we reinventing the wheel?

So we've used RDS exclusively for our database stuff; I can't say enough good things about it. There are equivalent services from all the other major clouds and from a lot of the smaller players as well. You want MongoDB? That's available on Rackspace: just go over there, start up an instance, and you're ready to go. You want SQL Server? It's available on Azure. You've got so many different options these days. You really need to go do your research, find a managed service for your database, migrate your data over there, and you will see benefits in the amount of time it takes you to administer it and the amount of flexibility that you have. These resources become fungible, because any time you need another one you can just right-click and create another one. It's not that huge headache of "how do I migrate my database"; it kind of takes away that problem.
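For instance, the "right-click, create read replica" step can equally be scripted. A minimal sketch with boto3, assuming an existing RDS instance; the instance names here are hypothetical, not PaySmart's real ones:

```python
import boto3

rds = boto3.client("rds")

# Spin up a read replica of the master, e.g. to point reporting queries at,
# so report generation never touches the live transactional database.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="paysmart-reporting-replica",  # hypothetical name
    SourceDBInstanceIdentifier="paysmart-master",       # hypothetical name
    DBInstanceClass="db.m5.large",
)

# Block until the replica is reachable, then hand its endpoint to the app.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="paysmart-reporting-replica")
desc = rds.describe_db_instances(DBInstanceIdentifier="paysmart-reporting-replica")
print(desc["DBInstances"][0]["Endpoint"]["Address"])
```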
[Audience question: how much does it cost?] It's typically about the same as what you would pay for a normal instance. What's that? Okay, let me give you the most expensive one they've got: if you're running Oracle on RDS on a large server, it's about 54 cents an hour, and that's with the Oracle license bundled in. So you don't have to buy the Oracle license; it's charging you for it by the hour, which means that when you stop the instance, you stop paying for Oracle too.

The next point is elasticity. This looks super simple; it's the hardest slide. Doing this is the hardest, hardest part, because you have to make sure that your application, at whatever point you want to start scaling it out, is actually capable of running in parallel. That means things like sessions need to be stored outside the web server, things like your database need to be common, because you're basically cloning the web server once, twice, four times in order to scale horizontally, and then you're load balancing. You're taking every single request that comes in and saying: okay, I want you to go to server one, I want you to go to server two, server three, server four. And it has to be smart enough to do that without screwing up the user experience, which means your application has to be clever enough to deal with the fact that I'm not going back to the same server every time. When I hit F5, I might go to a different server, and that server has to be just as capable of answering that user's request as the first one. This took time for us to figure out. Luckily we were using external session state management and we already had a common database, so this part was not a killer for us, but I've had enough problems dealing with customers of ours where they don't have this stuff sorted out to begin with, and having it is a huge, huge asset. Hopefully your application and the operation you've got going already has some of this sorted out, but this is a killer if you don't have it. This is step one to being able to scale.

And again, auto scaling. Let me touch upon auto scaling here. Auto scaling is a concept that is very cloud specific, because it requires a certain amount of virtualization to work effectively. What it means is you set an arbitrary parameter, whatever you think is right, and actually figuring out what that parameter is, is also very difficult. You have to choose, based on your application, the point at which this server cannot serve one more guy: if somebody else came and asked this server for information, it would fall down, it would die. At that point (hopefully not at that point, maybe at 75% of it) you want another server to start up automatically to deal with the next request that comes in. And more important than that is to remember that when the demand goes down again, that server needs to shut off as well, because again, it comes back to price. Right now we have three separate auto scaling groups at PaySmart, all dealing with different aspects of the application I showed you; I'll give you a network diagram of how we've built it at the very end. So you basically figure out: at, say, 50 requests a second on this box, if I put one more request on there it's going to start dying, so I need another server up right now.
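Wiring that threshold up is roughly a two-step affair on AWS: a scaling policy that says "add one instance", and an alarm that fires the policy when your chosen metric crosses the line. A minimal sketch with boto3; the group name, metric and threshold here are placeholders, not PaySmart's actual configuration:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Policy: when triggered, add one instance to the group. A mirror-image
# policy with ScalingAdjustment=-1 shrinks the group again when load drops.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="paysmart-web",   # hypothetical group name
    PolicyName="scale-out-by-one",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# Alarm: fire the policy when average CPU stays above 66% for two periods.
cloudwatch.put_metric_alarm(
    AlarmName="paysmart-web-scale-out",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "paysmart-web"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=66.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```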
You build that script, you figure out how to measure the point at which you want to do that, you figure out how you want to scale, and the cloud offers you APIs that let you scale. The new instance that starts up gets added to the load balancer and just starts handling requests, and the second it's done, you remove it from the load balancer and drop the instance. Does that make sense?

Number four: cache. This is hard. Not as hard as the last one if you haven't done it, but it's still hard. Before that, though:

[Audience question: you said the box can serve 50 requests and that one more means you should scale. How do you calculate that number, and how does it work?]

So, when I say that it can serve 50 requests: today we have a mechanism that tells me how many concurrent requests it can serve, and I'll get into how we got to the point where we can actually figure that out. It's not something available to you as a metric out of the box; it's not something where you can just say "hey, this is the parameter I want to use, and when this parameter gets hit, scale the instance." Usually scaling is done on things like CPU utilization, RAM utilization, or how many concurrent requests there are. These are the things you can get from your application or from your VM layer, and those are fine, and maybe they'll work for you. We have had a bad time scaling on CPU, because of our application's behaviour: if you remember, in the beginning I mentioned that a lot of the vendors take a long time to respond. So if I'm communicating with, say, Dish TV, and Dish TV has a problem and is taking two minutes to respond back to me, what happens is that now I have a thousand threads, all of them eating up RAM, all of them just sitting there waiting for a response from Dish TV. That takes up no CPU at all, but my RAM is gone, there's no space left, and the next request that comes in fails. You have to fit your scaling triggers to what you feel are the appropriate parameters. So for that particular box, what we're looking at now is: how long is Dish TV taking to respond? For us, that is the most important parameter, so we're using that to decide whether to scale or not. You have to figure out what metric, or what combination of metrics, to use in order to scale correctly. It's not straightforward, and anyone who tells you it's straightforward hasn't done it in practice. When you're actually doing it, you realize you can't just say "oh, scale when CPU hits 90%"; by then it's already too late, and the application has already failed before it hits 90%. Finding those parameters is not easy, but once you've got them right, you've fixed the core underlying problem, you've fixed the fundamental issue, and that makes the rest of it much easier to deal with.
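Using a custom signal like vendor response time means publishing it yourself, since the VM layer can't see it. A minimal sketch of what that could look like with boto3 and CloudWatch custom metrics; the namespace, dimension names and vendor URL are my invention:

```python
import time
import boto3
import requests

cloudwatch = boto3.client("cloudwatch")

def timed_vendor_call(vendor: str, url: str) -> requests.Response:
    # Measure how long the vendor takes to answer this request.
    start = time.monotonic()
    resp = requests.get(url, timeout=120)
    elapsed_ms = (time.monotonic() - start) * 1000

    # Push the latency as a custom metric; a CloudWatch alarm on this metric
    # (instead of on CPU) can then drive the auto scaling policy.
    cloudwatch.put_metric_data(
        Namespace="PaySmart/Vendors",  # hypothetical namespace
        MetricData=[{
            "MetricName": "ResponseTimeMs",
            "Dimensions": [{"Name": "Vendor", "Value": vendor}],
            "Value": elapsed_ms,
            "Unit": "Milliseconds",
        }],
    )
    return resp

resp = timed_vendor_call("dishtv", "https://api.example-dishtv.in/topup")
```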
We use this metric now, which is cost per unit of work, and a unit of work is whatever you define. In our case, a unit of work for PaySmart is one complete transaction: go and get an SMS, deal with the financial implications of it within our system, forward it to the vendor, get the response from the vendor, and send the response back to our retailer. That entire leg is one unit of work. What does that cost in terms of hardware? In terms of software? In terms of raw computing power? You add all of that up. Figuring that out was the most useful thing we did, because once we had that number, when it went down we knew we had done a good thing, and when it went up we knew we had done a bad thing.

For example, we introduced a feature to the web application that drove that number up by 20%. A feature, right? It effectively ended up being a bug. It was a new graph on the dashboard, and every single person logging in was hitting that graph, and every single hit was causing our CPU to spike slightly. When you amalgamated it over time, that thing cost us 20%, which is insane for a small feature like that, which nobody had really asked for. It was just one of those things developers thought would be cool, so they put it on the dashboard, and it started screwing things up. We forced them to optimize it, moved it into the cache, and then it started working fine.

Okay, so: cache is like magic if you do it right. All those queries that were taking forever to execute suddenly, magically, come back in milliseconds, because what's happened is you've taken something that used a spinning disk to get the answer and moved it into RAM, and when it's in RAM, access is almost instant. I was having a conversation at a JavaScript conference in Bangalore last year with one of the guys building BookMyShow, and they hit a hard wall in terms of their scale as well. What they did was move everything into Redis, the vast majority of their queries, and suddenly, magically, BookMyShow doesn't go down anymore, because they've taken care of the bottleneck. The bottleneck, like I said at the beginning, in almost all cases ends up being the spinning disk behind the RDBMS, because that thing is slow. It's always been slow, and it's going to stay slow until we start using SSD-backed storage, and even then it's still going to be slower than RAM, no matter what you do. So caching always makes sense where it's possible. It is expensive, because you're putting your data into RAM, and RAM is not cheap, and RAM is volatile: if that server dies, you've lost that data. But if you can figure out what to put into RAM, it can make the difference between life and death as far as your application is concerned.

Figuring out what to put in RAM is another thing. We looked at the top 50 queries that were taking too long, and what we realized was that in all of them, the thing taking too long was checking whether this user was allowed to buy this product. That check required us to go to the database, pull out all the products he had access to, and figure out whether this particular product was in that list. Then we'd do another query to find out whether the guy had enough balance to make the purchase, then another query to check whether he was blocked in the system. The first thing we did was combine all these queries into a single query. That improved performance a little, but it was still a spinning disk at the end of the day, and all that had happened was that instead of firing three separate queries we were firing one query with lots of joins, so it didn't really give us the performance boost we were hoping for. What we ended up doing was moving that entire dataset, of whether that user was allowed to access that particular resource, into the cache.
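A minimal sketch of that pattern: cache the per-user entitlement object in Redis, keep the SQL database as the canonical store, and only rewrite the cache when the underlying data changes. The key names and helper functions here are illustrative, not the actual PaySmart code:

```python
import json
import redis

r = redis.Redis(host="cache.internal", port=6379)  # hypothetical cache host

def load_entitlements_from_db(user_id: int) -> dict:
    # Placeholder for the real (slow) SQL query against the canonical store.
    return {"allowed_products": ["dishtv", "tatasky"], "blocked": False}

def can_buy(user_id: int, product: str) -> bool:
    # Fast path: answer straight out of RAM in sub-millisecond time.
    cached = r.get(f"entitlements:{user_id}")
    if cached is None:
        # Miss: fall back to the database once, then keep the object cached.
        data = load_entitlements_from_db(user_id)
        r.set(f"entitlements:{user_id}", json.dumps(data))
    else:
        data = json.loads(cached)
    return not data["blocked"] and product in data["allowed_products"]

def update_entitlements(user_id: int, data: dict) -> None:
    # Writes hit the database first (canonical), then refresh the cache,
    # so losing the cache server never loses data.
    # ... write `data` to SQL here ...
    r.set(f"entitlements:{user_id}", json.dumps(data))
```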
Now I have an object for every single user in my system, and when the question gets asked, "am I allowed to buy this?", the response comes back in sub-millisecond numbers, and the database just stopped getting hit. Magically, it just stopped getting hit. We're still using the RDBMS at the back to store that data; if we ever change it, we still write it back to the DB. But rather than every single transaction requesting that data from the DB, we request it once, when the data changes, and that's when it gets stored in RAM. By using SQL as your canonical store, you take away the problem of the volatility of RAM, because you're never changing the data in RAM first: you're always changing it on disk and then updating the RAM. What that guarantees is that because the data is already on disk, if the cache server died, I haven't lost anything, because all the data is still in the database.

Loose coupling. Loose coupling is this concept where, rather than having your application call things in sequence, and rather than having all these parts live inside a monolithic application, you split it up into pieces based on units of growth. In our case, we split it up like this: there's the SMS piece, which deals with getting all the SMSes from the aggregators. Then there's another piece in the middle that deals with the transaction itself (cutting the money from the guy's wallet, making sure he has the appropriate funds); all that is the transaction engine, the core of the system. And then there's another piece, the part that actually has to talk to all the external services: the one talking to Videocon, the one talking to Dish TV, the one talking to Tata Sky. All of those are now separate services. So we broke the application up, exploded it into as many pieces as we could think of, and made them into small, small little modules.

Now you've got all these little modules and you want them to talk to each other. How do you make them talk to each other? We use queues, and we use publication and subscription. How many of you know what pub/sub is? It's super simple, really quick. All it is: something publishes an event, and any number of applications can subscribe to it. Let's say you receive an SMS: you publish the fact that you've received an SMS. I have one subscriber whose job it is to write that SMS to the DB, one subscriber whose job it is to write it to the cache, one subscriber whose job it is to write it to a log, and so on. Tomorrow, if my CEO says "hey, listen, if you get an SMS I need to be able to send a notification out to this particular other service," that's really easy now, because it just becomes one more subscriber. That changes the way you start thinking about architecting the application, and it also gives you this amazing flexibility to deal with the problems your business keeps throwing at you, if you split all the various activities of your application into individual publications. We have literally thousands of publications now.
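A minimal sketch of that pub/sub shape, using Redis channels as the transport (the talk doesn't name the broker PaySmart actually used, so Redis here is just one convenient way to show the pattern; the channel and field names are hypothetical):

```python
import json
import redis

r = redis.Redis()

# Publisher side: the SMS receiver announces the event and moves on;
# it neither knows nor cares how many subscribers exist.
def on_sms_received(sms: dict) -> None:
    r.publish("sms.received", json.dumps(sms))

# Subscriber side: each small module listens independently. Adding the
# CEO's new notification service is just one more process like this one.
def run_subscriber(name: str) -> None:
    pubsub = r.pubsub()
    pubsub.subscribe("sms.received")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue  # skip subscribe confirmations
        sms = json.loads(message["data"])
        print(f"[{name}] handling SMS from {sms.get('phone')}")

on_sms_received({"phone": "+919800000000", "body": "topup.pwd.dishtv.100"})
```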
We've actually built a tool that sits over there and subscribes to everything, and I have a console log that shows me, in real time, every single thing happening on all my servers. Because they're all publications with subscriptions, I get those notifications in my browser window, and it keeps scrolling, because we've got so many transactions happening. The value of having that really fast scrolling list is that when errors start occurring, you start seeing rows get marked in red as they happen. So if you see a stream of errors suddenly start coming, you know something is up, one of your servers is misbehaving. Obviously we now have automated stuff that subscribes to these things and deals with the errors individually, but seeing it visually made a huge difference initially.

So, yes: publication/subscription, and queuing. I think all of you should know what queuing is: you build a queue and services pull items out of it. It's a way of using many, many small services to deal with the same dataset. In our case, every single one of these transactions has to be processed, whether it came from Android, from the web, or from SMS. So I put it in a queue, and the appropriate services pick up those transactions, deal with them, and put them back into another queue, and that queue is then processed to decide whether we have to send a notification over the web, whether we need to send an SMS out, and so on. All those things are handled from there.

So we have split everything up into publications and queues. The value of that is it allowed us to rebuild our auto scaling, because now, all of a sudden, we had all these small parts. Before, we had an application that lived on one massive server: it was able to auto scale, but I was auto scaling this huge server every time, so when I ran out of space, when I hit that threshold and had to scale, I had to scale up another huge instance. Now, when I hit the threshold, I'm hitting it for just the Tata Sky service, or just the Sun Direct service, and only the Sun Direct service starts scaling, which is a small little instance. So again, my costs went down, because I had all these little pieces to scale rather than one big application.

I'm going to go back here: consider NoSQL. It's not some magic bullet that solves the problem if you shoot it at your application, but it does help in specific instances. It made a huge difference for us, for instance, when we started using DynamoDB. We had been using SimpleDB for a very long time, and we've now started switching to DynamoDB, because SimpleDB is done. It took load away from our database, and it gave us the flexibility to keep adding more and more rows without having to worry about the schema. We didn't have to think about "oh, this particular vendor of ours returns a completely different type of data from this other vendor": I was able to take all that data, irrespective of whether the fields matched, and put it into SimpleDB. It allowed me to basically worry about the problem later, and that let us scale without killing our developers, and that's important as well. We've used MongoDB in the past, and we've used a lot of Redis, which is not on this list but is also very, very useful as a key-value store. I encourage you to look for the right tool for your job. Each of these things does something totally different; each makes a different trade-off in terms of consistency, availability, or partition tolerance, and you have to choose which ones you care about. It's not an easy choice, and it does take a lot of effort to figure out the right stuff, but it is valuable, and it does actually give you a benefit in terms of scale.
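To show what "not worrying about the schema" buys you: two vendors can return completely different fields, and both go into the same table. A minimal boto3/DynamoDB sketch, with a hypothetical table name and made-up items:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("vendor-responses")  # hypothetical table name

# Two vendors, two completely different shapes: only the key attribute
# ("txn_id" here) has to exist; everything else is free-form per item.
table.put_item(Item={
    "txn_id": "t-1001",
    "vendor": "dishtv",
    "status_code": "OK",
    "operator_ref": "DTV/998877",
})
table.put_item(Item={
    "txn_id": "t-1002",
    "vendor": "videocon",
    "result": 0,
    "message": "RECHARGE SUCCESS",
    "balance_left": "412.50",
})
```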
Once you've got all the other stuff, you have to optimize the cost, and you do that by reserving capacity. This is the case for most cloud services, but it also makes sense for specific instances that you've got running, even local stuff: if you've got hardware that you bought in a data center, you have to choose how much of it you need. Resource planning is super, super important. In our case we have to choose what to reserve and how much to reserve, because at peak load, on a Sunday evening of an India-Pakistan match, I might see 600-700 transactions a second. At that point it's too late, I have to scale, but I can't afford to reserve all those instances, because I'll use them only once or twice a year. So you only reserve the stuff that you know is your baseline, what you will use every single day, no matter what, come rain or shine. This is an iterative process: you reserve only what you're sure of, and then you reserve more later on, when you see that a particular thing is getting used a lot. And again, if you size your instances correctly, it allows you to scale in smaller increments. You use very, very tiny servers that are capable of doing just that one job, and you have many, many of them, because you have no limitations on how many: in the cloud you can have hundreds of servers running, nobody cares. But each of those servers has to be sized right, otherwise you'll be paying for much more compute than that job needs.

So, let's talk about PaySmart today, after we did all this crap. That's how many stores we serve all over India; some of them are, you know, off in the islands of India. That is the footprint of the retail locations we've got. And this is our architecture. It's highly simplified for this slide; there are many other moving parts, and there's stuff you obviously can't see, like our development environment and our staging environment, but by and large this is what we've done. Firstly, this entire thing is running on Amazon. We use Route 53 for DNS resolution, which is their DNS service; that's for paysmart.co.in, and all our users come in there. We've got load balancers for SMS, and load balancers for the web, and the Android application is now consuming the web API as well. We used to have a separate line for Android; all those lines have now been merged. This part handles the SMSes that come in from our aggregators: the aggregators send us what's effectively an HTTP POST, and we parse that POST and split it up into all its pieces (the syntax is: top-up, dot, password, dot, product name, dot, amount, dot, phone number), so we have to split that into its parts. That's what the SMS servers do, and then they funnel it into the queue. And this piece uses SSD-backed storage, so it's a lot faster and more reliable; that's what we're moving to right now as part of our development process. SQS is the queuing system: we put the data in here and add an element to the queue for it. And then these are all the services we run for all the third-party vendors, and this part scales horizontally. Each of these services is separate, so there's one server set for Videocon, one server set for Tata Sky, and so on and so forth. And that's not just one auto scaling group: it's actually 32 separate auto scaling groups for this piece, and it scales all the time with usage. So, as you can see, we're loosely coupling all the services.
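The queue hop in the middle of that diagram is simple in code. A minimal sketch of the producer and worker sides with boto3 and SQS; the queue name and message fields are hypothetical:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = sqs.get_queue_url(QueueName="topup-requests")["QueueUrl"]  # hypothetical

# Producer (SMS server): parse the aggregator's POST and enqueue the parts.
def enqueue_topup(password: str, product: str, amount: str, phone: str) -> None:
    body = {"password": password, "product": product,
            "amount": amount, "phone": phone}
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(body))

# Worker (per-vendor service): long-poll, process, then delete on success.
def work_once() -> None:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        topup = json.loads(msg["Body"])
        print("forwarding to vendor:", topup["product"], topup["amount"])
        # Only acknowledge after processing, so a crash re-delivers the item.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```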
That's our relational database system: we have one master server, with replication across it, and we've got at this point six separate read slaves, all of them variously distributed, some of them small, some of them large, depending on the workload. Any questions about this?

[Audience question about using NoSQL alongside RDS.] So, separately, there's the NoSQL side and there's RDS, and we're using both. We have to use both, because our transaction engine needs consistency: we're dealing with money, so I have to cut the money and make sure the money is cut before I can pass the transaction on. What we're using SimpleDB and SQS for is to deal with these vendor services. Every time there's a transaction, every time there's a top-up that has to happen, once I've figured out from RDS that you have enough money and that you're allowed to access the product, I write the data to SimpleDB and I write an item to the queue. Then these services pick up the item from the queue, say "okay, this is the key that I need," read the data from SimpleDB, process the entire transaction, and write it back to a queue. It's actually a different queue that it goes back to (like I said, this is super simplified): it goes to a different queue, the response queue, and writes to a different SimpleDB domain with the data that came back from the vendor, and then that gets processed once again and goes out via SMS or our web servers.

[Audience question: but your SMS traffic doesn't come through the main domain?] No. These guys still use PaySmart, but it's not paysmart.co.in. We still use Route 53 for DNS, but those URLs are set up inside our aggregators: the aggregators have a specific endpoint defined for them to come and hit on our SMS servers, and each aggregator has a different URL, so we can keep them separate. Any other questions?

[Audience question about the region.] We're actually in the EU region, because when we started, Singapore wasn't available. That was five years ago.

[Audience question: what's the reason for combining the Android and web servers in the diagram?] Android was basically consuming the API over the web, and what we realized was that there was no good reason to have that code base be separate, because every time we made a modification to the Android API, we also had to make the same modification to the web API. So we ended up combining the APIs, and we unified the Android and web applications. The web application now is a full HTML5, socket-based thing: we're using Node.js for publishing data, and we're using Redis to push notifications out to it. So you can do stuff like start a top-up, then do another, and another, and another, and they all start queuing up on the side, and then the socket sends data back as each one finishes. Once the queue has processed it and the data comes back, one more publication occurs, and that publication pushes it to the socket corresponding to that request, and that's the top-up done. That took us a while. Anything else?

[Audience question: how many servers do you have?] If you have questions, please ask them into the mic; this is being live streamed, so people need to be able to hear your questions as well. How many servers do we have? Right now the number of reserved instances is about 30, but depending on the scale level the numbers obviously go up and down, so I actually don't know how many servers are up right now. Depending on the workload it might be more, it might be less.

[Audience question: are you using a CDN for that?]
Yes, we're using Amazon for that as well; we're using CloudFront, actually.

[Audience question about CDN edge locations and performance within India.] We only use Amazon, and Amazon has two edge locations in India: one is in Delhi, and the other might be in Chennai or in Bangalore, I'm not sure. So there are two locations in India, nothing for western India or Maharashtra, but it's pretty fast; we haven't really faced a problem in terms of speed. Our page loads tend to be a lot quicker now, because we're using the CDN exclusively, even for serving the application itself. What we did was build a static HTML application with JavaScript files, and all the dynamic content is pulled over JavaScript. Our web application is no longer being served from a web server at all: it's just a web app that gets delivered as HTML, and then JavaScript does all the rest.

Okay, so, just a few numbers. These are the transactions that we perform: on a good day we do about 500 transactions a second at peak load, and that's being managed with the server capacity that we have, about $6,000 worth of hardware a month, and we have about 30% headroom. We leave our servers running at about 60 to 70% capacity, to deal with the eventual spike that might occur, and auto scaling only triggers when we pass that 66% threshold. So we basically keep 33% headroom on the servers before triggering our auto scaling.

Some other numbers are interesting too. To a large extent we don't have downtime at all, because of the way we built it out of all these different parts. We might have a sort of downtime for one service because a vendor is not responding (like, Dish TV decided to reboot their server, and suddenly you can't get a Dish TV top-up to go through), but that's outside our control; we try to deal with it gracefully at the client level. By and large our servers don't go down, and to a large extent that's hopefully going to stay the case. We are making a whole bunch of changes: we're starting to add more and more Node.js code, because we find Node.js is very, very good at keeping lots and lots of concurrent connections open without eating up and destroying your CPU, so we're moving some of the services, specifically the stuff that's talking to vendors, to Node.js. Beyond that, some of the stuff we're trying to do: we're trying to use MapReduce, some Hadoop stuff, to mine all this data that we're collecting. That's on our roadmap. Sure, thank you.

[Audience question about testing strategy, staging versus production infrastructure, and how deployments are managed across application and infrastructure.] So for us, testing was a case of two different, very important things. One was the actual testing that the application didn't fail. The second, more important thing was performance, which was a lot harder for us to test.
Because, like I said, when a change occurred in one particular place (they added a new piece of code to the dashboard), suddenly our cost per unit of work went up by 20%. Stuff still gets through the gate, stuff that should never have been in production ends up in production, because we have a very disparate system in terms of load: 90% of the load is actually not from our system, it's coming from an external party, from how long it takes a particular vendor to respond. So it's very difficult for us to load test to the extent that we'd have to; I can't foresee where the failure will occur. What we've done instead is split the system up into pieces, so that when a failure occurs, it's isolated to only that part. That was the major learning for us as far as testing is concerned. Obviously we have automated testing, we have testing baked into each of the modules we're building, and then we've got three people whose job it is to break the application. They're testing on all the various platforms, and that's the manual side: every time a feature has to go out, it goes through the usual checklists and things.

Our deployment strategy is that everything has to be up on the staging server and fully tested and certified by everybody prior to moving into production, and the act of moving it from staging to production is completely automated. We basically flip a switch: we say, okay, this is the version number from git that needs to go onto the production servers, and we initiate rolling restarts if required. That's all scripted; it's shell scripting on top of the AWS command line, and I'll sketch the idea below. AWS lets us control the instances themselves through the command line, so I can drop an instance from the load balancer and add it back from the command line. So I can build that into my script: drop the instance, restart it, and when it boots up it sends a ping back, okay, now add it back to the load balancer and start the restart on the next one. There's a script that handles that rolling restart all the way across.

[Audience question about monitoring.] Yes, so we have our own custom monitoring tool that we've built, and we use that on top of CloudWatch metrics, which is something Amazon provides. Amazon's metrics give us data at the VM level; they're not technically allowed to look higher than that, so they can't see what's happening inside my application, they can't see what's happening inside my RAM. In order to actually figure out what was happening in our application, we ended up building a suite of tools ourselves, and that's something we're considering packaging as a product, because it's gotten to the point where it's pretty advanced. So maybe I'll be back here next year to show you the product. I think that's it; yeah, that's all I've got.

[Audience question about how much ongoing operational effort the system takes.] Sure. Well, it's kind of hard to say at this point, because like I said, we run two companies: one is the software development company and one is the cloud consultancy. The company that owns PaySmart is called EPRS, and they were our first customers at Cloud Cover, and a lot of the systems and things we developed were built around this installation. Today I would say that 90% of it is automated. There's one guy who has a dashboard open and watches what's happening on the servers. In the event of something going horribly wrong, obviously it's all hands on deck, but routinely, day to day, very little of our resource, maybe 5%, goes towards the upkeep and maintenance of the system. But it's gotten to that point after many years; it's not something where we woke up one morning and it was that easy. It got to that point because we had to do all that stuff together.
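Going back to the deployment answer for a second, here is a minimal sketch of that rolling restart, written with boto3 against a classic Elastic Load Balancer rather than our actual shell scripts; the load balancer name is a placeholder:

```python
import time
import boto3

elb = boto3.client("elb")
ec2 = boto3.client("ec2")
LB_NAME = "paysmart-web-lb"  # hypothetical load balancer name

def wait_until_in_service(instance_id: str) -> None:
    # Poll the load balancer's own health check until the box serves again.
    while True:
        states = elb.describe_instance_health(
            LoadBalancerName=LB_NAME,
            Instances=[{"InstanceId": instance_id}],
        )["InstanceStates"]
        if states[0]["State"] == "InService":
            return
        time.sleep(10)

def rolling_restart(instance_ids: list) -> None:
    for iid in instance_ids:
        # Drop the instance so the balancer stops sending it traffic...
        elb.deregister_instances_from_load_balancer(
            LoadBalancerName=LB_NAME, Instances=[{"InstanceId": iid}])
        # ...restart it (a boot script would pull the new git version)...
        ec2.reboot_instances(InstanceIds=[iid])
        # ...re-register it, and only move on once it's back in rotation.
        elb.register_instances_with_load_balancer(
            LoadBalancerName=LB_NAME, Instances=[{"InstanceId": iid}])
        wait_until_in_service(iid)
```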
[Audience question: what stack did you use?] Sure, what do you mean? So we used Node.js for the application itself, we used Redis for publication, and we used plain HTML, CSS and JavaScript for the client; the dashboards use D3 graphs, that kind of stuff.

[Audience question: are you using any configuration management?] We're using Chef. We tried Puppet; we settled on Chef after a raging internal debate. I'm still partial to Puppet, actually, but I lost that argument. We also tried a couple of other tools; we tried the stuff that Amazon provides as well, they have a couple of DevOps management tools, so we tried those too. At the end of the day we felt we needed more customization, we needed to do more stuff, so we ended up rolling our own setup; there's actually an instance running that just takes care of the Chef requirements.

Okay, anything else? Thank you very much, guys.