Hey, so I guess I'll get started while people are still pouring in. I'm going to talk about scaling at Errorception. I'll start by saying a little about me first. I'm the founder and the only developer at Errorception. It's a huge team of one person. Thanks, wow. And I've been a JavaScript developer for half the life of JavaScript, as Vishal already pointed out; it's been 17 years or so, and for about half of JavaScript's life I've been a developer. It's a little shameful, because when I started we used to use layers and stuff like that. Good old days. Anyway.

So what is Errorception? I don't want to make this a product pitch, but I'll quickly run through what it is. It's a little snippet of JavaScript that you insert into your page, and it catches errors that happen in the browser while your app is running at your user's end. It's sort of like Google Analytics: you drop it in and it catches errors for you. There are over a thousand active projects, with roughly a million errors caught per month on average. So that's what Errorception is, and that's my pitch; I'm not going to talk any more about how awesome it is.

I've structured this talk as a story: the experiences of a primarily client-side developer who started doing server-side JavaScript with Node, and how that worked out. Along the way I'll talk about how I scaled it to this level. My first startup had failed. I spent about five or six months on it, and I failed. Well, I guess it's good that it failed early; better than failing after two years, right?
So I decided to launch the second one, which happened to be Errorception, except I was not in a great state of mind, having lost all that money. So I set myself a timeline of 15 days: I decided to launch Errorception, literally from the time I bought the domain to the time it went live, in 15 days. I had no real back-end experience before this; I had never done anything serious on the server side. So obviously the code was total crap. There were absolutely no tests, and there was no thought put towards scaling. And I thought that's fine, right? Because when I launch, nobody's going to find out. Who cares about another website that's up somewhere? So it doesn't matter.

It was one single monolithic Node.js app, and it basically talked to MongoDB, and that was it. That was the architecture. It was a very simple setup, and it sort of worked. Can't complain; it did its job. It was running off a single, really cheap VPS hosted somewhere in the US, with only 512 MB of RAM. I'm a cheap guy. And this worked. I had the option of upgrading, but I wasn't sure I was going to succeed this time with my startup; it could have failed again, right? So why invest heavily in hardware when you don't even know if you're going to succeed yet?

When I launched, I submitted it to Hacker News, and I was number one on Hacker News for a couple of hours. And there's nothing that can go wrong with that, right? Node actually withstood the load from Hacker News pretty easily, and that's because most people from Hacker News just come to your home page and then talk crap about how crappy your product is. That's the Hacker News crowd. So that didn't really test my application, because it was just serving static pages. It all looked fine. Let's see.
I hope this works. So everything looked good, except when it was on Hacker News people started noticing it, and a couple of days later there was this big Russian website. Obviously I'm not Russian, and I don't know much about Russian websites, but this website was huge in Russia, and I did not know that. I'm not going to name them, for obvious reasons. The guy behind it happened to become a good friend later; he even came down to India and we caught up. But he decided to launch Errorception on his site, and I did not realize how massive it was going to be. It was a couple of hundred requests per second, and they were posting errors at the rate of something like 10 errors per second.

Now, these numbers are not bad; this should look all right, right? Well, it was all right for nginx and for Linux, which were shouldering it for me. But Node did not handle it well, even at such a low load. And it wasn't Node's fault, it was my fault: my code was crappy.

That brings me to the first lesson if you're ever programming with Node and you want to build a system that scales: do not do anything that takes up CPU in your code. It's the biggest mistake you can make. I could go over the reasons, but I'm pretty sure you're already familiar with them: you're clogging up the event loop, and nothing else can happen at that time. So if you've got hot code paths, paths that are executed frequently, you generally don't want even small loops there, because that will totally mess up how you work with the event loop.
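To make that concrete, here's a minimal sketch (my own illustration, not Errorception's code) of the difference between grinding through a batch of work synchronously and yielding back to the event loop between small chunks with `setImmediate`. All the function names here are assumptions.

```javascript
// Synchronous version: blocks the event loop for the whole batch.
// While this runs, no I/O callbacks can fire.
function processAllSync(items, work) {
  for (const item of items) work(item);
}

// Chunked version: does a small slice of work, then uses setImmediate
// to yield, so pending I/O callbacks can run between slices.
function processInChunks(items, work, chunkSize = 100) {
  return new Promise((resolve) => {
    let i = 0;
    function step() {
      const end = Math.min(i + chunkSize, items.length);
      for (; i < end; i++) work(items[i]);
      if (i < items.length) setImmediate(step); // yield to the event loop
      else resolve(i); // resolve with the number of items processed
    }
    step();
  });
}
```

In a truly hot path you'd ideally eliminate the loop altogether, as the talk goes on to describe; chunking like this is just the fallback when some iteration is unavoidable.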
I used to have a hot code path for deduplication of errors. When an error comes into Errorception, it's compared against errors that are already recorded, and if there are duplicates they're marked as duplicates of each other. As is obvious, this requires some sort of iteration: you have to go over the errors that are already recorded, figure out whether there are duplicates among them, and mark them as such. And that was happening for every error that got posted. If you're doing that ten times a second inside your Node app, you're screwed.

So obviously my first step was to rewrite all the hot code paths. I had to reduce, or eliminate where possible, all the loops that were there. One of my mentors taught me something here. He told me I was too old-school in the way I used the database, in the sense that I tried to avoid talking to it because that goes over the network, which is considered expensive, and so on. He said: forget about that. Be chatty with the database if you have to, rather than keeping data in memory and looping over it there, which you think will be faster. It's just better to go to the database and ask for the data again; there's no harm in that. That was a revelation to me, and it turned out to be very good advice.

I also had to rethink the logic for finding duplicates, how I identify duplicates and so on, so that it's not an iterative process but some sort of lookup instead. That helped me scale Node on the same machine without having to give it more infrastructure. And that looked fine, until a couple of months later, when a big advertising company, which happens to be an Indian advertising company, and happens to also be
my previous employer, so that was a good thing, decided to use Errorception. I was excited: they signed up for the most expensive plan. That makes me some amount of money, so I decided to go blow it at the bar, because that's what you do when you get some money, right? Everything was looking good. I was making plans, calling up my friends, like, dude, did you hear about this, we need to celebrate. And I was ready to do that when suddenly I started getting alerts saying that something was wrong with the server. And I was like, oh dude, this is screwed now.

What happened is that they were supplying ads to Yahoo, and the ads went on Yahoo's home page. So all of Yahoo's home page traffic started coming to my server on the VPS. By that time I'd done a little upgrade, so my server was one machine with 768 MB of RAM. All of Yahoo's home page traffic hitting that machine: not a good thing to happen.

I had read this from one of the Instagram folks, who said that scaling is like replacing all the components of a car while you're driving it at 100 miles per hour. And it's true. You can't take downtime; you have to scale instantly, and nobody should notice what's going on. So the first thing I want to impress upon you is that scaling is very unsexy. It's not glamorous. It's not like we get to talk about this beautiful stack we've built, this, that and the other. The smallest problems in scaling are actually very, very stupid problems. The first one is raising the number of sockets that you can handle on your OS.
That's a big win. Second is making sure that you're using the right number of worker processes in nginx and things like that. That helps a lot: tuning how the load gets balanced across all your processes. And then there was the most unsexy fix of all: I just called them up. I told them, dude, you're screwing with my server, can you please get me off Yahoo? I don't want to be on Yahoo right now. Give me 15 minutes and I'll sort it out. Fortunately I know them, and they responded. The moment they removed it from Yahoo, the server recovered, in about 15 minutes.

Then the first thing I did was go to Amazon, figure out how to use their CDN, and get the script served from a CDN. And when I did that, I wrote a blog post marketing it as being all geo-optimized, like, now it's cached locally and shit like that. The fact of the matter was that my server could not handle the load; that's why I had to do it. But marketing always wins, right?

And I had had enough by this time. By this point, when I hung out with friends at bars, I had to take my laptop with me just in case anything went wrong. I had to sit down, tether my iPhone to it, connect to the internet, and monitor my servers, making sure everything was fine while my friends were getting drunk.
So it's not the best setup to be in, and I decided it was time to be a sort of architecture astronaut, or whatever, and figure out how to actually make this thing scale. That's the crux of this talk. I want to impress upon you one idea of scaling that has worked very, very well for me, and the reason I'm telling you this is that I can't imagine why other people aren't doing it more. It's just so simple; everybody should be doing it.

A quick rundown of the problems I wanted to solve. First, I had absolutely no confidence in my code. The code quality totally sucked, and I was making changes and basically praying they would work. That's not good. I had very little monitoring; I had built the thing in 15 days, and you can't build all that stuff in that time. And I'm a huge fan of keeping things simple, so I wanted to remove complexity from the app rather than add it, as much as possible. Just because I wanted to scale up the architecture did not mean I was going to make it far more complex. In fact, if by scaling up the architecture I could make it simpler, that would be preferable.

The last problem was deployment, which I'll talk about in some detail. When you deploy a new Node app, it requires a restart; your server has to take a little bit of downtime to do that restart, and downtimes are bad. We know that. Sometimes the downtime is longer, say when you're doing a database migration from one schema to another. That can be a time-consuming operation, and your app is down for all of it, because you don't want to be hammering writes into your database, or reading from it, while the migration is running.
So deployment is tricky. Obviously, when the site is down I'm not collecting any errors, and collecting errors is my business value proposition: if I'm not collecting errors, my site is useless. And an app that's down just looks bad. But on the other hand, you don't want to minimize the number of deployments you do; you want to actively deploy as frequently as possible. So how do you deal with this?

By the way, I recommend you watch this guy: Rich Hickey, the creator of Clojure. Honestly, I haven't looked at Clojure in much detail; I'm not on the Lisp side of things. But this guy is a great philosopher of how to do computing. Watch his talk called Simple Made Easy, where he impresses upon you the idea of having simplicity in your architecture. He puts his point across so elegantly, it's amazing to listen to him. He was a huge influence on my design going forward.

So this was the most important thing: rather than having one monolithic app, I decided to break the app up into several small components that talk to each other. Each piece does just one thing and does it well. This is very close to the Unix philosophy, where you've got one tool that does only one thing and does it well. So I started moving closer to that, rather than having a large monolithic application. Each of these independent pieces is deployed independently and versioned independently, so they scale independently. If there's a problem, the problem isn't with the entire app; it's with one small app that does one small thing. That's a much easier problem to solve than trying to figure out how to scale the entire app at once.
So that's amazing. Obviously, you then have to figure out how the pieces will talk to each other, so there's some sort of message-passing system between them. I happened to chance upon Redis, which is just a brilliant way to sit as a buffer between you and the database. Redis is pretty much de facto in almost every Node stack, simply because it's so elegant to work with. I strongly recommend you have a look at Redis if you're doing anything on the server. It's great for storing temporary data, and it's got amazing primitives for dealing with atomic operations. It's awesome.

And then there are queues. Queues are just a fantastic idea: you take a task and keep it in a queue somewhere for someone else to pick up and do later. When you've got an app split into multiple pieces, one app can generate some output and hand it over to another app by dumping it into a queue, and now someone else pops it off the queue and does something with it. And I'll explain why this is brilliant; it's the last point right here. If you kill one of the applications in the stack, the queue starts building up, because that app is no longer popping items off it. But the site does not look down. The site still looks like it's up, because errors are still going into the queue. So you can take parts of your app offline, restart them or service them however you want, and then bring them back online to start flushing the queue. It's so elegant, it's amazing. You can bring down parts of your site without having to bring down the entire site.

So Errorception isn't just one site; obviously, it isn't just one app.
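Before getting into the individual pieces, the queue behaviour just described (kill a consumer, the queue builds up, nothing looks down) can be sketched like this. A real deployment would use a Redis list with LPUSH on one side and RPOP/BRPOP on the other; to keep this runnable on its own I'm using a tiny in-memory stand-in with the same push/pop shape, and every name here is mine.

```javascript
// A minimal in-memory stand-in for a Redis list. With the real thing,
// push would be LPUSH and pop would be RPOP/BRPOP on the same key.
class FakeQueue {
  constructor() { this.items = []; }
  push(task) { this.items.push(JSON.stringify(task)); }
  pop() {
    const raw = this.items.shift();
    return raw === undefined ? null : JSON.parse(raw);
  }
  get length() { return this.items.length; }
}

// Producer: the error collector just serializes and enqueues. If the
// consumer is down, items simply accumulate; nothing is lost or refused.
function enqueueError(queue, err) {
  queue.push({ receivedAt: Date.now(), ...err });
}

// Consumer: a worker drains whatever has built up, e.g. after a restart.
function drain(queue, handle) {
  let n = 0;
  for (let task = queue.pop(); task !== null; task = queue.pop()) {
    handle(task);
    n++;
  }
  return n; // number of tasks processed
}
```

The point of the pattern is exactly what the talk describes: while the consumer is down the producer keeps accepting work, and restarting the consumer just means flushing the backlog.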
It has got the UI server, which is written in Express, for those of you familiar with Node; that deals with everything to do with HTTP for the website. Then I've got a super-lightweight, pure Node.js server that actually collects the errors. It's just 90 lines of code, very, very simple. All it does is take the error as it comes in and dump it into Redis. It uses cluster to create multiple instances of itself, so that it's not a bottleneck, and it dumps the errors into a Redis queue. Then there are multiple stages the error passes through, going from app to app and queue to queue, before finally getting written to the database. You can think of those stages as map operations, and the final write to Mongo as the reduce operation. So it's like map and reduce happening, but using Redis and Mongo as the storage and communication layers.

Obviously, once you have multiple apps, and multiple instances of the same app, you start running into consistency problems. Let's say one app has figured out that some data doesn't exist yet, as in the case of deduplication. There's an error, and this error doesn't exist in the database so far, so I conclude it's a new error and I'm going to write it. Meanwhile, another process has run the same query and also concluded it's a new error. Now there are two processes that both think it's a new error, and both try to write it. That gets you into an inconsistent state in the database, and you have to figure out how to solve that.

So there is this project that I created, which is up on GitHub.
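That project provides a locking primitive built on Redis's SETNX, as described next. Below is a simplified, self-contained sketch of the idea, using an in-memory stand-in for the two Redis commands involved. The expiry-timestamp trick follows the algorithm documented on the Redis website, but all the names and details here are mine; the real implementation lives in the project itself.

```javascript
// In-memory stand-in for the Redis commands the algorithm needs:
// SETNX (set if not exists), GET, and DEL.
class FakeStore {
  constructor() { this.data = new Map(); }
  setnx(key, value) {
    if (this.data.has(key)) return false;
    this.data.set(key, value);
    return true;
  }
  get(key) { return this.data.get(key); }
  del(key) { this.data.delete(key); }
}

// Try to take the lock. The stored value is the time at which the lock
// expires, so a holder that died without releasing doesn't block
// everyone else forever.
function acquireLock(store, key, ttlMs, now = Date.now()) {
  if (store.setnx(key, now + ttlMs)) return true;
  const expiresAt = store.get(key);
  if (expiresAt !== undefined && expiresAt < now) {
    // Stale lock: previous holder is presumed dead, so steal it.
    // (The algorithm on the Redis site uses GETSET here to avoid a
    // race between two processes stealing at the same moment.)
    store.del(key);
    return store.setnx(key, now + ttlMs);
  }
  return false; // someone else holds a live lock
}

function releaseLock(store, key) {
  store.del(key);
}
```

With this in place, two processes that both think an error is new would race for the lock on that error's key, and only the winner writes; the loser re-reads and finds the record already there.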
You can feel free to use it; it'll only help all of us. It gives you a locking primitive using Redis, so you can obtain a lock on something. For example, if you're going to write something to the database and you want to ensure nobody else writes to it at the same time, you can say you want a lock on that entry, and the other process waits until the lock is released before it writes. It's pretty interesting.

There's one small problem that remains. I haven't solved it yet, and my current solution is sort of ugly, but let's talk about it: shared logic. For example, when writing an error, I want to check rate limiting, among other things. But on the front end, on the UI, I also want to show the user if a rate-limiting error has happened. So the logic for rate limiting has to be common, and I have to figure out a way to share that logic across these applications. How do I do that? It turns out there were a lot of complicated approaches. I tried a service-oriented style, where one component talks to multiple things, and several other solutions, and all of them failed in very interesting ways; we can talk about that at length. In the end it just caused a lot of errors and a lot of downtime. Restarts were happening so frequently that I had to write a script to do restarts for me. It had gotten to that point. There was literally a cron running that was restarting my processes.
It was just bad. So how do you share code across processes? How does Node do it, was my first question. Turns out the answer is very simple: node modules. Code sitting in node_modules is the way to share code, and it's very elegant. So what I did was create a models folder containing all my app logic, and then I symlinked it into the node_modules folder of each of my other applications. It's very easy, and now the code is literally shared across multiple applications, beautifully. It works, but it has drawbacks, and we can talk about those later.

So, in the end, this is my stack at Errorception. It uses Express for the website. There's a pure Node server for catching errors. Redis is used everywhere. There's Mongoose layered on top of MongoDB to talk to Mongo. I use forever as the process manager. There are 24 Node processes in production, as opposed to just one when I launched; now there are 24 processes harmoniously working towards catching errors. And it's still working off one machine. It's far more beefed up than it was before, and there is a failover machine as well, but there's still one primary machine and everything runs off of it.

Very quickly, towards the end: I've got some ideas about how this can be improved, and I'll probably start talking about them in the community soon. You could follow me on GitHub or somewhere, and I'll be starting to explore these ideas. In fact, I'm hoping to work with some people who might have experience in this, so if you have any ideas, please join in and we can have a discussion. And that's all I have. Thanks. Was I good on time? Yeah, any questions?
Yeah, first of all, thank you for giving a great talk; we learned quite a lot from your experience. A couple of questions. One: how is your Redis lock different from using some primitive like SETNX?

So the Redis lock uses the locking mechanism that is described using SETNX on the Redis website. It's a rather elaborate process for how to store keys and how to remove and delete them and so on, and my project is an implementation of exactly that algorithm.

Can I ask... go ahead, go ahead. One thing I would be curious to know is: did you face any problems scaling Mongo? Yours seems to be a write-heavy app, and I've read a lot of stuff saying Mongo doesn't handle writes well.

Yeah, no, MongoDB is web scale. No, I haven't faced any trouble with Mongo at all; Mongo's been smooth sailing. In all honesty it's a little hard to figure out how to do indexes and things like that, in terms of figuring out how the queries will run, but maybe that's because I'm just an amateur with it. Once I got that right, Mongo has not been any trouble at all. It has been very smooth. Were you saying something?

I have a question. I love Errorception; I love the concept of having a central place for all your errors. Thanks. But web applications aren't the only things we're using nowadays. We have things like Cordova, right, PhoneGap, or Titanium, or offline web apps using Web SQL, IndexedDB and whatnot. These apps can go offline and online, and Errorception might not be available; in fact, nothing on the internet might be available, and there might be errors happening at that time. Do you have anything in place, or do you plan to build something, to catch these errors, buffer them up, and send them when you go online?
No, I don't. And you're the web storage guy, that's why you're asking me this. It's an important question, because a lot of applications are going to be offline eventually; Chrome OS is picking up, and many web apps are using web storage more frequently now. It would be nice to have a buffering mechanism.

Right. If you're online, I'm guessing you at least push errors to an array somewhere? Or do you send them directly?

No, it sits in an array. It gets buffered for some time and then sent up.

Okay, so it's batched up.

Yeah, it is batched up. But it doesn't persist across sessions; if you close the window and come back, it does not persist across sessions. So yeah, you're right, it probably makes sense to start thinking about storage. You're definitely right.

Hey, do you use Errorception yourself?

Of course, yeah. The app uses Errorception.

So how long did it take for you to actually start using it yourself?

Well, to be absolutely honest, since I had built it in 15 days, there was actually no client-side JavaScript when I launched it, so it did not make sense for me to use Errorception for the longest time. But after some time I got JavaScript into the UI to make it all Ajax-y, because, you know, we're all web 2.0. Are we still web 2.0? No, we're not web 2.0 anymore. But when I started adding JavaScript, and again this is a pitch, within minutes I found my first error, and I was able to fix it and deploy code within five minutes. The feeling of finding errors that were otherwise impossible to find is just awesome.

Hey, this is Warren from Hyderabad. In your last slide you mentioned that you're using forever for process watching, right? Right.
So I recently tried out something called naught.js. Have you ever done a comparison with forever?

What was that again?

naught.js, n-a-u-g-h-t. It does almost the same thing, but it also has an option for deployment: it doesn't kill your existing process, so there's zero downtime, you can still serve your existing requests, and whenever the new process comes up it switches over to the new process.

Oh, that's interesting. That's very similar to what I was talking about. I'll have a look at that; I haven't heard of it before. It looks very interesting. Thanks a lot for that. Thank you.

Hi, you spoke briefly about talking to the DB versus keeping the data in memory, and you said talking to the DB was not too bad an idea. Can you explain what DB operations you were doing where talking to the DB was better than keeping things in memory?

What I was doing was keeping a collection of the errors that happened most frequently on a website in memory, in the hope that I could speed things up and not talk to the database: if I have it in memory, I can just iterate over it and get the data out for the most frequently occurring errors, so I don't have to go to the DB as often. That was the idea. It turns out that was a very bad idea, and that's where I got stuck.

So basically you get the most frequent errors and keep them in memory, or do you do it on the fly?

On the fly.

Okay, thanks. No problem. Yeah, there's one question here.

As I understand it, there are two parts to your application: one is collecting errors, and the other is the UI to view them. What I wanted to ask is: is Node a good option for the other part, the application where I view the data?

Well, for someone like me it's great, because I know only JavaScript.
So, you know, it's awesome. But if it's just a simple CRUD app, then maybe Node is not necessarily particularly great at that, and I think everyone in the Node community would agree with that as well. Is that it? I have a minute more, so you can fire questions if you want.

Hi, so you talked about scaling Node, but you haven't talked about the data storage side, the database side. What about scaling that?

I have not had any trouble with that. For the sake of redundancy I'm running a Mongo cluster, which is essentially just one more server that everything gets replicated to, but that's only to make sure I don't lose data. I have not had scaling trouble with my data store at all.

Could you tell us how much I/O per second that is?

I don't know the latest numbers, sadly, but it's very, very frequent; I'm talking to the database like mad. It must be to the tune of a couple of hundred queries every two or three seconds. So it's not crazy, but it's not low either; it's pretty high. Thank you. No problem.

Thanks a lot, guys.
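Coming back to the buffering question from the Q&A: the batching described (errors sit in an array, get buffered for some time, then get sent up as one request) might look something like the sketch below. This is purely illustrative, not Errorception's actual tracking snippet, and persisting the buffer to localStorage would be the natural way to address the cross-session gap discussed.

```javascript
// A minimal client-side error buffer: errors accumulate in an array
// and are flushed as one batch, rather than one request per error.
// `send` is whatever actually transmits the batch (e.g. a beacon).
function makeBuffer(send) {
  const pending = [];
  return {
    record(err) { pending.push(err); },
    // Flush whatever has accumulated; returns how many were sent.
    // A real snippet would call this on a timer and on page unload,
    // and could also persist `pending` (say, to localStorage) so the
    // buffer survives across sessions.
    flush() {
      if (pending.length === 0) return 0;
      const batch = pending.splice(0, pending.length);
      send(batch);
      return batch.length;
    },
  };
}
```

The splice both empties the buffer and hands the batch to `send` atomically, so errors recorded during transmission simply wait for the next flush.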