Welcome all. Here we have Eric telling us about building a reasonably popular website for the first time. Give him a clap. Thanks. First of all, I couldn't get the screen configuration exactly right, so I'll do this without my notes; please excuse me if it goes wrong. So, I'm going to talk about building a reasonably popular web application for the first time, because I was lucky enough to get to architect, build and design something that grew quite quickly, and I got to learn to deal with scale way quicker than I would have expected. I learned a lot during this time, and I'd like to share it, so hopefully you can at least skip making the mistakes we made and make your own unique ones instead. So, who am I? Why am I here speaking? I'm the co-founder and chief architect at a company called Hotjar. Hotjar, both the name of the company and of our product, is a set of web analytics and feedback tools. Basically, this means a lot of data ingestion. We are installed on almost 200,000 sites in the world right now, so there's a lot of data coming in; I'll give you some numbers later. My development career started a long, long time ago: at the age of six I wrote my first game. In retrospect it probably wasn't that awesome, but I thought it was. So I got hooked on programming and have been ever since. I transitioned between different tech stacks throughout the years, but started with Python about seven years ago now, and it's the one I definitely like the most so far. Since I'm going to talk to you about something reasonably popular, reasonably big, it's only fair that I give a definition of what I think that means. So, Hotjar right now: we process around 400,000 API requests every minute, our CDN delivers about 10 terabytes of data to our users every day, and we have roughly three terabytes of data in our primary data store.
That's Postgres. There's another two terabytes in our Elasticsearch cluster, and somewhere between 35 and 40 terabytes on Amazon S3. So, that's our definition of reasonably popular, reasonably big for today. We still use reasonably standard solutions, though. Our tech stack isn't anything out of the ordinary, as you can see here: nginx, Memcached, uWSGI, Python, Elasticsearch, Lua, Postgres, and Redis. It works amazingly well to just run a load of uWSGI workers, even at this scale, believe it or not. At some point we will of course start looking at all the fancy asyncio and uvloop things; they're probably going to be a great match for us, but for now, very plain process-based uWSGI scales really well. So, now that you have some context, let me start with what we learned during the last two years or so. First: log and monitor from day one. This is something we messed up a bit, because we only started collecting and aggregating logs once we started having problems. At that point, though, we had so much log data coming in that we had to spend quite a lot of time cleaning things up before we could actually see through the noise. So, start logging and aggregating your logs from day one, and keep your logs clean. Act on the problems you see. Otherwise you're going to have a mess to clean up when you actually need the logs; unmanaged logs are a kind of debt as well. Next: have a way to profile your API calls. We ended up using SQLAlchemy as our ORM. It's great, and I love it about 95% of the time. But every now and then you have this little, innocent line of Python code that causes some really weird query, and having a way to profile both code and database queries is great. We have the concept that our super users, ourselves only, can append ?profile=1 to any API call in the query string.
Instead of returning the normal results, that makes the endpoint return cProfile data and SQLAlchemy profiling data. Having an easy way to get profile data from a live API call, in the live environment, in just a few seconds is great. It makes you profile a lot more, and you get a much better understanding of your system as a whole. So I highly recommend having a way to ad-hoc profile a call in the live environment. Sometimes it's the Python code that takes time, sometimes it's the database. But you'd be surprised how often it's actually the Python code: a little mistake in SQLAlchemy that turns out to be really heavy in processing. So it's a great thing to have. Next: know when things fail. At some point we had to add some cron jobs, I don't remember quite what for, but some background processing. And they failed at some point without us noticing, because it was a silent failure. The job exited for some unknown reason; it didn't throw an exception or anything like that, which is what we were monitoring for. It just failed silently. So it's just as important to know when things are not happening as to know when bad things are happening. We solved this by adding the simple concepts of job expectations and job results. A job expectation is something simple, like: I expect this job to run every hour. A job result is simply a log entry the job writes when it completes. Then we basically have a status endpoint, called by an external third-party service, that checks that all expectations are satisfied at all times. That way we know that jobs run, that they run on time, and that they run successfully. So, safeguard against things that fail silently, not just things that fail explicitly; it's just as important and easy to miss. And use third-party systems to monitor your own systems, because your own monitoring can fail too.
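A minimal sketch of the job-expectation/job-result idea described above; the names, the in-memory dict, and the thresholds are illustrative assumptions, not Hotjar's actual implementation:

```python
import time

# Sketch of "job expectations vs. job results". In a real setup the
# results would live in a shared store (e.g. a database table the jobs
# write to); here a plain dict stands in for it.

# Expectation: job name -> maximum allowed seconds between runs.
EXPECTATIONS = {
    "hourly_cleanup": 60 * 60,
    "daily_report": 24 * 60 * 60,
}

# Result log: job name -> unix timestamp of last successful completion.
job_results = {}

def record_result(job_name):
    """Called by a job as its very last step, once it finished successfully."""
    job_results[job_name] = time.time()

def unsatisfied_expectations(now=None):
    """Return job names whose expectation is currently not met.

    A status endpoint polled by an external service would call this and
    report unhealthy if the list is non-empty -- catching silent failures,
    because a job that never ran simply has no recent result.
    """
    now = time.time() if now is None else now
    missing = []
    for job, max_age in EXPECTATIONS.items():
        last_run = job_results.get(job)
        if last_run is None or now - last_run > max_age:
            missing.append(job)
    return sorted(missing)
```

The key point is that the check is driven by the expectations, not by the jobs themselves, so a job that dies without a trace still shows up as a missing result.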
Next: have a way to keep secrets. Hotjar, like everything else, started out as an experiment of sorts. So we weren't too diligent about, say, not keeping external API keys in source control. In hindsight, stupid of us, but there it is. Then, as the development team grew, we realized that maybe it's not the best idea that everyone has API access to all third-party systems in live environments. So I'd recommend using something like Ansible Vault, or similar, from day one. It's going to pay off. Because we didn't, when the time came to actually start keeping secrets, we had to change all the API keys, and that cost was on us. So, have a way to keep secrets from day one. This one is interesting: everything needs a limit, even if it's big. A good example is our concept of tags. You can tag a recording. We envisioned it being used for people saying, okay, in this recording the user visited the checkout page. However, some of our users used it slightly differently: they tagged each recording with unique user IDs coming from third-party systems like Google Analytics. That meant some users ended up with 400,000 distinct tags. We showed those in a nice little HTML select dropdown, and 400,000 select dropdown options do not render well. Our interface broke terribly because we didn't have limits in it. Users are very creative, and if you give them a way to put in limitless amounts of information, they will. These limits apply to the UI, to APIs, to lengths of fields, and obviously to database columns too. You can never allow unlimited. Perfectly fine to allow really big, but unlimited is bad. If you give your users a way to put unlimited amounts of data in your system, they eventually will. It took about a year, but then it happened.
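In the spirit of the tags story, a sketch of what "everything needs a limit" can look like at the API boundary; the constants and function names are invented for illustration, not taken from Hotjar's code:

```python
# Enforce explicit caps on user input before it ever reaches the
# database or the UI. Generous limits are fine; unlimited is not.

MAX_TAGS_PER_RECORDING = 1000   # really big, but not unlimited
MAX_TAG_LENGTH = 255            # would match e.g. a VARCHAR(255) column

class ValidationError(ValueError):
    """Turned into an HTTP 400 at the API layer (assumption)."""

def validate_tags(tags):
    """Validate a list of tag strings supplied by a user."""
    if len(tags) > MAX_TAGS_PER_RECORDING:
        raise ValidationError(
            f"at most {MAX_TAGS_PER_RECORDING} tags allowed, got {len(tags)}"
        )
    for tag in tags:
        if not tag:
            raise ValidationError("empty tags are not allowed")
        if len(tag) > MAX_TAG_LENGTH:
            raise ValidationError(
                f"tag longer than {MAX_TAG_LENGTH} characters"
            )
    return tags
```

With a cap like this in place, a user scripting unique IDs into tags hits a clear error long before the select dropdown has 400,000 options to render.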
Another one here is slightly more interesting, I'd say, and much more surprising: data type limits. We hit the limit of a data type at two-point-something billion rows, and it was kind of hectic trying to solve that after the fact, because I hadn't anticipated it at all. I'd never worked with data at that scale before, but it happens. So think about this when designing your schema. Try to think ahead a year or two. I know it's hard, but try. Is there a possibility I could end up reaching the limits of this data type if I use it here? If you think you're even going to be close, choose a bigger data type. It's not expensive; it's just not the default, so you have to make a conscious choice. Think about how your data will grow, and if possible, put monitoring in place for this as well. When you're, say, halfway towards a limit, you want to know, so you have time to plan a migration. Next: don't get too attached to a framework. Right now we're using Flask and Flask-RESTful. It works really well and we're super happy with it. But at four or five hundred thousand requests per minute it's starting to have significant overhead, because most of our requests are processed really quickly. So the framework matters. This of course depends on your use case, but for us it matters, so at some point we're probably going to have to transition to something else. Good advice to minimize the pain of doing that is to use framework-agnostic libraries as much as possible. SQLAlchemy is a great example, because it has adapters for basically everything, and if it doesn't, you can easily write one yourself. I don't have anything against what I like to call thin wrappers, like Flask-SQLAlchemy, because they basically don't do that much; they're just nice helpers. If you switched away from Flask, you could easily implement what Flask-SQLAlchemy does yourself. So thin wrappers are fine.
Otherwise, I try to avoid framework-specific libraries. It's kind of like vendor lock-in: framework lock-in limits your flexibility. Next: choose components which allow for language interoperability. We're definitely mainly a Python shop, but we have about half a percent, maybe one percent, of our code base in Lua, for performance reasons, running inside nginx. We made the mistake of initially using a queuing system called RQ. It's a great system, but Python-only. And this caused some issues when we basically just wanted our Lua code to put some simple things in the queue; that ended up being a much bigger thing, because it couldn't, it being a Python-only queue. So when possible, choose components, libraries, servers, whatever, that allow for language interoperability. It makes it so easy, when you have a performance-critical part, to just take it out and write it in something else. Next: plan for database downtime. In the beginning, all our database schema migrations were simple because we had basically no users and no data. It gets harder, and at some point we couldn't just run our usual ALTER TABLE statements anymore, because they started taking significant amounts of time. Fair enough, there are some clever SQL tricks you can do to alleviate some of that, but at some point you have to introduce a kind of downtime. However, there's a nice trick that helps a bit: try to decouple data ingestion from data processing as much as possible. A neat way to do it is to capture data from the user, put it in a queue, and process it later. That way you become much more resilient to database downtime, even if it's just for a minute because you need to take the database down for a little change. Having a queue as a buffer is great. It's not always possible to do this, obviously, but it's a great thing to do when you can.
Next: have a way to share settings between backend and frontend code. We introduced a couple of silly bugs, a couple of times, simply because we were lazy. We copied values between backend and frontend, then changed one of them but not the other, and the frontend and backend code didn't agree on values anymore. This is just plain silly, and there's a very simple solution. We ended up with a settings.json file which contains our shared settings. It's injected using nginx server-side includes, and that way Python can read the JSON and the frontend can read the exact same JSON as well. Super simple. All our shared settings go there, and there are no more bugs of this kind. We used to duplicate things like error codes, copy-pasted; now shared settings are not a problem anymore. Next: have a way to go into maintenance mode. What I mean by maintenance mode is basically a little page saying: we're currently down, sorry. It's not nice when you have to bring it up, but it's probably going to happen to every one of us at some point, and then it's great insurance to have it. We have a little switch to turn the maintenance page on and off. And when building the maintenance page, be careful to give it as few external dependencies as possible, because you'll probably want to turn it on when, say, your database server has crashed. So don't store the switch that turns it on in the database, because the database has already crashed; our first version did just that. Also, on our maintenance page we've put in a communication tool where people can talk with our support crew. It's a really good idea, I think, to keep communication open with users even when bad things happen. Next: feature flags are a great way to test things out before releasing them to everyone. At this point in time we had started getting really big, and we didn't want to release things we weren't too sure about to everyone.
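The settings.json approach described a moment ago could look roughly like this on the Python side; the file name comes from the talk, but the keys and the helper name are assumed for illustration:

```python
import json
from pathlib import Path

def load_shared_settings(path):
    """Backend side: read the shared settings straight from the JSON file.

    The frontend receives the very same file, e.g. injected into the page
    by nginx server-side includes along the lines of
        <script>var SETTINGS = <!--# include file="settings.json" -->;</script>
    so both sides always agree on the values and nothing is copy-pasted.
    """
    return json.loads(Path(path).read_text())
```

One source of truth; change a value once and both backend and frontend pick it up.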
So, we introduced feature flags. We have both server-side and client-side feature flags. Basically, we say: this part of the UI requires this feature, and this part of the API requires this feature. That way we can do gradual rollouts, we can do beta testing with a limited group of people, and we can also enable things depending on which plan the user is on, saying: if you're on the pro plan, you get this feature. So it's a very versatile tool to have once you start thinking in terms of on-and-off feature switches. I highly recommend it; it's very simple to implement and a great thing in your toolbox. Next: accept different quality of code for different parts of the system. This was personally kind of a hard one for me, because as a developer you get attached to what you've created, and you want it to be super awesome everywhere. But it can't be, because then you run out of time. For example, we require all our user-facing code to be properly tested, to perform well, all these things. However, imagine you have a back-office report for internal use. It's okay if it performs so-so; if it takes five seconds to generate, that's fine. Think about these things up front before starting to build a new feature. How good does my documentation need to be here? How well does it need to perform? How well does it need to be tested? In an ideal world, everything would be perfectly documented, perfectly tested, and would perform awesomely. But when you need to prioritize, thinking about it up front helps a lot. And these are basically the most noteworthy things we've learned. Not unique things, but surprising things, I'd say, most of them. I'm sure we still have many new things to learn, but this is it for now. Thank you for listening. Who has any questions? Why was SQLAlchemy chosen compared to Django, for example? What were the reasons, and how do you feel about going with Flask so far? Okay, yep.
You mean why SQLAlchemy instead of the Django ORM, you said? Django. Okay, yep. Well, on SQLAlchemy: we actually started out with a different ORM called Peewee. But for some of our very performance-critical things we needed more; we didn't want to go write raw SQL, which is why we used an ORM in the first place. And we felt that SQLAlchemy allows you to drop down to a mid-level and still write really complex queries. I can put it like this: for 90% of the products you'll ever build, the Django ORM is awesome. But when you really need to do these weird performance optimizations and use very Postgres-specific features and such, I found SQLAlchemy a bit better. We could have done it with the Django ORM, absolutely. However, we had already decided on Flask because of simple benchmarking: Flask is quite a lot faster than Django, even if you strip out all the middleware and whatnot. So we didn't really have a natural tie into Django, if you get what I mean, and then SQLAlchemy was a good choice. I still think it is. Thanks. Any more questions? Could you give a bit more detail on the implementation of the maintenance mode page? We're having to do that ourselves currently. Absolutely, it's a very simple thing. Thirty-second background on how our deployments work: we push things to a bucket, and the servers pull it and update themselves. So entering maintenance mode is basically running the deployment job through Jenkins, but ticking the maintenance mode checkbox instead. Jenkins deploys, the servers pick it up; this takes about 20 to 30 seconds. What it does, during our build pipeline executing on Jenkins, is flip conditionals we have in the nginx configs. And it's basically as simple as: if maintenance mode, show this page, a static HTML file. Any more?
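The nginx conditional described in this answer might look roughly like the following; the paths and the flag-file mechanism are assumptions (the talk's pipeline templates the config via Jenkins instead), but the shape of the "if maintenance mode, serve a static page" rule is the same:

```nginx
server {
    listen 80;

    # When the flag file exists, every request gets the static
    # maintenance page. The page has no external dependencies, so it
    # still works when the database or application servers are down.
    if (-f /etc/nginx/maintenance.on) {
        return 503;
    }

    error_page 503 @maintenance;
    location @maintenance {
        root /var/www/maintenance;
        rewrite ^ /index.html break;
    }

    location / {
        # normal proxying to the application servers goes here;
        # "app_backend" is a placeholder upstream
        proxy_pass http://app_backend;
    }
}
```

Toggling the flag file (or redeploying the templated config) is the whole switch, which is exactly why it keeps working when the database is the thing that crashed.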
If I were to add anything to these excellent guidelines, of course there are endless such guidelines, but what has proven useful, especially for our company, would be writing utilities for testing the server. Just small clients, because you can write unit tests, but unit tests use prepared environments, which aren't very production-like. So if you can quickly run your clients against production and see what fails, that is also good. And I think making everything deployable with tools like Puppet, so you can easily boot a new server and have it build very fast. It also ties in with virtualization. So that is very useful. Cool. About the profiling: do you use anything else, other than this ability to do live profiling? We do a lot of things. I just picked this one because I haven't seen it that much before, but we're heavy users of New Relic, and we use pg_stat_statements in Postgres. It's an awesome thing: a very small extension that adds extremely little overhead, less than 1% in most cases. It basically generalizes queries, so independent of query parameters it groups queries for you, and it gives you mean execution time, standard deviation, stuff like that. So if you really want to find slow queries, pg_stat_statements; for day-to-day monitoring, New Relic. And that's basically it for performance monitoring. How do you limit the profiling to staff users only? That's simple. You have to be logged in to the system as a normal user, but then we have a little super-user flag for certain users that we set in the database, and a Python decorator called requires_super_user. So only we are allowed to use it. Any more? Awesome. Enjoy your lunch. Thanks for coming. Thank you.