Hi everyone, my name is David González and I'm going to talk to you about fighting COVID-19 with serverless JavaScript. First of all, some introductions. I work as a DevOps architect for a company called NearForm. By the way, we are hiring: if you are looking for a job, just send me an email and I'll be more than happy to introduce you to the pertinent people. I'm also a part-time lecturer at CCT College in Dublin, where I lecture mainly in cloud, DevOps and software development in general. I am what my wife calls a technology geek, because I like technology so much. A few years back I was nominated as a Google Developer Expert in GCP, and I'm also a member of the Security Working Group of the Node.js Foundation. I wrote two books, one called Developing Microservices with Node.js and another called Implementing Modern DevOps, and they are there for you to check out.

Today we are going to talk about the problem of COVID in particular. Think of the situation where you get on a bus and spend half an hour travelling to your destination, but it turns out that the next day you test positive for COVID-19. How do you warn the people who were with you on the bus, who are close contacts, that you have COVID-19 and that, since they have been in contact with you, they should get tested? You don't know them, so that's a very tricky problem. At the beginning of the pandemic we didn't know much about the disease; we didn't know how close or for how long you had to be in contact with a person, but the reality is that the virus was spreading very quickly.

There's another interesting point about COVID-19: everybody talks about herd immunity at 70%. Once 70% of the people are immune to the virus, in theory we should have what scientists call herd immunity, which means there are not enough susceptible individuals left to keep transmitting the virus, so the pandemic slows down and eventually disappears. But that's not the whole story. The reality is that the herd immunity threshold depends on the transmissibility of the virus, in particular the R of that formula that we all learned in college or in high school and we all hated, which describes how easily the virus spreads across people. That's why, when the pandemic started, pretty much every country in the world dictated severe lockdowns where people could not get close to each other, locking people into small clusters. The idea is that the R number goes down, ideally below one, and the herd immunity threshold stays below that 70% until we have a solution for the virus.

So one way to avoid this transmission is by isolating close contacts, as I said before. If you are only seeing your family, the reality is that you will only transmit the virus to your family, and the only possible vector is somebody in that cluster going out shopping or contacting other people, so we reduce the ability of the virus to spread. What is a close contact? Over time it was determined to be 15 minutes at closer than two metres, and that is what Google and Apple implemented in the contact tracing API in Android and iOS. At NearForm we built a contact tracer, initially for the government of Ireland, for the HSE, the Health Service Executive here in Ireland, which is where I live, by the way. And the obvious choice for us was using AWS. We know AWS; we use AWS in pretty much every project.
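(As a quick aside on that herd-immunity threshold: it comes from a simple formula, and the sketch below is only for illustration. The R0 values are assumptions picked to make the arithmetic obvious, not measured figures from the project.)

```javascript
// Classic approximation: herd immunity threshold = 1 - 1/R0,
// where R0 is the basic reproduction number (how many people one infected
// person infects, on average, in a fully susceptible population).
function herdImmunityThreshold(r0) {
  return 1 - 1 / r0;
}

console.log(herdImmunityThreshold(3.3).toFixed(2)); // ~0.70, the "70%" figure
console.log(herdImmunityThreshold(1.4).toFixed(2)); // ~0.29, once lockdowns push R down
```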
But then there was another, not so obvious, choice, which was a good call: using serverless. In that case, we used a lot of JavaScript. And when I say a lot of JavaScript, I mean that pretty much everything (well, not absolutely everything) in the application is written in JavaScript, from the application itself, built in React Native, to the backend API built in Node.js. The Lambdas were also built in Node.js. Pretty much every single service was JavaScript-based, and we deployed them on Lambdas and ECS Fargate.

For those who are not very familiar with serverless: in general, when people talk about serverless, they refer to Lambdas. The reality is that there is a much wider catalogue of services than Lambdas in AWS, Google Cloud or Azure that will greatly help you deploy and evolve your applications. One of them is ECS Fargate. If you are familiar with orchestration, like Kubernetes or ECS itself, you need to have a cluster of VMs which will run your containers. On the Fargate version of ECS or EKS, you really don't need to manage that cluster, because it's managed by AWS; you just need to worry about how much memory and how many CPUs you need for your tasks, and AWS will do the rest. The only component in our infrastructure that was not serverless was RDS. The reason for that is that when we started the project (which was built very quickly, by the way, in something like three weeks, which was very impressive by the person who started it) we did not want to take the risk of using a serverless database. Serverless RDS was not super mature and we preferred to stay in the comfort zone of our database skills.

The API Gateway here is marked in bold, and there is a joke (I would say a funny joke, but it's not very funny) about the API Gateway: it is key for pretty much every distributed infrastructure nowadays. I've seen many customers, mainly building Node.js applications, fail miserably because they don't have an API gateway strategy. We made heavy usage of the API Gateway, because every single call into the system hits the API Gateway, and then, with JWT and distributed authentication, we could redirect the user from an ECS Fargate method in a Fastify server into a Lambda that does something else, or even into S3. Then we also used the usual suspects, SQS, SNS, VPC, the typical building blocks of AWS that everybody would use.

Also, serverless has another problem, and this is a spoiler alert: "Lambdas are bad…", with three dots, and I will explain more about that in a second. Lambdas are a very dangerous ally. Node.js and Lambdas are the perfect combination in AWS because of the speed at which you can develop a system with some components based on Lambdas. I mean, you can be very productive modelling a system with Lambdas, but if you model your full system using Lambdas, it's what I call the 2021 version of "let's use stored procedures". You just don't do that. Don't do that. Usually, in my opinion, you should combine Lambdas with some other components like ECS Fargate or Knative; lately Google, for example, has released something in GCP called GKE Autopilot, which allows you to run a Kubernetes cluster without managing any servers. I know you also have EKS Fargate, but the issue with EKS Fargate is that it comes with compromises; for example, you cannot use DaemonSets and things like that.
In my opinion, if you ask me, Knative is going to be the future. Knative is a serverless implementation on top of Kubernetes which gives you all the building blocks of a serverless architecture running nicely inside Kubernetes, and together with the cluster autoscaler and a few other touches here and there, you can end up with a very interesting combination of serverless primitives.

Debugging in a distributed, message-based system is the first circle of hell. I can tell you, because the picture on the right is a picture of me after chasing a bug for about 12 hours in a row. We all know how hard it can be to follow the flow of a JavaScript application using promises, or callbacks, or async/await. I hope everybody is using async/await; there's just no reason for anybody to go back to callbacks right now, and if you are using async/await you are still using promises, because it's just syntactic sugar on top of them. If you are trying to follow the execution of a request across a number of services, Lambdas and cloud-native services like the API Gateway, good luck with that, because you are going to need a few hours of jumping in and out. So my advice is: make your system as simple and as straightforward as possible. I read a very interesting phrase on LinkedIn which basically said that architecture in software is there to make your life easier; everything else is just religion. That should be a mantra in serverless, especially working with JavaScript. I love JavaScript; it's my go-to language for pretty much 99% of the work I do nowadays. But the reality is that the JavaScript flow is somewhat hard to follow, because our brains are used to procedural languages, and with JavaScript that's not the case. Even with async/await you can make it look procedural, but it's really not, and you need to take that into consideration. Now imagine combining that with a very complex flow of services passing requests to other services, or Lambdas passing requests to other Lambdas. It can get really, really problematic.

And this is the continuation of the three dots: "Lambdas are bad… in the majority of cases." But there are other cases where Lambdas fit perfectly. For example, as an async task runner. Let's imagine (well, you don't need to imagine, I will show you the code in a second) that you have a Lambda which needs to run every night to send statistics to the central statistics office, or to recalculate how many people have been exposed to COVID, things like that. Lambdas are the perfect place for that, because you can delegate the scheduling to your infrastructure, and that makes your life so much easier: you don't need to worry about an application being up at that time, you don't need to worry about retry mechanisms and things like that. It's just convenient. They're also the perfect JWT validator in the API Gateway. One of the reasons the API Gateway is in the architecture of pretty much every microservices system is that you are probably using distributed authentication with JWT, and you really want to do that validation and verification at the edge, because otherwise you need to do it in every service, and that can get very problematic.
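To make that concrete, here is a minimal, hypothetical sketch of what a JWT-validating Lambda authorizer for API Gateway can look like in Node.js. This is not the actual COVID Green code; the secret handling, claim names and library choice are assumptions for illustration.

```javascript
// Hypothetical sketch of a JWT-validating Lambda authorizer (not the real project code).
const jwt = require('jsonwebtoken');

// Reading the secret from an environment variable avoids a Secrets Manager call
// on every single request.
const SECRET = process.env.JWT_SECRET;

exports.handler = async (event) => {
  // For a TOKEN authorizer, API Gateway passes the Authorization header value here.
  const token = (event.authorizationToken || '').replace(/^Bearer /, '');

  try {
    const claims = jwt.verify(token, SECRET);
    // Allow the call through; downstream services can trust that the token
    // was already checked at the edge.
    return buildPolicy(claims.sub || 'device', 'Allow', event.methodArn);
  } catch (err) {
    return buildPolicy('anonymous', 'Deny', event.methodArn);
  }
};

function buildPolicy(principalId, effect, resource) {
  return {
    principalId,
    policyDocument: {
      Version: '2012-10-17',
      Statement: [{ Action: 'execute-api:Invoke', Effect: effect, Resource: resource }],
    },
  };
}
```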
So one of the things we did as a design principle is that every single request that needs to be authenticated is authenticated in the API Gateway, calling a Node.js Lambda which validates the JWT token. Once the token is validated, you can pass the request on to the required service with the confidence that the token is valid and that the person, the phone or the system calling the service is a legitimate one and has been validated.

Another very good point about Lambdas is that the scalability is just out of this world, so much so that you can easily choke other services. I'll give you a very interesting situation here: we needed the secret for the JWT validation, and that secret was stored in Secrets Manager. At peaks, in Ireland and in other tenants, we had over 4,000 requests per second going into Secrets Manager, and if I recall correctly, the hard limit for Secrets Manager per account per second was set to 2,000 or 4,000. I don't remember exactly, but our number of requests was above it at peaks. We found out because we were getting errors we couldn't explain, until we moved the secret from Secrets Manager into an environment variable in the JWT validator Lambda. I know it's not the perfect solution, but we had to make a compromise there in order to make sure that, first, we wouldn't kill Secrets Manager, and second, we would save a significant amount of money on calls into Secrets Manager.

Lambdas are also very, very cheap, unless they spiral out of control. If you don't have a good mechanism to check on your Lambdas and keep them at bay, you can end up with a seriously hefty bill from AWS or from any other cloud provider. But if you have them under control, you can be paying less than 100 US dollars per month to run a significant amount of code and workloads that help you deliver your system with a very high level of stability. And then latency. Many people come to me saying, "oh, but you know, Lambdas introduce latency", and I tell them: latency, what latency? You can work around that. You can build your Node.js Lambda using Middy or any other middleware, do the expensive initialisation once, and then the cold start only happens the first time; every subsequent call is immediate. There aren't going to be cold starts on every request, and the Lambda will scale up or down depending on load. Basically, that latency concern is, I think, a thing of the past. Of course, there are some situations where the cold start will play against you, but the reality is that it has gotten good and it doesn't worry me anymore; the cold start is very fast.
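A minimal sketch of that "initialise once" idea, tying it back to the Secrets Manager story: instead of (or before resorting to) an environment variable, you can fetch the secret once per warm container and reuse it. The names and the use of the bundled AWS SDK v2 here are assumptions for illustration, not the project's actual code.

```javascript
// Hypothetical sketch: cache expensive initialisation across warm invocations.
const { SecretsManager } = require('aws-sdk'); // v2 SDK, bundled in the Node.js Lambda runtime
const secretsManager = new SecretsManager();

let cachedSecret; // module-level state survives while the container stays warm

async function getJwtSecret() {
  if (!cachedSecret) {
    // Only cold starts (the first call per container) pay this cost,
    // which also keeps the request rate against Secrets Manager low.
    const res = await secretsManager
      .getSecretValue({ SecretId: process.env.JWT_SECRET_ID })
      .promise();
    cachedSecret = res.SecretString;
  }
  return cachedSecret;
}

exports.handler = async () => {
  const secret = await getJwtSecret();
  // ... use `secret` to verify the incoming JWT, as in the authorizer sketch above ...
};
```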
One of the patterns that we heavily used in the COVID applications is what we call the exposures Lambda, a Node.js Lambda which gets the list of all the random IDs that were exposed, meaning the people who got tested and got a positive result: you are infected with COVID. That person gets an SMS from the health provider and uploads their random ID to the server, and then all the phones in the tenant download the file with the random IDs. There's no way to track who the infected person actually is, but your phone downloads the full list and then matches your close contacts against the random IDs on that list. If there is a match, it sends you a notification saying, hey, you've been exposed to somebody with COVID, please go and get tested.

In order to serve these files, what we usually do is pre-calculate them in a Lambda; that's an async task, every five minutes the Lambda fires off and creates a file with the latest random IDs exposed from the database, and then we serve them from the API Gateway straight from S3. The majority of the calls that come back into our servers are phones asking "give me the list of devices that have tested positive", and we serve that straight from S3. In the literal words of AWS, S3 has billions of concurrent requests available for you, so you won't be able to exhaust the bandwidth of S3. That's what they call a winner winner chicken dinner, because out of the 20, 30 or 40 million requests a day, about 15 million or something like that are phones looking for the list of exposed devices, and it's all managed by AWS. We don't even execute any code ourselves; it's just API Gateway and S3. Well, the only code we execute is the JWT validator, which is a very small Node.js Lambda, but everything else is served straight from S3.

How does it look? We have this as open source, so I will show you the code and the repositories in a second. What we do is keep the philosophy of a "build one instance and deploy it everywhere" approach, most of the time; okay, we have some tenants which have some special requirements. And thank you, Terraform, for making us work: Terraform is amazing, but there are some corners where Terraform plays against you. The fact that if you want to do an if/else (if this, create the resource, otherwise don't create it) you have to convert a simple resource into an array of resources is just mind-blowing. And the fact that there was no easy way to just include or exclude modules until Terraform 0.13 is another interesting one. But all in all, Terraform is a powerful ally; just be careful with some corners.

The way we build it is: we build the gold standard, which is the COVID Green infrastructure repository, and then we use that as a module in every single tenant, injecting the configuration that we want plus some customisation that we need for each tenant. That way we can release a new version of COVID Green, have it used by every single tenant, and roll it out to the nine tenants times two environments (that means 18 environments) pretty much immediately. We were very mindful to have a repeatable process, and aside from being repeatable, we wanted it to be as simple as possible. Complexity is evil number one in every system; if you can deploy with one click and roll back with one click, you're almost safe from disaster. So yeah, that's our approach.

The applications themselves are React Native. At NearForm we breathe Node.js inside out; I think we contribute something like 40% of the code of Node.js through people working at NearForm. So it made sense to use React Native, because we know JavaScript. I say Node.js, but the reality is it's just JavaScript, and we know React very well, so having the application written in React Native was the obvious choice. Also, the COVID Green initiative is a Linux Foundation Public Health project, so the project is part of the Linux Foundation.
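Before we jump into the real repositories, here is a rough, hypothetical sketch of the shape of that scheduled "exposures export" job: a Lambda that runs on a timer, pulls the latest random IDs from the database and writes a file to S3 for API Gateway to serve. The table name, bucket name and file format are made up for illustration; the real implementation lives in the Lambdas repository shown later.

```javascript
// Hypothetical sketch of a scheduled export Lambda (triggered e.g. by an EventBridge rule).
const { S3 } = require('aws-sdk');
const { Client } = require('pg');

const s3 = new S3();

exports.handler = async () => {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();

  try {
    // Pull the random IDs uploaded in the relevant window (names are illustrative).
    const { rows } = await db.query(
      "SELECT exposure_key, created_at FROM exposures WHERE created_at > now() - interval '14 days'"
    );

    // Write the pre-calculated file that phones will download via API Gateway + S3.
    await s3
      .putObject({
        Bucket: process.env.EXPORT_BUCKET,
        Key: 'exposures/latest.json',
        Body: JSON.stringify({ generatedAt: new Date().toISOString(), exposures: rows }),
        ContentType: 'application/json',
      })
      .promise();
  } finally {
    await db.end();
  }
};
```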
And what I'm going to do now is show you the code, not as a demo or something like that, but on GitHub. If you go to COVID Green, you will see a number of repositories. The top one is the COVID Green infrastructure; it's at the top not because it is used the most, but because it was updated most recently. That's where our infrastructure lives. You can come in (this is public, by the way, and there are no restrictions) and you can see how it is being worked on at the moment, and this is where you have all the infrastructure we have built. You also have some level of documentation, which is not always super accurate, but in general it describes the system very well and allows you to poke around and see what we have built. Obviously it's not perfect, but I'm of the opinion that perfect is the enemy of done, so I preferred it to be working rather than running around seeking perfection while there was a global pandemic and we needed this out very quickly and very efficiently. If I can take one second of your attention, I'd like to say I'm very grateful to all the people contributing to this project, because the team was amazing and it would not have been possible without any single one of them. So that's my shout-out to them.

We also have other interesting repositories, for example the backend API. This backend API is the server side that we built at NearForm to support the server-side loads. It follows the standard layout that we use at NearForm. We use Fastify. Fastify allows us to have plugins, and what we do is configure our plugins: for Postgres, for verifying the devices (that's the device verification to prevent tampered devices), some JWT code here; and it also allows us to inject the routes as plugins. So whenever you go into a route, for example metrics, you will find three files (well, four files in this case, including the test). The first one is the index, which is where the handler is. It's a Fastify handler which becomes a plugin. Then we do schema validation; the schema validation comes from a file called schema, and this schema defines how your request should look. It is super important, every time you accept data from your users, always, without any exception, to validate it. Otherwise you might end up with a security issue, or create a vulnerability caused by the lack of proper input validation, which might end up in a command injection, cross-site scripting, or God knows what. So be very careful with not validating data. Then we also have the query, which is how we fetch data from the database; and it's not even fetching the data, it's just building the query. We use a library called SQL, which we built in-house at NearForm, which prevents SQL injection by using prepared statements: even though it looks like you are just concatenating strings, the reality is that this SQL library takes all those parameters and creates a prepared statement with them, so it's safe to use it this way. I would actually recommend you go and check it out, because it's one of those interesting ideas. If you go to any other route, you will see always, or pretty much always, the same structure.
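To give a feel for that layout, here is a minimal, hypothetical version of such a route: one plugin file with the handler, a schema and a query built with a SQL tagged template. The route name, table and field names are invented; the real routes live in the backend API repository shown above.

```javascript
// schema.js: how the request and response should look (illustrative shape)
const schema = {
  body: {
    type: 'object',
    required: ['event', 'payload'],
    properties: {
      event: { type: 'string' },
      payload: { type: 'object' },
    },
  },
  response: {
    204: { type: 'null' },
  },
};

// query.js: just builds the query; the SQL tag turns it into a prepared statement
const SQL = require('@nearform/sql');
const insertMetric = (event, payload) =>
  SQL`INSERT INTO metrics (event, payload) VALUES (${event}, ${JSON.stringify(payload)})`;

// index.js: the Fastify plugin with the handler
module.exports = async function metricsRoute(fastify) {
  fastify.post('/metrics', { schema }, async (request, reply) => {
    const { event, payload } = request.body;
    await fastify.pg.query(insertMetric(event, payload)); // fastify-postgres style decorator
    reply.code(204);
  });
};
```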
So: index, schema, query and test, and it's the same everywhere. What does that mean? It means that somebody ramping up on this application does not take long, because they basically just need to copy and paste a route, imitate what's there, remove what he or she doesn't need, and then start coding. That becomes very helpful.

Then the last repository I want to talk to you about is the Lambdas. As I said before, our Lambdas are created to support the main workload, which is the API. If you remember, I was talking about one Lambda which was packing the close contacts into a file and putting it in an S3 bucket, which is then served to the clients via API Gateway and S3. This is the Lambda that does that: the exposures Lambda. Again, it's all open source, so you can go there and check whatever you want, but this does that work and gives us the required file, which is then served via the API Gateway so the phones get the data they are looking for. That's the perfect usage for a Lambda. Then we use other patterns here. For example, you don't see an SMS Lambda here, because every single tenant is using a different SMS provider. So what we do is send an SQS message, basically a message into a queue, and then the Lambda is created per tenant. That means Ireland has one provider, which I believe in this case might be Twilio; the UK regions have another provider, a proprietary one they built for SMS notifications; and so on. The reality is that the message is the same for every single tenant, but how you deliver that message is different, so that gets decoupled using a Lambda, and it's working fairly well. (I'll sketch that pattern a bit further down.)

All the automation to deploy this is done in two phases. The first one is deploying the infrastructure, and we use GitHub Actions for that. GitHub Actions is fantastic because it's very close to your code. It only has one, I will not call it issue, I will call it shortcoming, which is that there is no manual approval step in the middle between one job and another. That means that if I want to do a plan in Terraform and then an apply of that plan, it becomes very hard, because I don't have that intermediary "hold on" step before the apply. So what we've done instead is we have one job which is plan, and we review the plan, and then another job which is plan and apply. That means you see the plan first, and then it creates the plan again and applies it, which applies the changes to the infrastructure.

We make a clear separation between infrastructure and code. That means we deploy the Lambdas as part of the infrastructure (you can see that here on every single Lambda), but if you go to, for example, the exposures Lambda, which we have already talked about, what we deploy is a placeholder; we don't deploy the final code. Why? For one simple reason: in this project we have separated the deployments, and pretty much everybody involved in the project can do deployments at any stage. Not that we are a big team; at the biggest moment we were like five people, but we have a very high level of communication and we also don't do deployments alone. That allows us to deploy the infrastructure and then deploy the Lambdas without having to redeploy the infrastructure, and the same with ECS Fargate. So in that case, the Lambdas are deployed with one GitHub Actions pipeline, whereas the infrastructure goes through another pipeline; that way we kind of segregate the blast radius if something goes wrong.
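As promised, a rough, hypothetical sketch of that SMS decoupling: the API publishes a provider-agnostic message to SQS, and each tenant deploys its own consumer Lambda that knows how to talk to its SMS provider. The queue name, message shape and the Twilio call are illustrative assumptions, not the project's actual code.

```javascript
// Producer side (e.g. in the backend API): the message is the same for every tenant.
const { SQS } = require('aws-sdk');
const sqs = new SQS();

async function queueSms(phoneNumber, body) {
  await sqs
    .sendMessage({
      QueueUrl: process.env.SMS_QUEUE_URL,
      MessageBody: JSON.stringify({ phoneNumber, body }),
    })
    .promise();
}

// Consumer side: a per-tenant Lambda subscribed to the queue.
// This tenant happens to use Twilio; another tenant would swap in its own provider here.
const twilio = require('twilio')(process.env.TWILIO_SID, process.env.TWILIO_TOKEN);

exports.handler = async (event) => {
  for (const record of event.Records) {
    const { phoneNumber, body } = JSON.parse(record.body);
    await twilio.messages.create({
      to: phoneNumber,
      from: process.env.TWILIO_FROM,
      body,
    });
  }
};
```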
Now, some people will be like, oh, you should deploy Lambdas with Terraform as well, but we thought this was a good idea and we actually believe it's still one of the best decisions we could have taken. So, back to the presentation.

We have learned a few lessons along the way. The first one is that the Terraform ecosystem is a mess. If you are using Terraform third-party modules, the return on investment is not massive, and you are getting yourself into a corner. I have an example here, which is RDS. There is one module that creates an RDS cluster, and what happened is that at a given version that module called the cluster "default", and then one version later the cluster was called "primary" and "secondary"; I guess because of some change in the way RDS deploys or similar. What does that mean? It means that we had to recreate the cluster, or move the state elements across one by one, for nine tenants times two environments per tenant, so 18 environments. Very stressful, not a very desirable thing to do. The only solution we had there (well, not the only one, but the most, let's say, sane one) was to fork the RDS module and patch it to keep the old nomenclature. So it's a bit unfortunate. In the same way that I would not recommend anybody to create a Node.js API without using Fastify or Hapi or any other HTTP library, because it would be nonsense, I would actually recommend and encourage everybody to write your infrastructure in Terraform without using third-party modules, unless it's very clear that the third-party module serves one purpose and doesn't go off into the wild, because it can get you into a really, really bad corner.

Missing the async on one of the Fastify handlers can destroy the throughput of your cloud-native system. At some point we had containers restarting, and it turns out we had missed the async on one of the Fastify handlers. What happened was that, of course, when an error is thrown and doesn't resolve into a rejected promise that Fastify can handle, it escapes and kills your JavaScript application, kills your Node.js application, so the container has to restart. That was happening very often, and your throughput goes through the floor. So be very careful with that. I wish there was an easy way to detect non-async handlers and issue a warning or something like that, but there isn't much out there yet.

Monitoring is harder than you think. There is a sweet spot between "the whole system is on fire but nobody cares", because you basically don't have alerts, and "can you stop alerting me on every request", because you have too many alerts, like one alert every 20 seconds. And that space where your alerting is perfect is incredibly narrow; it's almost a non-existent state. It's like when you do yoga or something like that, this purity and clarity of mind: it's not about the destination, it's about the trip. Monitoring is the same. It is not "build it once, forget it, and it will always work", because it won't. You should always have two monitoring levels: infrastructure and app monitoring. For example, what happens if your Node event loop starts having delays? You won't catch that with infrastructure monitoring.
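As an illustration of what that app-level monitoring can look like, Node.js ships a perf_hooks API for measuring event-loop delay. Below is a minimal sketch; the threshold and the idea of pushing the numbers to CloudWatch as a custom metric are assumptions, not the project's actual setup.

```javascript
// Minimal sketch: watch the event loop delay and report it periodically.
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
histogram.enable();

setInterval(() => {
  const meanMs = histogram.mean / 1e6;          // histogram values are in nanoseconds
  const p99Ms = histogram.percentile(99) / 1e6;

  // In a real setup this could be pushed to CloudWatch as a custom metric
  // or exposed on a /metrics endpoint; here we just log it.
  console.log(JSON.stringify({ eventLoopDelay: { meanMs, p99Ms } }));

  if (p99Ms > 100) {
    console.warn('Event loop is lagging; something is blocking the main thread');
  }

  histogram.reset();
}, 30000).unref(); // don't keep the process alive just for monitoring
```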
But if your database is down, your application is going to fail, and you probably want to monitor that on the infrastructure side. So we are very heavy on infrastructure monitoring here and quite light on app monitoring, but we absolutely capture every single log line, for at least 30 days, for every single line of code executed in our system. So if something is wrong, we can catch it very quickly and act very quickly. The other day I actually amazed myself, because I managed to go from "I am looking into something else" to "oh, there's an alert, let me look into that" in about 15 seconds, because we've gotten so good with CloudWatch that it's almost like an extension of your hands. So yeah, that's one of my recommendations.

Vertical slices of the system are better than big systems. That is in the domain of system design: should you be sharding your system, should you be clustering your system? My advice is, if you can slice it vertically into countries, or regions, or user groups, it's going to make your life a lot easier. If I tell you we are processing 85 million requests a day, you would be like, oh my God, that's a lot (and well, actually it's more than 85, it's like a few hundred, but yeah). But if I tell you we are processing five million requests a day, 18 times over, it's not super impressive, because five million requests a day is pretty much manageable with two servers. So that is something I'm grateful for: the approach wasn't global, because otherwise having to manage one big system is actually much harder than managing a smaller system, especially if you are using infrastructure as code and repeatable infrastructure; the overhead is minimal. So yeah, I would recommend you to slice your system vertically, and be the master of your trade and an apprentice of the rest of them. That's DevOps: you need to focus on one area, like me, for example, infrastructure, but also be able to read and understand Node.js and sometimes write it, because you cannot isolate yourself if you want your team to succeed.

And that's about it. Thank you very much. This is my email, that's my Twitter handle, and my LinkedIn is there as well, of course. Feel free to connect with me. As this is being recorded, I am not sure if there are going to be questions, but if you have any, feel absolutely free to contact me over email, Twitter or LinkedIn and I'll be more than happy to answer all of them. Thank you very much and see you soon.