Today we are fortunate to be witnessing not one but two once-in-a-generation revolutions. The first is the computing revolution: we are moving away from traditional PCs, hardware, and data centers into the cloud. The second is the e-commerce revolution that is happening right here, right now, in Southeast Asia. My name is Surya, and today I am going to talk about how the Redmart infrastructure has evolved as we live through this time. Almost every day we see new technologies and new tools coming up, each claiming to do certain things more effectively or more simply than the others. How do we actually live through that? In this talk we are going to look at the kinds of challenges any organization has to face once it says, "I want to do DevOps and I want to scale my engineering team."

First, let's look at how our infrastructure has changed from a monolithic application architecture into a microservice architecture. Like many e-commerce startups, the Redmart MVP was built on top of Magento. The business grew, customers took up the value proposition Redmart offered, and based on that traction and a better understanding of what it takes to run a successful online grocery business, we saw huge potential. But those are problems that can only be solved when you have a strong tech team. Our first version of the API was still built around Magento, which was, by the way, a monolithic application, but by then the direction was set: microservice architecture was the way to go, and we never looked back. By the end of 2015 we were close to 100% on microservices, and as of today we have hundreds of services running on our infrastructure.

The arrival of AWS brought in a new computing era, and we were fortunate to be among the early adopters. We started off with the classic EC2 model. At that time we had to contend with the fact that most of our servers sat in the public domain and were easily publicly accessible. That was until AWS introduced the concept of the VPC. The migration didn't happen overnight; we slowly moved whatever was in the public space into the private space, and today most of our servers are inside the VPC.

This is how our tech stack looks. It is quite heterogeneous. We are more inclined towards open source than commercial solutions. We love proven technologies, but we keep an open mind when it comes to new and emerging ones. More importantly, we believe there are always multiple ways of looking at and solving a problem. What this means is that a one-size-fits-all approach is never going to give us the most optimal solution. On the AWS side we rely heavily on EC2 instances, but we use some of their other services as well, like Kinesis, Redis, and even the latest serverless Lambda functions. On the development front we are largely a Java and Scala shop. We use both SQL and NoSQL for persistent storage, and we use Redis as the in-memory cache.

With such a diverse stack, one of the first key challenges you have to deal with is this: how do you easily create a feature test environment so that developers can collaborate across teams?
This diagram shows a very typical scenario, pretty much what we had about a year back. There are multiple developers working on a common feature, and that feature happens to touch a number of services. Without a proper test environment set up for them, you end up with something like this: developer one runs a set of services on his own local machine, so whatever is there is mostly locally configured. Developer two does the same thing with another set of services on his own machine. In some cases they might have access to certain servers provisioned for them, and they might run any service on those machines, on arbitrary ports. What ends up happening is that a developer finishes and says "it works on my machine", but what about the rest? What happens to the other teams, like QA, and even the end users? They will be confused: what exactly got tested? When was it tested? How was it tested? Those are the kinds of problems people will face.

So what are the options? We basically have two. The first option, and it's a bit dark on this slide, is what I refer to as the brute-force approach: take whatever test environment we already have and replicate it to create a new feature test environment. But first of all that is costly, and it takes time, because you have to replicate the entire environment. The second option requires a bit more creativity: take the test environment we already have and, based on the requirements of the new feature, create a delta. The two together form a new test environment. We chose the second approach, and this is how our test setup looks.

On the vertical axis there are four layers. It starts with the front end, which is typically the page you see when you go to redmart.com. One layer down we have the public API gateway; we use NGINX to power it. What this API gateway does is take any incoming traffic and route it to the appropriate back-end services. Then we have what we call our poor man's version of a service registry, built on HAProxy. Every service is assigned a unique port on this service registry, and it is accessible from all the other services through that unique port.

What is more interesting is how the feature test environment looks. Today the entire Redmart team sees it as if we have N feature test sites, but actually we are only running one. The trick happens at the API gateway layer, where we create two separate routes: one for the alpha environment and one for feature routing. The feature route is configured to be smart enough to determine whether a particular request should be routed to the feature instances or to the stable test instances, because remember, the feature environment is just a delta where only the services that had to be modified for that feature exist.
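To make that routing idea concrete, here is a minimal NGINX sketch of it. The hostnames, paths, and addresses are purely illustrative, and the actual mechanism we use to decide between feature and alpha instances is not shown here; this sketch simply uses one vhost per environment, with the feature vhost overriding only the service that changed.

```nginx
# Illustrative sketch only; names and addresses are made up.
upstream alpha_backend { server 10.0.1.10:8080; }   # stable alpha test instances
upstream feature_cart  { server 10.0.2.10:8080; }   # delta instance built for one feature

# Route 1: the stable alpha test environment.
server {
    listen 80;
    server_name alpha.test.example.com;
    location / { proxy_pass http://alpha_backend; }
}

# Route 2: a feature environment that shares everything with alpha,
# except the one service that was redeployed for this feature.
server {
    listen 80;
    server_name my-feature.test.example.com;
    location /api/cart/ { proxy_pass http://feature_cart; }   # overridden service
    location /          { proxy_pass http://alpha_backend; }  # everything else falls back to alpha
}
```

With this shape, adding another "feature site" is just another small server block plus the delta instances, which is why the whole team can see N feature sites while only one full environment actually exists.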
All of this looks good, but it is not something you can manage without a good CI/CD workflow. As engineers, we are mostly lazy, but for a good reason. What does being lazy mean? It means we always try to find the most effective way of doing things. So what does a developer have to do to get a feature environment? Only two steps. First, they create a feature branch in their repository and push it to GitHub. Pushing the feature branch triggers the build process, and at the end of that build we determine that this is a feature branch, so the artifact goes into the feature bucket. Once the pipeline has completed, the second step is to go into our ChatOps and issue a command saying, "I want a new feature environment for this service with this feature name." Under the hood there is a bot running behind the scenes. The bot first creates the EC2 instance, configures it to run that particular service, pulls the right artifact from the feature location, and deploys it. After that is completed, it configures the routing, first at the API gateway level and then at the internal service registry level. What comes out of that one command is a fully functional test site that everybody can start using.

None of this would be possible without a good release engineering pipeline. We have also built a significant amount of automation around ChatOps. One piece of it is obviously the creation of these feature instances, and we have other cool features coming up as well; one of them is creating a new service from scratch, all the way until the instance is ready.

Our release engineering pipeline has come a long way since we first started. We started with the Git Flow model, where we keep one master branch that gets deployed to the production environment, a develop branch that gets deployed to the alpha (test) environment, and many feature branches that get deployed nowhere. That was the past. Late last year we moved to GitHub Flow with proper semantic versioning. We got rid of the develop branch altogether, so we only have a single master branch. Code sitting in the master branch gets deployed to both the alpha and production environments, but by default a push to master only deploys to alpha, not production. If you want to trigger a deployment to production, there is a script that helps you do that. What does the script ask you to specify? As with semantic versioning, there are major, minor, and patch versions, and you specify which version bump you want. The script then creates an empty commit with a preformatted message containing the version bump you specified. From there, the version manager sees that this is a production release, creates a release tag, and deploys it to production.
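As a rough illustration of that production-release step, here is a minimal sketch of what such a script could look like. The flag, the commit-message format, and the exact way the version manager picks up the empty commit are modeled on the description above, not taken from the actual tooling.

```bash
#!/usr/bin/env bash
# Hypothetical release helper; the "[release] bump=..." message format is an assumption.
set -euo pipefail

BUMP="${1:?usage: release.sh <major|minor|patch>}"
case "$BUMP" in
  major|minor|patch) ;;
  *) echo "bump must be major, minor or patch" >&2; exit 1 ;;
esac

# Create an empty commit whose message carries the requested version bump.
# The CI version manager can then pick this up, create the release tag
# (for example v1.4.2 -> v1.5.0 on a minor bump) and deploy to production.
git commit --allow-empty -m "[release] bump=${BUMP}"
git push origin master
```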
This diagram makes it clearer what is actually happening underneath. The first step is for developers to commit their code, and once that is done the build process starts. In Travis, we download the code, compile it, and resolve all the dependencies from our Nexus repository. Then the various build tasks are executed, followed by code coverage analysis, and the results are published to the SonarQube dashboard. The last part is where the version manager comes in. As you can see, there are major, minor, and patch versions, so the version manager determines whether this is supposed to be a production release. If it is, it pushes a release tag to GitHub and does the necessary bump, whether that is x, y, or z. It then pushes the artifacts into the right location; we have three buckets: production, alpha, and feature. Finally it determines which environment it should be deploying to. That is exactly what is happening.

By the time you have this, you are pretty much ready to scale your team. Developers will be happy, because they can push their code to a feature branch at any time knowing it gets deployed somewhere, and they don't have to worry about breaking something and having someone scream at them that something is broken. The creation of microservices will probably accelerate, and by then the next set of challenges will be waiting for you.

With so many microservices running, one of the key issues is how you monitor your logs. How do you make each service visible? Any time you want to see the state of a particular service, how do you do that? And most importantly, for a given business transaction, how do you actually trace it? Instead of looking at a single stream of logs, developers now have to look at multiple streams coming from multiple servers. How do you make sure everything is easily traceable? Before you realize it, you have to deal with a huge mass of tangled services, and when your developers start facing issues, the euphoria comes down, because they are staring at all these black screens and multiple windows.

What you need is centralized logging. There are some companies that offer log management as a SaaS. But choosing the log stack itself is only one part of the challenge. The other part, which I would say is equally important, is how you standardize logs across different services. When you have so many people working without proper standardization, everybody logs in their own way. This is where the microservice template that Daniel will present later comes in; it was a close collaboration, and it is what allows us to achieve what we have here.

For the technology choice, we chose the ELK stack. If you haven't heard of it, it stands for Elasticsearch, Logstash, and Kibana. If you go through the literature, this is the recommended architecture for ELK: the source is the application itself; a log agent monitors the log output; the agent usually sends the logs to a parser, because the logs are usually in a raw format and have to be parsed before they can be properly indexed and stored in Elasticsearch; and we use Kibana for visualization. What most people recommend is also having a cache in between.
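For anyone who has not set this up before, the parser stage in that recommended picture typically looks something like the Logstash sketch below. The field names and grok pattern are made up, and this is not our production configuration; the point is simply that every raw, tab-separated line has to be matched against a pattern before it can be indexed, which is where the parser spends its CPU.

```conf
# Illustrative parser stage for the "recommended" ELK setup.
input {
  beats { port => 5044 }                  # the log agent ships raw lines here
}

filter {
  # Each raw, tab-separated line must be pattern-matched before indexing;
  # this per-line matching is the bulk of the parser's computation.
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:ts}\t%{LOGLEVEL:level}\t%{DATA:service}\t%{GREEDYDATA:msg}" }
  }
  date { match => ["ts", "ISO8601"] }
}

output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```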
We didn't see that recommended architecture as a silver bullet for all logging problems, though. We ran some experiments, and this is the architecture we came up with. As you can see from the diagram, the main change is that we got rid of the cache layer. To be able to do that and still keep the stack reliable and scalable, we did quite a few things, two of which are important here. First, we looked at our team structure and decided to create the clusters along it: different teams have different scaling needs, and this lets us scale each log cluster based on those needs. Second, instead of sending logs in a raw format, we send them as JSON. By raw format I mean the usual lines where the fields are separated by tabs.

This combination gives us at least four advantages. The first is that with the JSON log format we reduce the computation required in the parser; it no longer has to parse heavily, which means we can run a less powerful instance for the log parser. The second is flexibility for developers: any time they want to add certain fields to the log, they can do so without requiring any changes on the parser side. The third is multi-line events, which are especially important if you want your stack traces printed out properly. There is a plug-in available for that, but we found it to be quite unstable, whereas multi-line events are taken care of automatically once you log in JSON. And last but not least is the cost factor: we are able to run this whole stack at a fraction of what it would have cost with the recommended architecture.

Okay, so ELK is good for application logging, but what about time-series metrics? We tried a couple of options. There are a lot of sexy time-series database solutions out there, but after evaluation we decided to go back to a proven stack, because we found that some of those tools are good for a startup developing a POC, yet once you want to deploy them in full-scale production mode, they might not be ready. So this is our metrics collection stack. We borrowed the concept of team-based clustering here as well: there is a StatsD collector for each cluster, and at the back end we scale it per team too. One of the key benefits is that we immediately get to plug whatever monitoring we already have directly into Grafana and monitor all these metrics right away. The second is that it paves the way for the next phase of our release engineering, which is canary releases.

So far so good, but what about service discovery? If you have a new service or a new instance, how do you make sure it is discoverable? We tried out a few alternatives for service discovery and chose Consul over all the other options. But this is how our service discovery looked before we rolled out our Consul cluster. Here, Chef is the one that keeps track of the services and of all the servers configured to run each service, and HAProxy reads from the Chef directory and generates its config based on the data available in Chef. This is the process that goes on.
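To give a feel for the "unique port per service" registry mentioned earlier, here is a hedged HAProxy sketch; the service names, ports, and addresses are invented. A config along these lines is what gets regenerated from the data stored in Chef whenever services or instances change.

```haproxy
# Illustrative only; service names, ports and addresses are made up.
# Each service gets one well-known port on the registry host; callers reach
# "the catalog service" by connecting to that port, wherever the instances live.
listen svc_catalog
    bind *:10001
    mode http
    balance roundrobin
    server catalog-1 10.0.3.11:8080 check
    server catalog-2 10.0.3.12:8080 check

listen svc_cart
    bind *:10002
    mode http
    balance roundrobin
    server cart-1 10.0.3.21:8080 check
```

Adding an instance or a service means editing or regenerating this file and reloading HAProxy, which is exactly the manual churn described next.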
In that setup, when you create a new service, you first need to update Chef to tell it, "I am running this service, and it runs on these instances." Then you have the next step, which is to update HAProxy. And then there is the problem of horizontal scaling: if a certain service requires more capacity, you again have two steps, bootstrapping the EC2 instance and then updating HAProxy afterwards.

So this is what our Consul cluster looks like. Basically we have three nodes acting as the masters, which handle the cluster management, and we have a Consul agent running on all the servers. What you see here is that the Consul agent checks the health of the application itself, and the application in turn can ask Consul, "If I want to talk to another service, where shall I go?" The Consul agent running locally is able to give it that information.

Earlier I didn't really explain why we chose Consul, but based on what we have achieved in completing phase one, we get at least three additional things out of it. The first, of course, is service discovery itself: when you create a new service, it is immediately discoverable. The second is more granular and customizable health checks. Previously we only had two health statuses, passing and critical, and by the time something hits critical, it may already be too late. Now we also have a warning state, and we can monitor not only the service endpoint but also system-level indicators like CPU, disk, and memory utilization. Any instance that fails its health check gets removed from the cluster immediately. The third thing is the key-value store. Imagine you have hundreds of machines sharing a common config: how do you make sure that when you change the config, it gets propagated to all the servers without doing anything by hand? That is what you get with the Consul key-value store. You can set each agent to watch a certain key, and when that key is updated, you can immediately update your environment variables or application config. What is left for us is to figure out inter-process communication, which is going to be helpful when we do auto scaling as well.
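As a rough picture of what replaces that manual churn, here is a hedged sketch of a Consul agent configuration combining a service definition, an HTTP health check, and a key watch. The service name, key path, and handler script are hypothetical, not our actual configuration.

```json
{
  "service": {
    "name": "catalog",
    "port": 8080,
    "checks": [
      { "http": "http://localhost:8080/health", "interval": "10s", "timeout": "2s" }
    ]
  },
  "watches": [
    { "type": "key", "key": "config/catalog/settings", "handler": "/usr/local/bin/reload-config.sh" }
  ]
}
```

With a definition like this, the service is registered and discoverable as soon as the agent starts; Consul's HTTP checks map naturally onto the three states mentioned above (a 2xx response is passing, a 429 is warning, anything else is critical); and when the watched key changes, every agent runs the handler, so a single KV update propagates everywhere.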
So these are a few of the things coming up in our pipeline. Even though we have come this far, there is still a lot of exciting work ahead. Like I said, the serverless Lambda functions; IAM consolidation, which is basically how we secure resources and onboard and off-board people in a very effective manner; ChatOps, which is one of our big initiatives, where we are adding a lot of capability; and then chaos engineering, where instead of being passive and reactive to failures, we want to be proactive in anticipating them. And, yeah, canary releases: instead of updating all the production instances in one shot, can we update certain instances, direct a certain amount of traffic to the new version, and once we have a certain confidence level start routing 100% of the traffic to the new version, otherwise fall back? So, yeah, those are some of the things. Okay, with that, thank you.

Do we have time for Q&A, or maybe you can continue first? We'll do the Q&A.