Hi, so how many of you have worked with microservices? Quite a lot. How many of you are planning to move to microservices? Almost everyone has at least heard of them, right? A lot has been discussed about microservices at scale, but the rationale behind this talk is: how should we design microservices? Let me give an example of why that is needed. Say I want to build a food ordering app. Architect A will come in and say, okay, let's make restaurants and menus one microservice, make orders another microservice, and put a load balancer in front to route traffic. Then architect B will come in and say, no, no, let's make restaurants one microservice and menu items another. And a third person will come in and say, no, let's make each dish a microservice: paneer tikka will be a microservice, kadai paneer will be a microservice. So you see the whole spectrum of viewpoints on what a microservice is. This is somewhat analogous to normalization in databases: either you have completely normalized tables or you have one flat structure. We will look at that in this talk. Quickly about me: I have primarily worked at startups and built systems from scratch. I am interested in distributed computing and building scalable systems, and I thoroughly enjoy coding in Python (previously in C), so that means I am very, very picky about what I want to work on. The company I work for is Mad Street Den. We have a retail brand called Vue.ai, where we have B2B products — APIs hosted on Amazon Web Services — consumed by e-commerce companies and other brands for machine learning and personalization. So we deploy machine learning models at scale, we are hosted on AWS, and we serve around 200 ms latency
requests. These are some of our API numbers; we serve pretty heavy traffic, because a lot of e-commerce traffic funnels through us. This matters because the use case I'll walk through is very similar to what we actually do. So, microservices solve a lot of problems. What problems do they solve? Essentially, when you do microservices, you are building a graph: there are nodes and relationships. The nodes are the individual microservices; the relationships are the data flowing between them and how they talk to each other. Once you think of your system as a graph, that graph will keep evolving over time, and if you don't get the initial graph design right, the nodes become inflexible and you won't be able to move the pieces around like Lego blocks. So let's go ahead and design them. I did a small exercise: is there a common literature on how to design microservices? I went to Google, searched "how to design microservices", took the top ten pages, and this is the word cloud I got — a bunch of generic terms thrown at us. Out of those, two keywords made a lot of sense to me: data and database. Essentially, in engineering or software systems we are only ever handling data — communicating some data somewhere, or storing some data somewhere. You index it in different ways, you query it in different ways, you send it as a binary stream or as JSON; it does not matter, it is data. So let's keep data at the core of our microservices and see how the exercise turns out. As a follow-up to that: what are the primary ways in which we store data? Let's look at each of them. At the top of the list we have OLTP or relational
databases. These are the standard choice: if you don't know what database to use, you just use a relational database. You have multiple tables, you do joins across them, and you use them for typical web-app use cases. But they are bottlenecked on writes: there is a single writer, and they are not inherently distributed systems, which means they do not scale horizontally and your master is write-bottlenecked. Second, you have key-value or document stores. These are eventually consistent systems that scale horizontally: you can store a lot of data of one type and spread it across machines — users in a social network, tweets in Twitter — but you cannot do joins there. Then you have in-memory data stores, like Redis or Memcached, which keep the data in RAM rather than on disk and do the memory management for you. They are used wherever high performance is required, because the latency of a single Redis call is under 1 ms — very low compared to what an RDBMS or Cassandra can give you. These are essentially used as caches, and as queues that need low latency. Then you have OLAP or columnar databases, used for analytical querying, where data that would sit in a single table is split by column and stored across different machines, so that your analytical queries are very fast. You have search indexes, which are inverted indexes over your data: each item in your data is tokenized, and the term counts or frequencies across the whole corpus are computed and stored. And you have object stores, which are just flat files, but backed by huge infrastructure like S3, where you can throw any number of files at it and it will store them. This is the whole array of ways you can organize and store your data, each of which
has a different set of SLAs and offers a different set of promises, so we have to choose which kind of data store to use — and this is not an exhaustive list; more keep getting added. So let's take one use case, a very simple one: image classification. Given an image, classify a single attribute — the attribute I've taken here is whether the person is wearing full sleeves or half sleeves. This is the business use case I want to solve. How do I solve it? I have to build a neural network using computer vision: feed a lot of images into the system, train a model, and use that model to infer and serve results. If I draw a simple architecture diagram for this use case, this is how it looks. There is an image data source passing through an ingestion data pipeline — a set of queues — through which someone does the image classification modeling and pushes the model to a model store. There is also a bunch of metadata associated with the images — which user uploaded it, what the title of the image is, and so on — and that goes into a data query store. Finally, the API is a combination of which user is being served plus how the model was trained: the model inference service serves the request, and the API responds. Now look at how the data flows and where it is stored. The data query store is a multi-object system — you have multiple objects like users, permissions, accounts — so you essentially choose an RDBMS kind of system for it. For the model store, a single neural network takes somewhere between 200 MB and one or two gigs, depending on the number of layers, the architecture, and so on, so you obviously cannot use an RDBMS or
key-value system; you have to use something like S3, which is a file store, or block storage like EBS or EFS. So that store is very different, and data ingestion is a queue, which is different again. That is the simple use case diagram. Let's take another use case: product recommendation. Given a product, recommend more products with the same or similar attributes. This is essentially a search problem: here you see a person looking at a shirt or jacket, and they are recommended something of the same color or with similar attributes. How is this use case achieved? Again, with catalog and inventory data. Someone has to know what set of images is in the catalog in order to search through it, so it is a similar process: someone takes the inventory data, extracts features out of it, and puts them into a feature search store. A feature search is essentially a distance-function operation, which means you need a search-index kind of store, not an RDBMS kind of system. The data query store is similar to the previous use case, which is RDBMS. Then you can also use events information: which product someone viewed, which product they bought, and so on — if I have already bought something, that is a signal that I will buy more of it. So that is a structured event stream flowing into the system; events data modeling is done on it, and those models are stored. An event store is like a stream of tweets — there will be hundreds of millions of records — so it makes sense to put that into a key-value store. You could choose an RDBMS, but if you want to scale horizontally, you go for a key-value store. So that is a different system again. This is the overall architecture, where again the API is hitting a service and it is
going to a distinct search store; there is a pipeline pushing data to the different stores, and the results are being served. What we will do now is merge the two. I want to build a platform. Say the business comes up with two or three use cases today; tomorrow those use cases will change. That means the microservices graph cannot keep changing with business use cases. Say tomorrow they come up with a completely different use case that is not image-based but text-based — then there needs to be a microservice that supports text. So I want to build a platform. If I compress this into a single architecture diagram, this is how it comes out. There is a bunch of data pipelines writing to a bunch of data stores, each of which is purpose-built: a search index is purpose-built for search operations, an RDBMS for specific kinds of filter operations, a key-value store again for specific operations — I cannot do aggregations on top of a key-value store. Then there is a set of pipeline services that derive data out of those stores and write it back; for example, a feature is derived data — it doesn't arrive as part of the incoming pipeline. What you see in green are the compute services: they compute something at runtime and send back results, so these are the API-facing services. What you see below that, API aggregation, is a composite service: you want a single endpoint where, when a request comes in, it routes to the different microservices and finally assembles the results. Okay — I have now come up with this exercise to analyze microservices, and I see that I can derive architecture diagrams based on data and arrive at this thematic structure. But will I put this in production?
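That API aggregation idea can be sketched in a few lines — one composite endpoint fanning out to several microservices in parallel and assembling one response. This is a minimal illustration, not our actual service: the service names and payloads are made up, and the downstream calls are stubbed as coroutines where a real system would make HTTP or gRPC calls.

```python
import asyncio

async def call_search_service(query):
    # Stand-in for a call to the search-index-backed microservice.
    await asyncio.sleep(0.01)
    return {"products": [f"product matching {query!r}"]}

async def call_metadata_service(query):
    # Stand-in for a call to the RDBMS-backed data query service.
    await asyncio.sleep(0.01)
    return {"metadata": {"query": query}}

async def aggregate(query):
    # Fan out to both services concurrently, then assemble one response.
    search, meta = await asyncio.gather(
        call_search_service(query),
        call_metadata_service(query),
    )
    return {**search, **meta}

result = asyncio.run(aggregate("red jacket"))
print(result)
```

The point of the composite service is exactly this fan-out: the caller sees one endpoint and one latency budget, while the aggregator hides how many stores and services sit behind it.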
No. There are a bunch of questions that need to be answered about how to take this to production — this is an exercise from scratch to production, so let's answer them. The diagram actually gives us a pattern: the microservices are grouped by the functionality they perform, or the category they fall into. The green ones are API compute services; the yellow one is a composition service plus a gateway service. You have two kinds of pipeline services. One is batch — something you want processed, say, twice or thrice a day — where you would use a Kafka kind of system feeding a Spark kind of system; the software powering it is completely different. The other is a stream pipeline, where you use a streaming application. Then there are data reader microservices. What these do is shield the data stores from misuse: you want the stores used in a certain way, and if multiple microservices access them directly, they probably won't use them in the most prudent way. So these are the classes you identify, and then you go on to think about how they communicate with each other. Then, in the real world, when you deploy, there are SLA requirements — latency, throughput — parameters you have to support. The essential thing, when you want to support very high throughput at low latency, is that your system has to have caching enabled. That was not there in our initial architecture, which means we need an in-memory system that acts as a cache and flows through the system. Another case is microservice chaining: one microservice calling multiple microservices, or microservice A is
calling microservice B five times per request. If microservice B's latency is, say, 10 ms and it bumps up to 20 ms, then microservice A's latency contract has gone up by 50 ms. These kinds of cascading effects take place, so caching has to be handled at each layer of microservices. These are some of the considerations to take into account. Then you need to consider geographical regions. Say you start your business in North America: you pick the cloud provider of your choice — AWS or GCP — and provision an endpoint there, and suddenly you get a customer from Singapore. A single round trip from AWS North America to Singapore can take on the order of a second, so all your assumptions go for a toss: your API went from 200 ms to 1.2 seconds, just like that, and that cannot be your baseline platform design. What are the ways to mitigate it? If you are a read-heavy company, you can use CDNs, or you can have region-specific deployments. In some cases, where the data being read from any part of the world must be the same and dynamic, you need globally distributed databases, like what Google Spanner or similar systems support. Then logging and monitoring: I think all of us know these are very important, and we have figured out how to do them using ELK, but we have really not figured out feedback loops. That is one thing that gets missed: when an alert is raised, how is it closed, and what counts as a failure? Defining that, on top of your logging infrastructure, is very important.
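The per-layer caching point above can be sketched as a tiny TTL cache. This is a toy illustration under assumptions — the decorator, the TTL value, and the service names are all hypothetical — but it shows how caching B's response inside A turns five downstream calls into one, which is what damps the cascading-latency effect.

```python
import functools
import time

def ttl_cache(ttl_seconds):
    """Memoize a function's results for ttl_seconds (toy, single-process)."""
    def decorator(fn):
        store = {}  # args -> (expiry_time, value)
        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit and hit[0] > now:
                return hit[1]          # fresh cached value: no downstream call
            value = fn(*args)
            store[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator

calls = {"count": 0}

@ttl_cache(ttl_seconds=60)
def call_service_b(item_id):
    # Stands in for a ~10 ms network call to microservice B.
    calls["count"] += 1
    return f"attributes for {item_id}"

# Microservice A asks for the same item five times within one request:
results = [call_service_b("sku-42") for _ in range(5)]
print(calls["count"])  # only one real call reached service B
```

In a real deployment this role is usually played by a shared in-memory store like Redis rather than a per-process dictionary, but the effect on the latency contract is the same.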
We spoke about one type of data — data stored in data stores. Let's move on to another type: data communicated between microservices. Microservices need to communicate with each other, and they need to communicate really fast; the operation of communication between them should be cheap. That takes us to the conversation about a service registry. A service registry is where a microservice registers itself, saying: I'm here, and you can send me any communication at this endpoint. So that is the kind of system we can envision for a service registry. At peak, you can have, say, 20 to 30 production servers running at any time. I'm a big fan of feature flags, by the way — any feature we build, we put behind a feature flag and then push it out. So suppose you have a feature that is turned off and you want to turn it on. If communicating that change requires redeploying all 20 servers, that's a system failure; it cannot work that way. Instead, when someone goes to the service registry and changes the config for a service, it has to flow immediately into all 20 instances, and the feature flag switches immediately. The other use case here: we tend to think of config as each service managing its own, but service A can also send config changes to service B. An example: say service 2 has a feature flag called talk-to-cats, and it's turned off. For a specific user, service 1 wants to enable that feature. So service 1 sends a message to the service registry, and service 2 updates its flag immediately, at runtime, for that specific user. These are all ways in which communication can be made really cheap.
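The registry flow above can be sketched like this. It is a hedged, in-memory illustration — the class names and the callback mechanism are invented for the example, and a real registry would notify instances over a channel like a queue or Lambda rather than direct callbacks — but it shows the two behaviors described: a flag flip reaching every live instance without a redeploy, and one service changing another service's config.

```python
class ServiceRegistry:
    """Toy registry: services register an endpoint (here, a callback)."""

    def __init__(self):
        self.instances = {}  # service name -> list of notify callbacks

    def register(self, name, callback):
        self.instances.setdefault(name, []).append(callback)

    def push_config(self, name, key, value):
        # Notify every running instance of the service at once,
        # so no redeploy is needed to flip a flag.
        for notify in self.instances.get(name, []):
            notify(key, value)

class ServiceInstance:
    def __init__(self):
        self.flags = {"talk_to_cats": False}

    def on_config(self, key, value):
        self.flags[key] = value

registry = ServiceRegistry()
instances = [ServiceInstance() for _ in range(20)]
for inst in instances:
    registry.register("service-2", inst.on_config)

# Service 1 enables service 2's flag at runtime, via the registry:
registry.push_config("service-2", "talk_to_cats", True)
print(all(inst.flags["talk_to_cats"] for inst in instances))  # True
```

The design choice worth noting is that the registry, not the service, owns fan-out: each instance only has to register once and then react to pushes.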
For this specific use case of how microservices communicate with each other, we created our own microservice registry, called Vane. It's a simple web app, and we had four very small design principles when we built it. First, secure config management per client, per deployment, and per environment — we have globally distributed data center operations, so this is essential. Second, real-time changes, which I already spoke about: a config push from one service to another needs to be really cheap, on the order of 1 ms, not minutes — otherwise, as a developer, I would be really reluctant to push changes. Third, a resilient notification system: even if a microservice is down, the notification can never be lost. And fourth, a microservice can communicate its desired config changes to other services. It's a very simple service we wrote, nothing fancy: a PostgreSQL DB plus a Python web app. A service can register through SQS, an ALB, or an EC2 server, and it gets notified through Lambda over any of these channels. The flow is the one I discussed: a config change comes in, each microservice is assigned an admin or API key through which it sends its config changes, and the notification Lambda does exponential backoff — so even if a service is down, it will eventually get its notification, retried at two seconds, four seconds, eight seconds. If a service is down for, say, a minute, it doesn't matter; the notification will reach it. The way we handle instant config updates is that each server runs Redis inside the machine. Redis being single-threaded, this is very efficient, and the config changes are not actually loaded into the memory of the process.
There are 20 to 30 processes running in each server, and each process reads its config from the local Redis, which is a very low-cost hop. The config update goes to Redis, and the server picks up the change instantaneously. Very simple but very effective — and it all works because Redis is single-threaded, so there are no race conditions. There is another flow where service one sends a message to service two; it's the same flow, just that depending on the channel, it goes and updates service two. Okay, so we have considered a bunch of requirements for our production infrastructure: latency, geography, logging, and so on. If I put together a single core platform architecture out of all this, it looks something like this. What's new here? There is a bunch of data pipelines writing to online data stores, and there is a data lake, or offline data stores. There is the caching infra, newly introduced, which powers all your low-latency API aggregation requests. Everything you see above the dotted line is the platform; everything below it is apps or products that can be built on top of the platform. Since every service is designed according to the function of the data it handles, there is very little chance that a service will need a systemic change over the course of the product roadmap. So you can build multiple SaaS apps on top of it — in fact, that's what we actually did. At the right, we have the service registry and discovery and the notification service, all the components we discussed, which tie together all these pieces into one infrastructure. Let me quickly summarize the learnings from this whole exercise. It is the same question as normalization, right?
How do you normalize your microservice graph to the optimal level? At one end you have brittle, at the other end rigid; you need to find the balance. Don't have a single point of failure — but that doesn't mean a completely decentralized system either; you can have multiple, contained points of failure whose individual impact is small. And master your data stores: a single data store takes time to master, and you have to respect that and carry it along with your architecture. The third learning for us: every API needs to have both a synchronous and an asynchronous version, because every operation can take time — you can return results immediately, or the caller can submit a batch and fetch it back later. If this paradigm is enforced across your microservices, it becomes very simple, because no developer has to come back later and say, oh, I need to introduce an asynchronous API for you. We felt that needed to be mandated. We spoke about real-time feedback loops: how do I make alarms act as input to my system building? And communication between microservices — solving that on day one, especially since we are a multi-region, multi-deployment kind of setup, solved a lot of problems for us. Finally, we also declare microservices end-of-life. You maintain a microservice as long as you think it has life; otherwise — say you have a Spark-based batch system and you have migrated everything to real-time — you don't want to keep hanging onto it. Just an example. So yeah, that summarizes the learnings we have had. Questions?

What if the scalability has to be on a massive scale? As you said, one of the examples is when we need to put the whole config on a global level.
Any other example you can give on this?

So we kind of contained our scale limits. Your app layer will scale horizontally, but your DB layer has its limits. So we benchmarked: each read replica will support this much scale, and if the demand is predicted, we'll add, say, three read replicas, and so on.

What about security?

Yeah — like the previous talk said, security was definitely not the first focus when we started building; it remained at the perimeter level. But then we really started hardening our AWS instances: golden AMIs, reducing the surface for intrusion, so we introduced IDS and things like that. In terms of security from the data point of view: for all data stores, mandate encryption from day one, and all communication is over HTTPS — there is no unencrypted communication happening. From the microservices-and-data angle, that was the main thing.

Question here: you gave the analogy of, don't put all your eggs in one basket, but also don't put one in each basket. Can you talk more about that — how does it relate to microservices? Is it about too fine-grained microservices?

Yeah, like the example I gave, where some companies, like Bosch, have 500 to 600 microservices. At that scale of things, you've decentralized too much, and you basically need the workforce to support it. The practical ratio is something like three to five members per microservice, with a maintenance cadence where the team has ownership and maintains it over time. But that ownership falls flat at some point, because there are multiple code bases and they eventually get merged into one. Another example is when you slice things really fine:
You have one service that reads from a database, hands the data to something else, and does a compute. We have all seen that. It's often really unnecessary — someone says this is an IO operation, that is a compute operation, they have to be kept separate. But that distinction is too fine-grained; if your gut feeling is that the ownership really lies with one team and one microservice, keep it as one.

Yes, we do merge them. The graph keeps evolving. You should try to make the best graph initially, but evolving it shouldn't be difficult — that's the basic criterion. However, if you get the database layer and its SLAs wrong, you have to do complete migrations, which is a different level of evolution. Say you have streams of data coming in that will scale to 300 or 400 million records, and you decide to put them in an RDBMS: eventually you will have to move off that system, because you cannot scale it out. If your choice of data store is wrong, you will definitely have to rewrite.

What is your recommendation for error handling in communication between microservices? A is calling B, which is calling C, and something went wrong.

Yes. One part is distributed tracing: in the ingestion pipeline, every incoming request is assigned a unique ID. So the first part is monitoring, and the second is error handling. For monitoring, we have dashboards where we can clearly see that ingestion has not completed for this many products — because you have that unique ID, and if you pivot on it, you get that data on a dashboard. The second part is fallback services: service A is failing, and there is extra pressure because you keep pushing more traffic at it, which creates back pressure.
So you have fallback services or fallback data stores. For example, we have an RDBMS system that acts as a fallback for a Solr system, though that's not the RDBMS's primary purpose — it serves a different function. In case the Solr system starts failing, we route the traffic to the RDBMS.

Right — some services are multi-tenant, and different services are multi-tenant in different ways. A single service can process multiple clients' data; that's typical SaaS multi-tenancy, where you put all your clients in one DB. But some services cannot be handled like that. For example, if I'm putting up a Solr cluster for one client with a huge data set, that data set will be so big that I have to spin up another cluster for the next client. So the handling can be different for different customers. Multi-tenancy is not a single solution — it's not that you either put all the clients together or split them all; it's different for different services.

And from your data pipeline — can you explain what technology you are using to deliver the data pipeline?

Right, so I'll repeat what I said: we have multiple solutions for the data pipeline; a data pipeline cannot be just one thing. We have a Spark-based data pipeline, and we have our own in-house data pipeline built on Kinesis or SQS on AWS. In terms of data stores, if your data is scaling out a lot, you cannot keep adding it to online stores. For example, journaled data: the online data store holds what the price of a product is currently, but the offline system holds the full history — at this date the price was this, at that date the price was that. So it holds time-journaled data, which scales very differently.
But you don't want to lose that data, so you move it all to offline data stores. The modeling service needs a lot of data: to generate features for machine learning, it looks up the offline data store. Modeling services are basically batch jobs that read that data, generate models, and push the results back out to the online data stores. The end services query only from the online data stores; the offline data stores essentially act as input to the modeling services.
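That offline/online split can be sketched roughly like this. It is a toy illustration under assumptions — the store layout, the SKU, and the derived fields are all invented — but it shows the shape of the flow: the full price history lives offline, a batch modeling job derives a compact result from it, and only that derived result lands in the online store that serving-path services query.

```python
offline_store = {
    # Time-journaled data: the full history, which scales very differently.
    "sku-42": [("2024-01-01", 100), ("2024-03-01", 80), ("2024-06-01", 90)],
}
online_store = {}  # the serving path reads only from here

def modeling_batch_job():
    # Batch job: read offline history, derive features, write online.
    for sku, history in offline_store.items():
        prices = [price for _, price in history]
        online_store[sku] = {
            "current_price": prices[-1],
            "avg_price": sum(prices) / len(prices),
        }

modeling_batch_job()
print(online_store["sku-42"])  # {'current_price': 90, 'avg_price': 90.0}
```

The key property is that the online store only ever holds the small, derived view, so it never has to scale with the journal.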