Are Saturday mornings too hard for folks to come by now? Did the pandemic change that? Because this is when most meetups used to happen, right? Exactly — Saturday mornings and Sunday mornings or evenings have been the Bangalore norm for at least the five or six years I've been here. That's right, though there's obviously the overhead of physically travelling to the venue. Okay — we seem to be live now, so welcome to the March edition of the Observability Meetup Bangalore. People are still turning up, so we'll give it a couple of minutes for more folks to join before I hand it over to Vivek. So Vivek, let's give it around five minutes and then we'll start; I'll just say hi to people on YouTube. For folks watching on YouTube: feel free to put your questions in the chat and we'll relay them to the Zoom session here. We'll pick roughly the top five or six questions from all participants, and at the end of the talks Vivek — and later Piyush, who is joining us as well — will try to address them. One rule of thumb: please phrase your questions as actual questions rather than statements, which sometimes ends up happening — that makes it much easier for us to prioritize a question if it's pertinent. Okay Vivek, since we're already live, let's just start; people will keep trickling in, and the YouTube live stream count is already peaking, so at least folks are watching there, which is great. I hope everyone will join us here, but with that, why don't you begin by introducing yourself a bit — what do you do?
Sure. First of all, hello to all the folks listening, and thank you for sparing time on a Saturday morning to join the meetup. Coming back to your question, Joy: I've been with Razorpay for quite some time now, over three years, and it's been fun working with them. We've been working on and solving most of the observability platform problems there — how to scale metrics, how to scale traces, everything at scale — and that's what this talk will revolve around: how we solved Razorpay's metrics problems with VictoriaMetrics. We'll discuss what the go-to solution turned out to be and the journey that brought us to the point where we decided on VictoriaMetrics for scaling metrics.

Cool — welcome, Vivek. I see Piyush is also here; he's our second speaker and will take the second session. Welcome. Vivek, thanks a lot for agreeing to speak today — it should be a great talk. We already have around 10–15 people watching on YouTube, so why don't you start? Over to you; you can share your screen.

Thank you, Joy. I'm sharing my screen now — I hope it's visible. (Yes, all good, and we can hear you clearly.) Awesome. Hello folks, once again welcome — we'll be talking about VictoriaMetrics at scale today, at the March edition of the meetup. I've already given a brief introduction about myself: I work as a lead DevOps engineer at Razorpay, and I try to blog about the things I come across as learnings, so you can always check out my social media handles.
Going forward, here's the agenda for the next approximately 50 minutes to an hour. I'll start with Prometheus, the well-known monitoring solution for Kubernetes — whenever you move to a microservices model and start thinking about monitoring, you end up at Prometheus. We'll talk about the issues that come with Prometheus and possible solutions for them; then the main topic, VictoriaMetrics — how it works and what its components are; then what scale we're actually referring to here, because "scale" on its own means nothing; and the next important question that comes up — how much do you actually spend on the infrastructure. At the end of the session we'll keep some time for your questions, and I'll answer them as best I can.

Okay, so: metrics solutions. If you're new to Kubernetes and you google "how do I monitor my Kubernetes cluster", the first thing you find is Prometheus — it's the first candidate for any Kubernetes monitoring solution. There are good reasons Prometheus is so popular in the community. It's very easy to configure: you don't have to do a lot of juggling when you're just starting out, and there are Helm charts available for easy installation, so you have a good variety of options. There are also lots of exporters available: if you want to monitor node health you have node_exporter; for the Kubernetes cluster itself you have kube-state-metrics, cAdvisor, and so on. Almost everything coming out in the industry right now emits metrics in the Prometheus format, so the community keeps moving toward it. And the last point follows from that: being among the first to solve the monitoring problem, Prometheus gained a lot of traction. Like everyone else, when we moved to Kubernetes — three to three and a half years back — we chose Prometheus as our monitoring solution, as anyone would have. But things change with scale. It's easy to maintain a 10- or 20-node cluster; when you go beyond 100, 200, 300 nodes, it becomes very difficult to manage, because scale changes a lot of things.

Before I get into the issues, a quick brief on the basic Prometheus architecture. On the left side of this diagram you have all your exporters — kube-state-metrics, node_exporter, cAdvisor, and so on. Then there's the Prometheus server, which scrapes all these metric endpoints based on whatever targets you've configured. Effectively it makes an HTTP GET to each target's /metrics endpoint (the path and everything is configurable, but /metrics is the standard), grabs the metrics, and keeps them in its own local storage, which is a persistent volume, as you see here. All these metrics are then visualized via Grafana, and for the alerting side you have Alertmanager, broken out into a separate component, which does the job of delivering notifications to Slack or any other webhook you configure.

So that's the basic Prometheus architecture. Now let's talk about the issues and problems. I believe anyone who has installed Prometheus for their monitoring infrastructure will be familiar with a graph like this: you get metrics up to a point, then a gap, then the metrics resume. It's a very common scenario, because Prometheus doesn't support an HA design — its own GitHub repo and documentation state that, by design, it will not go HA. There is something called a federation server, which links multiple small Prometheus servers together, but that still isn't HA: if one Prometheus server is down, it's down, and that's why you see gaps in your metrics if you haven't set things up carefully. Then there's the problem of long-term storage: as the number of metrics grows, your volume size keeps increasing, the WAL grows with it, and whenever Prometheus restarts it takes a long time to replay the WAL and come back up — you'll find many issues in the Prometheus community about exactly that. And you have a single server doing both scraping and visualization: one process is scraping all the targets and is also serving the queries coming via Grafana. On top of that come the failures — out-of-memory errors, high CPU, targets that cannot be scraped — all issues reported on the Prometheus open-source repository, and a host of similar problems.
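To make the scrape model above concrete, here is a minimal sketch of the Prometheus text exposition format that a scraper sees at a /metrics endpoint, with a toy parser. This is illustrative only — the sample metrics are made up, and real scrapers use the official Prometheus parsers, which also handle escaping, timestamps, and exemplars.

```python
# Toy parser for the Prometheus text exposition format served at /metrics.
# Each sample line is: metric_name{label="value",...} numeric_value
import re

SAMPLE_SCRAPE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 4.19e+06
node_cpu_seconds_total{cpu="0",mode="user"} 52340.12
# HELP up Whether the scrape target is reachable.
up 1
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional label set
    r'\s+(?P<value>\S+)$'                    # sample value
)

def parse_exposition(text):
    """Return a list of (name, labels_dict, value) for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        labels = {}
        if m.group("labels"):
            for pair in m.group("labels").split(","):
                k, v = pair.split("=", 1)
                labels[k] = v.strip('"')
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples

samples = parse_exposition(SAMPLE_SCRAPE)
for name, labels, value in samples:
    print(name, labels, value)
```

Prometheus does essentially this over HTTP for every configured target on every scrape interval, then appends the samples to its local TSDB — which is why target count directly drives its memory and CPU usage.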
The root cause of these is that with an increasing number of nodes and an increasing number of targets, everything keeps landing on the same TSDB storage underneath — it's local storage, and it keeps growing. So whenever you run queries on top of it, it uses a lot of CPU and a lot of memory: it's both memory- and CPU-intensive, chewing through memory first, after which your CPU gets blocked waiting on it. A similar thing happened to us at Razorpay: we had a lot of downtime on the monitoring infrastructure purely because Prometheus wasn't stable enough. We tried hard to patch around the problem, but that wasn't the ultimate solution, so we started looking for alternatives to Prometheus. I'll briefly touch on the other options — Thanos and Cortex — before we dig deeper. Thanos was the first alternative we tried, and it's very good: it solves long-term retention by using a remote backend like S3 or something of that sort. The problem we hit is with the S3 object layout: in our experience Thanos didn't lay objects out under directory-style prefixes, and once you have tens of thousands of small objects in a bucket without such prefixes, S3 API operations become very slow. That's where Thanos became a bottleneck for us. The next thing we tried was Cortex.
Cortex, on paper, looks very good — it has separate components for each different problem segment. But for us, at least, it turned out to be too many moving components, too many knobs, all the time; it became very difficult to manage. And that's when we came to VictoriaMetrics and started evaluating it.

I'll explain it with this picture. Your entire monitoring infrastructure with Prometheus is this one big fish — a monolithic kind of architecture. You're doing everything in a single process: scraping the targets from the same server, computing your alerting rules on the same server, answering queries on the same server. What's required is to divide it into multiple small fishes that can each solve their own problem, so that each one takes care of one concern. That's exactly what VictoriaMetrics did, in a big way. It divided things into three parts: storage and querying (where you store and query data), scraping, and alerting. And since storage-and-querying is itself a big part, it's further divided into three components — vmselect, vminsert, and vmstorage; I'll come to the detailed architecture shortly. Pictorially, the big fish is split into small components that each take care of their own responsibility, so even if one component goes down, the other functionality keeps working — one failing piece doesn't take everything else down with it. That's the basic underlying idea of microservices.

Now enter the queen: VictoriaMetrics. This is the common cluster architecture. On the storage-and-querying side you have vminsert, vmstorage, and vmselect, and then your regular Prometheus — or multiple agents — doing the scraping. Prometheus, InfluxDB, Graphite, OpenTSDB, anything of that sort can act as the agent, and this cluster becomes your server. vminsert is the entry point for agents writing data into storage. vmstorage is the central layer where all your data actually lives: it takes data from vminsert and writes it to disk, and there's buffering in between, so vminsert doesn't have to wait on the storage write. vmselect is the component that serves your queries: whenever you fire a PromQL query from Grafana, vmselect talks to vmstorage and returns the data. That's how the responsibilities keep getting segregated.

Now let's talk about the statefulness of the components. The only stateful component in the entire architecture is vmstorage — that's where the actual data resides, with PVs attached to the vmstorage pods so the data keeps getting written there. And even within the stateful part, it's not a single point of failure: it's a distributed system, and you can configure replication so that your data doesn't live on only a single disk or a single node — even if a node goes down, other nodes can still serve the data.

That's the server side, so let me summarize the points. First, the whole thing is very simple to operate: you don't have to fiddle with a lot of configuration — that's the best USP of VictoriaMetrics. You take the VictoriaMetrics cluster deployment, deploy it, and everything goes smoothly. Second, there's a clear separation between writes and reads, as I've mentioned: writes are taken care of by vminsert writing to vmstorage, and reads from Grafana come through vmselect. Think of a scenario where all your vminserts are down, or scaled to zero replicas — you can still query the older data, because your vmselect and vmstorage are up. You're not denied the ability to query history, and by the time your vminserts come back up, the buffered data flushes through to vmstorage again, so you don't lose data. The only catch is the time window when vminsert goes down suddenly: any remote-write client keeps data in a buffer. With Prometheus in particular — and remember, at this point we were still keeping Prometheus as the scraper — it scrapes the targets and uses its own remote-write API to push to vminsert, and that remote write works on a write-ahead log (WAL) mechanism: it tracks how much data it has already delivered and buffers the rest, so the WAL retention is roughly the window for which you can afford to have vminsert down. That's not infinitely adjustable, though, and it's obviously not a state you want to stay in.

Again: vminsert and vmselect are all stateless, so they can scale out independently, more or less infinitely. You can run them on default configuration — no extra tuning is required for storage, ingestion, or reads. And the other big USP on the storage side is the high compression: effectively, if you're using one terabyte of disk with regular Prometheus, VictoriaMetrics can store the same data in around 300 GB — that's the best part of the compression it does. Some highlights: you don't have to change anything in the applications exposing metrics, your exporters don't have to change — everything on the Prometheus side is supported. And from the Grafana side you don't even have to add a different data source type: you just change the endpoint to point at the vmselect pods, or the vmselect load balancer, and you're done. It also supports many other agents apart from Prometheus — Graphite, JSON lines, InfluxDB, and so on. And vmstorage is distributed with a replication factor, as I've mentioned: you can have five nodes with a replication factor of three, so the same data exists on three different nodes.
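The replicated write path can be sketched with a toy model. This is hypothetical illustration code, not VictoriaMetrics' actual routing logic (which is considerably more sophisticated): each series hashes to a starting storage node and is written to the next R nodes in a ring, so a single node failure doesn't make the data unreadable.

```python
# Toy sketch of replicated writes across storage nodes (hypothetical;
# names like owner_nodes/insert/select are illustrative, not a real API).
import hashlib

NUM_NODES = 5
REPLICATION_FACTOR = 3
nodes = [{} for _ in range(NUM_NODES)]  # node index -> {series: [samples]}

def owner_nodes(series, n=NUM_NODES, rf=REPLICATION_FACTOR):
    """Hash the series to a start node, then take rf consecutive ring slots."""
    start = int(hashlib.md5(series.encode()).hexdigest(), 16) % n
    return [(start + i) % n for i in range(rf)]

def insert(series, sample):
    """The 'vminsert' role: fan each sample out to all replica nodes."""
    for idx in owner_nodes(series):
        nodes[idx].setdefault(series, []).append(sample)

def select(series, down=()):
    """The 'vmselect' role: read from any live replica that holds the series."""
    for idx in owner_nodes(series):
        if idx not in down and series in nodes[idx]:
            return nodes[idx][series]
    return None

insert('http_requests_total{job="api"}', (1700000000, 42.0))
# Even with one owning node down, the data is still readable from a replica:
first_owner = owner_nodes('http_requests_total{job="api"}')[0]
survivors = select('http_requests_total{job="api"}', down=(first_owner,))
print(survivors)
```

The design choice this illustrates: replication lives entirely in the stateful layer, which is why the stateless vminsert/vmselect pods can be scaled, killed, or run on spot instances without risking data.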
The question arises: if my data is on three different nodes, how does it deduplicate? It has to dedup, and it has dedup built in, via a setting called the dedup min-scrape interval (`-dedup.minScrapeInterval`). This deduplication matters at two layers. One is at the query layer, because of the replication factor, as I just described. The other: since Prometheus doesn't support HA by default, what you do is run two different Prometheus instances scraping the same data. Now, if both are scraping the same data and pushing it to the same VictoriaMetrics, the vminsert layer has a problem: it has two near-identical metrics at near-identical timestamps, which shouldn't be stored twice. That's where `-dedup.minScrapeInterval` comes in: it compares the timestamps, sees that two samples for the same series are closer together than the configured interval, drops one of them, and stores only a single sample. That's the deduplication factor. It also supports multi-tenancy: many organizations have multiple teams, multiple business units, and whatnot, and if you want segregation by business unit, team, or project ID, you can do that via the multi-tenancy support, using namespaces or tenant IDs.

Okay — so far we've taken care of the storage side and the querying side. Our storage is sorted and our querying is sorted: you query via one component, and data is stored by another. But we haven't replaced the agent yet — the agent doing the scraping is still Prometheus.
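The timestamp-based dedup idea can be sketched in a few lines. This is a simplified illustration of what a `-dedup.minScrapeInterval`-style setting does, not VictoriaMetrics' actual implementation (which, for instance, keeps the sample with the highest timestamp within each discrete interval rather than the first one seen):

```python
# Simplified min-interval deduplication for one series.
def deduplicate(samples, min_interval_ms):
    """samples: list of (timestamp_ms, value) for a single series.
    Keeps a sample only if it lands at least min_interval_ms after
    the previously kept sample."""
    kept = []
    last_ts = None
    for ts, value in sorted(samples):
        if last_ts is None or ts - last_ts >= min_interval_ms:
            kept.append((ts, value))
            last_ts = ts
    return kept

# Two HA Prometheus replicas scraping every 30s produce near-duplicate
# samples a few milliseconds apart:
replica_a = [(0, 1.0), (30_000, 2.0), (60_000, 3.0)]
replica_b = [(5, 1.0), (30_005, 2.0), (60_005, 3.0)]
merged = sorted(replica_a + replica_b)

# With the interval set to the scrape interval, the duplicates collapse:
deduped = deduplicate(merged, min_interval_ms=30_000)
print(deduped)
```

Setting the dedup interval equal to the scrape interval is what lets the two HA replicas' streams collapse back into a single clean series.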
Even if Prometheus is only doing the scraping, it gets heavy when it's scraping a lot of targets. When I say a lot of targets: suppose a microservice runs around 50–60 pods, and Prometheus is monitoring hundreds of such microservices, each running that many pods. The number of individual metrics keeps increasing, and past a point Prometheus takes a lot of memory and CPU — a lot of resources — just for scraping, and it becomes very difficult to manage. That's where you can replace Prometheus with vmagent. vmagent is a direct replacement for Prometheus: you don't have to change any of your configuration, you don't have to change any of your exporters, you don't change anything — just swap Prometheus for vmagent and it works smoothly.

The reasons for replacing Prometheus here: Prometheus is a resource hog, as we've already seen in the earlier issues, and there are multiple reports of Prometheus going out of memory and crashing because of that resource heaviness. One scenario we hit a lot as metric counts grew was WAL corruption. And if you go to the Prometheus documentation, it itself says that when you use the remote-write or remote-read features, Prometheus starts taking double or triple the memory it would take without them. Since using VictoriaMetrics as the server relies on exactly that remote-write functionality, that meant a lot of WAL corruption, even more memory utilization, and hence the cost associated with it. There were also delays in pushing metrics in some situations, when the network choked and things of that sort.

vmagent, on the other hand, is very lightweight. I'll give you a scenario: when we were using Prometheus just for scraping, across multiple business units, we needed around eight or nine dedicated nodes running Prometheus. When we replaced it with vmagent, two or three nodes were more than enough — we even had to reduce the instance type, because the resources weren't being used. That's the magnitude of the resource consumption difference we're talking about: 3x or 4x. It's Prometheus-compatible, so it integrates directly with all your exporters and everything. And it's stateless: as we've seen, VictoriaMetrics stores the data at the vmstorage layer, so there's no storage on the vmagent side, which means you can run any number of replicas for a single target set — with deduplication supported on the insert side so none of your data is double-counted — and you get rid of those gaps in the graphs that I showed earlier.

Then comes vmalert. As I mentioned, we divided things into three parts: storage and querying, scraping, and alerting. We've already discussed how storage looks, how querying looks, and how we scrape via vmagent; the last part is alerting, which until now was still handled at the Prometheus layer. That piece is
replaced by a component called vmalert. What it does is compute all your alerting rules; when a rule fires, it sends the firing alert to Alertmanager, which in turn delivers the notification to your Slack channels. Within the alerting rules you can configure different types of tags — a runbook URL, your Grafana URL, and whatnot. So vmalert evaluates the alerts against VictoriaMetrics and sends firing alerts to Alertmanager.

Cool — let's talk about scale. What scale are we referring to? 2.793 trillion data points at any given point of time: that's the volume any single query may have to reach into to look at what's happening and give you an answer. And our data points are growing at a rate of 958 thousand per second — that's the number of new data points added to the server every second — with each data point averaging about 0.33 bytes. Now, the best part: even with this much data, only 782 gigabytes of storage is used. Note that this storage is across five vmstorage pods with a replication factor of three, which means the figure already includes storing the compressed data three times over. That's the level of compression VictoriaMetrics supports, and it saves a lot of disk space. And that matters, because whatever storage type you're talking about — EC2 SSDs or anything of that sort — these are still disks, not RAM, and that means they're slow: the more data you push through the file descriptors, the slower things get, and we're making network calls on top of that.

Now for something that will fascinate a lot of engineers: every company wants to save cost, every penny is hard-earned, and there are cost-review meetings in almost every organization. So how much do we spend on this entire infrastructure? The best part about vmselect, vminsert, and vmalert being stateless is that it gives us the flexibility to run them on spot nodes. We run all these components 100% on spot, across a variety of spot instance types, with HPA enabled. Our minimum configuration is just two pods running outside peak hours; during office hours, when more people are querying and opening dashboards, the HPA kicks in and scales the vmselect pods up. And since everything runs on spot, it hardly makes a price difference. The only component that runs on-demand is vmstorage, because it's stateful — although even that can still run on some percentage of spot, depending on how your spot architecture looks. Assuming we have five vmstorage nodes, all on-demand instances, we spend around 13-odd dollars a day, plus roughly three dollars for the three spot nodes running all the other components — and that's with headroom left for other pods to come in and schedule, and for HA, so that a single EC2 instance going down doesn't hurt.
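A quick back-of-the-envelope check of the numbers quoted in the talk — 782 GB stored for 2.793 trillion points ingested at 958k points/second, with roughly $13/day for on-demand storage nodes plus $3/day for the spot nodes. The derived figures (bytes per point, implied retention window) are my arithmetic on the quoted values, not numbers stated by the speaker:

```python
# Sanity-checking the quoted scale and cost figures.
TOTAL_POINTS = 2.793e12        # data points addressable at any time
STORED_BYTES = 782e9           # across 5 vmstorage pods, replication factor 3

# Effective storage cost per data point, replication included:
bytes_per_point = STORED_BYTES / TOTAL_POINTS
print(f"storage per data point: {bytes_per_point:.2f} bytes")

# How long does it take to accumulate that many points at the quoted rate?
POINTS_PER_SECOND = 958_000
days_of_data = TOTAL_POINTS / POINTS_PER_SECOND / 86_400
print(f"~{days_of_data:.0f} days of data at the quoted ingest rate")

# Daily spend: on-demand vmstorage nodes plus spot nodes for the
# stateless components (vminsert / vmselect / vmalert).
daily_usd = 13 + 3
print(f"core infra spend ~ ${daily_usd}/day")
```

The bytes-per-point figure lands well under a byte even with triple replication, which is consistent with the "roughly a third of a byte per sample" claim, and the two per-day figures sum to about $16 — comfortably inside the under-$18 total.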
Running the vmagent pods in HA for scraping adds very little on top: the requirement there is only about 500 MB and 10 millicores — so small that I couldn't even find a properly right-sized instance type for it. So the total spend for supporting this many targets, this many data points — trillions of them — is under 18 dollars a day, which is very nice, and you don't compromise on availability for it. Whenever we talk about cost, everyone says availability will take a hit if you push too hard on cost. That's not the case here. We've made each and every component HA: replication across the data storage nodes, multiple vminserts, multiple vmselects, HA on the vmagent pods. And with all of that, you're still spending a fraction of what the same amount of data and the same query load would cost on a vanilla Prometheus architecture — or on Thanos or Cortex, which cost a lot more because of the complexity of the architecture they build up.

I've added some references on the slide. There's a very good blog post on Medium comparing benchmarks of vanilla Prometheus against VictoriaMetrics, which also looks at Thanos and Cortex under the hood — do check it out. It shows, with proper metrics for storage, memory, and CPU consumption, roughly a 7x difference. That's also a claim on VictoriaMetrics' own website and documentation: that by these benchmarks it's up to seven times better. Whenever we read such statements we feel they can't be true — and when you actually implement it yourself, results will differ based on the hardware you're using, your architecture, and the metrics you expose — but out of my personal experience as well, vmagent, and VictoriaMetrics as a whole across all its components, has served us very well, and it has proved to be one of the lightest options available for a Prometheus-compatible monitoring stack. That's pretty much everything from my side; I'll leave the next five to ten minutes for any questions you have.

Thank you, Vivek, for such a wonderful, concise talk on VictoriaMetrics and its benefits and comparisons against the other toolkits — thanks for coming here and speaking about it. We have audiences both on the YouTube live stream and here on the Zoom call; if anyone would like to pose a question or two, you can unmute yourself and ask Vivek directly — I'm sure he'll try to answer most of your queries. I see some folks here — Srikanth, Rakhi, Avinash — any of you have questions on the talk? If nobody else, I have a question I wanted to ask. How much time does it usually take to set up the complete set of components and get things up and running for monitoring, and which components take the most time? Because a concern people have, apart from running Prometheus, is that it takes them a month to a month and a half to set things up. So I'd like to know how much time you took, and which components were the
Sure, thanks for the question. There are a lot of considerations when you talk about setting up a monitoring infrastructure, and when it takes a month or a month and a half, there are usually a lot of decisions folded into that. A vanilla setup of VictoriaMetrics, the agent and all the components, takes a couple of hours, not more than that, to bring up the entire architecture. What actually takes time is understanding the pieces: when you do the Helm install using the VictoriaMetrics chart, it installs a number of sub-components, including nginx-based ingress, so it takes a good amount of time to understand what your account ID is and which endpoint Grafana should use for remote write or remote read. That is the first thing you should look up in the documentation; the documentation is pretty good, decent enough to have all the details, and it will not take more than a couple of hours, or at maximum a day, to set up the entire architecture.

Okay, so what about scaling things up, reaching a scale that a normal organization would use? Is capacity planning one of the time-taking things, or is exporting metrics one of the time-taking things? Any other stuff that consumes time? I would not say a couple of hours is the total time it takes for an organization to go from zero to a complete monitoring setup, right? Correct. So what are the components there, and what are the time-consuming things?
Sure. For capacity planning, VictoriaMetrics has some very good guidance with reference to how much data you have. But for any startup organization it becomes very difficult to know in advance how many metrics you will be emitting. It took us a considerable amount of time too, to say, okay, we are emitting this many metrics and this is the granularity at which we are emitting them. There is no straightforward formula for it; it is about how you keep tracking things, and you scale as you go. You will not throw petabytes of data at it in a single shot; you grow incrementally. And within VictoriaMetrics, the only component you have to take care of growing horizontally is the vmstorage node, which has some hassles: if you add another vmstorage node, you have to add that node's address on vminsert and vmselect as well. That is one mistake we made during our setup. We were hitting scale and having resource crunches, as the storage nodes were not responding properly, so we added storage nodes, but we did not add those node addresses on the selectors and the inserters. After some debugging we got to know this is something you have to keep note of, because vminsert and vmselect do not do service discovery; they do not use a service model to identify all your storage nodes. You have to explicitly provide them the storage node addresses, for example via DNS resolution of a headless service.
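The gotcha Vivek describes can be sketched in a few lines. This is a toy illustration of the failure mode, not VictoriaMetrics' actual hashing: vminsert shards incoming series across the storage nodes it knows about, so if vmselect's node list is shorter, series written to the new node never show up in query results.

```python
import zlib

def route(series_name: str, storage_nodes: list[str]) -> str:
    """Pick a storage node for a series (toy modulo hash, not VM's real algorithm)."""
    return storage_nodes[zlib.crc32(series_name.encode()) % len(storage_nodes)]

write_nodes = ["vmstorage-0", "vmstorage-1", "vmstorage-2"]  # vminsert knows the new node
read_nodes = ["vmstorage-0", "vmstorage-1"]                  # vmselect was not updated

series = [f'http_requests_total{{pod="p{i}"}}' for i in range(100)]
# Series whose writes land on a node the query path does not know about:
invisible = [s for s in series if route(s, write_nodes) not in read_nodes]
print(len(invisible))
```

Once both vminsert and vmselect are given the same full node list, the "invisible" set is empty, which is exactly why the fix was adding the new addresses on both components.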
Cool, that was interesting. Sort of a corollary to what Ankit asked: what happens on bursty workloads? What happens if tomorrow you release a new service and it suddenly starts seeing a very large amount of traffic? How does vmstorage scale when you suddenly go from 3.3 billion events to double that, maybe over the span of a day? Would the storage handle that sort of bursty load, or would it start dropping samples, leaving you with a lot of blank canvas to deal with? What happens in that scenario?

So what happens in this kind of scenario is that you have a component above your storage nodes on both the querying layer and the inserting layer, and what this question is really about is the insert side of things. What the vminsert pods do is keep a buffer before sending data to the storage nodes. The drawback in this particular scenario is that, when you query data on your vmstorage nodes via vmselect, you will see gaps at the most recent timestamps: if the load doubles and vmstorage cannot handle it, or is writing very slowly, you will start missing your last five minutes of data, and it will take some time for that data to fill back in. But none of the components will go down, because vminsert takes care of keeping things in the buffer; it acts as a queuing system. You can think of it as a Kafka queue, or any queue where you keep pushing messages and a message is popped off as and when the consumer is healthy enough to take it. That definitely comes with the drawback that, if the load is too extreme, vmselect will not be able to show you the most recent data you are expecting out of a query.
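The queue-in-front-of-slow-storage behavior Vivek describes can be mimicked with a toy buffer. This is an illustration of the behavior only (delayed visibility, no data loss while the buffer holds), not VictoriaMetrics code:

```python
from collections import deque

buffer = deque()   # vminsert's in-memory buffer (toy)
storage = []       # what vmstorage has persisted, i.e. what vmselect can see

def ingest(sample):
    """Fast producer: scrape traffic keeps arriving regardless of storage speed."""
    buffer.append(sample)

def flush(max_writes):
    """Slow consumer: storage absorbs only a few writes per tick."""
    for _ in range(min(max_writes, len(buffer))):
        storage.append(buffer.popleft())

for t in range(10):        # burst: 10 samples arrive in one tick...
    ingest(t)
flush(max_writes=4)        # ...but storage only keeps up with 4 of them

print(storage)             # queries see the older data
print(list(buffer))        # recent samples are delayed, not dropped
```

Run a few more `flush` ticks and the buffer drains completely, which is the "it takes some time to fill back that data" effect from the answer above.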
That's the only drawback. Cool, yeah, I think that answers most of what I was trying to ask, but I still see that in real life you can always capacity-plan a new service to a certain extent, yet real-world events always make short work of our best-laid plans; real-world scale is always so different from what we can plan for. Still, knowing how this workload is laid out, we can at least have some plan for how we would want to handle it. Cool. With that, since I haven't received any other questions from the audience, whether from the ten or fifteen people watching on YouTube or the people here on Zoom, let's wrap this up. Vivek, I would love for you to hang back. Oh, I see Nabarun is also here; hi Nabarun, thanks for joining, I appreciate you making time on a Saturday morning. Cool, with that we'll wrap up the first talk. Vivek, please feel free to hang back, and we'll have a small BOF session with a couple of us. Last time Nabarun, I, and a couple of other folks got into a separate call at the end of the meetup, so this time we wanted to make space for that; I spoke to Hasgeek and now we have a BOF session, which will be at the end of the next talk by Piyush. So feel free to hang back, and to the audience also, feel free to hang back for that session right here, because from my personal experience the best conversations happen at the tail end of any conference, where people are just catching up with each other in the hallway. All of us have missed that for the entire duration of 2020 with physical meetups having stopped; meetups have become more like webinars, one-way communications. Most people here I know: I have worked with Ankit, I have spoken to Nabarun.
All of us have spoken and agreed that this is something we need to start incorporating in most meetups: things should become more dialogue-heavy and not monologue-heavy; those conversations need to happen. So with that, thank you so much, Vivek, for that brilliant talk and for answering most of our questions. Over and out to Piyush, who will start with his observability 101 talk; he has some amazing case studies to discuss as he explains the whole concept. So hang back and enjoy the talk. Okay, over and out to you, Piyush. Piyush, if you could briefly introduce yourself to the audience, that would be great. I would love for you to turn on your video feed, but if that's not something you want to do, that's also perfectly fine with us.

Actually, I've already turned my video on; are you able to see me? I think we were just seeing your icon. Ah, now yes, I can see you, good. Awesome. Hi, thanks for having me, and thanks, Vivek, for that insightful talk. To introduce myself: I head the engineering team at Capillary Technologies. We provide cloud-based SaaS products to large retailers and other consumer-facing businesses to engage better with their end customers. We are present in about 30 countries at this point and working at a fairly decent scale: we touch almost 650 million consumers across these countries, processing close to 15 billion line items every year and close to three to four billion transactions every year. We are based in Bangalore, with a small development center in Shanghai, China as well. And thanks again for that insightful talk; what caught my attention was that 18 dollars per day you are spending. As someone who is always looking to optimize engineering spend, that definitely caught my attention.
I'll be reaching out to you to understand it in slightly more detail. All right, thanks for joining. So the agenda for my talk is fairly straightforward; it's more of a 101, introductory talk on observability. I'll take some initial time to set the context of how the whole concept of observability came into being over the recent five to six years and how the trend has progressed, then cover the three core pillars of observability and how they tie into the related areas of monitoring and alerting. Towards the end I'll spend a couple of minutes on recent emerging trends in the observability space, beyond basic metrics, monitoring and alerting, and after that we can open up to questions and discussion with the audience. Let's just dive in. As many of you know, release to production is just the beginning of the lifecycle of any software system, of any product. There have been many studies, and it is evident from experience as well, that 40 to 90 percent of the total cost of a software system is actually incurred after launch: maintenance, keeping the uptime, keeping the performance healthy, pushing bug fixes, patching. A significant amount of cost goes beyond just the production deployment or the first release of the system. And as systems become more complex and more distributed, and, as a very famous VC rightly said, software is eating the world, software is pervading more and more parts of our lives. The consequence is that when systems fail, the impact on our day-to-day lives is also becoming more and more prominent. And as systems become complex, they will fail.
There is no doubt about it: no matter how hard you try to build scalable, high-uptime systems, they will eventually fail; you will eventually land at the tail end of your availability numbers. As Werner Vogels, Amazon's CTO, keeps saying, everything fails all the time. The number of failure points in a distributed system increases with every new component you add to it: the more parts, the more requests, components and interactions, the more chances of failure you are introducing into your system. So people say, why don't we have a very strong monitoring framework on our systems to alert us when something is going wrong, so we can just go and fix it? But the question that comes to mind is: what do you monitor? Let's do a very simple case study, something most of you would be familiar with. Take a very simple RESTful API server, serving a REST API over HTTP, probably using JSON for the request/response structure. What could the potential points to monitor be? Your API latencies: 90th percentile, average, 99th percentile. If you are running on containers, or even on VMs: your CPU usage, load average, memory, swap. If you run a JVM-based application: your heap usage. Your HTTP error codes, exception rate, failure rate. If you depend on external systems, the latency of those external APIs, which could be impacting your application latency as well. And many, many more. These are just a few high-level sample points we could potentially monitor on this one API application, and you may want to multiply that by the number of servers you are running, because for redundancy purposes you would obviously want multiple servers running the same API server code.
And if you are into disaster recovery and have a backup cluster, or say you are running a hot-cold kind of setup, you also multiply by the number of deployments, the number of clusters you potentially have. A simple back-of-the-envelope calculation will tell you that you are looking at close to 400 to 500, or maybe even 1000, separate data points you may want to monitor. And this is just for one REST service at the top of your stack; there could be multiple such services in your overall software deployment. So it becomes obvious that something does not look right: you can't have a monitoring practice in which humans watch thousands of metrics across so many servers and applications; you will go crazy, and you would have to deploy a much larger team just to watch these metrics. Let's step back and understand what monitoring is. In theoretical terms, monitoring means capturing the state of systems to determine the health of the system. A couple of commonly used methods: you have health checks running to see whether the service is actually up and, secondly, whether you can send work to it; in the container ecosystem you have the liveness probe to check whether your service is running and the readiness probe to check whether the application is ready to accept more work. And you have metrics: system, application and functional; we will do a deeper dive into these in the subsequent slides. Once you have these health checks and metrics, you can define anomalies on top of them.
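The back-of-the-envelope count from a moment ago can be reproduced with hypothetical numbers; the per-server figure and fleet sizes here are assumptions for illustration, not numbers from the talk:

```python
metrics_per_server = 25  # latencies, CPU, memory, error codes, ... (assumed)
servers = 8              # redundant copies of the one API service (assumed)
clusters = 3             # primary + DR + hot/cold standby (assumed)

data_points = metrics_per_server * servers * clusters
print(data_points)       # lands squarely in the 400-1000 range, for ONE service
```

Multiply again by the number of services in the stack and the infeasibility of eyeballing every data point follows immediately.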
For instance, if your latencies go beyond a certain threshold, which you have set as per the acceptable number for your consumers, you can define that as anomalous behavior. Alerts are usually set up on known failures: things you already know can go wrong; you define a threshold and you define alerts on top of it. It is a fairly knowledge-based process, and a reactive, post-outage process: you have an outage, you realize something was going wrong and your system was not alerting on it, so you end up adding an alert so that in future you can catch that anomalous behavior proactively. But what about the unknown failures? You can always put alerts on things you already know could potentially go wrong, but as the complexity of the system increases, as you become more distributed and add more integrations and dependencies, the number of unknown failures is also increasing. That is where observability comes into the picture very nicely: observability helps you keep a tab on these unknown failures. I started my career way back in 2007-2008 at Yahoo. At Yahoo, the dev team had zero access to production systems, and the message you see on the slide is something I lived through for years: once my code had gone to production, I did not even know at times whether it was live and running; it was handed over to the ops team, let them manage the systems. Even if it caught fire, it was a crash-and-burn situation; it was not our problem, because devs did not have access to production systems. But in the last decade or so, with the way the industry has progressed and the pace at which developers are building software, this barrier between ops and dev is getting thinner and thinner.
This is one of the very popular and remarkable posts by Charity Majors; I'm a big fan of her writing. She posted this article in 2017, and it is a highly recommended read: observability exists, it quips, because people don't like to do monitoring, so you need to package it in a new nomenclature to make it palatable and trendy. It's a lighter take on what observability is, but in some form you are essentially taking monitoring, keeping the systems up and running, and the operational aspects of your infra, closer to the developers. And Baron Schwartz, one of the well-known names in relational databases, long associated with Percona, who also created VividCortex (I'm not sure if you folks have used it; again a very nice tool to monitor and keep your databases up and running), puts it in a very clear and intuitive way: monitoring tells you whether the system is working; observability helps you understand why it is not working. That's why, as this ecosystem becomes more and more complex, observability is picking up at a similar pace. A slightly theoretical definition: observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. It might seem too fancy or too verbose, but the key items here are that you are trying to measure the internal states, and by looking at the external outputs you are trying to infer what is going on inside your application. I consider it very analogous to medical diagnosis: to diagnose what is going wrong inside a human body, doctors don't start opening up your body with surgery; they try to infer what is going on inside by looking at the external outputs.
Those outputs would be your blood pressure, your heart rate, your temperature, your pulse rate, and all the other external signals your body is sending out. Observability of software systems is just an analogue of that: you are trying to observe what is going on inside a software system by looking at the outputs your applications emit. So what could the potential internal states of an application be? They are very context-specific. For a web server, the internal states would be characterized by availability and uptime (is the server actually responding to my requests?), the incoming request rates, the response latency, the failure rates. For application services it could be your functional success and failure rates; for message queues it could be queue lengths and the number of active producers and consumers at any point in time. These are fairly context-specific states you define depending on your application architecture. And the key point here is that to measure the internal states you have to instrument your code while you are writing it; it cannot be an afterthought. That is something I have seen very new engineers doing: they finish the whole application development, and only when it is ready to go to UAT, or to move to the next stage of the development process, do they start retroactively instrumenting their code. I strongly discourage that. While writing code, think: is this a point where I want to measure the internal state of my application? If yes, instrument it. External outputs are where the three pillars of observability come in: metrics, logging and tracing. Health checks are also a form of observability, but I tend to club them under metrics.
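Instrument-as-you-write can be as simple as counting and timing at the call sites you care about. A stdlib-only sketch; the handler and metric names here are made up for illustration:

```python
import time
from collections import Counter

COUNTS = Counter()   # internal state: success/failure tallies
LATENCIES = []       # internal state: per-call durations

def place_order(order):
    """Hypothetical handler, instrumented at the moment it is written."""
    start = time.monotonic()
    try:
        # ... real business logic would go here ...
        COUNTS["orders.success"] += 1
    except Exception:
        COUNTS["orders.failure"] += 1
        raise
    finally:
        # The duration becomes an external output a metrics system can scrape.
        LATENCIES.append(time.monotonic() - start)

place_order({"id": 1})
print(COUNTS["orders.success"], len(LATENCIES))
```

In practice these tallies would be exposed through a metrics client library rather than module globals, but the habit is the same: the measurement points are decided while writing the logic, not retrofitted before UAT.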
It is a fairly loose classification, so feel free to club health checks wherever you want, but for the sake of discussion I consider them part of the metrics chapter. Let's spend a bit of time on each of the three pillars. What are metrics? Metrics are essentially the external state measured at a broad scope, and by broad scope I mean the time dimension: you will always look at metrics with time as one of the axes. For instance, what was the performance of my APIs between 8 a.m. and 9 a.m. IST? What was the incoming request rate in a particular slice or bucket of time? You always have a time axis available when you are talking about metrics. There are system metrics: your CPU utilization, memory, swap, network and other system parameters. There are application metrics: your success rates, failure rates; again, these are very application-specific and subjective, dependent on your application architecture and design. And last but not least, business and functional metrics, which are more relevant to your business folks and your product managers: your order rates; in the case of payments, the number of successful payments, reversed payments, reconciliation requests; in a couponing system, the number of coupons you are issuing and the number being redeemed successfully. It is good to classify metrics into these three pockets: it helps you design the metrics better, and to build dashboards aligned to the consumers and stakeholders of those metrics. So what should the idea be for deciding what to emit and what not to emit?
I follow a blanket rule of thumb: be as generous as you want when emitting your metrics. If you want some common guidelines on how to design your metrics, I highly recommend the article "Metrics That Matter", published by ACM; it is a very nice framework for deciding whether something is a metric you want to track. So be generous on metrics, but be judicious on alerts, because, going back to the point from that simple case study, you don't want to alert on 1000 different data points; your NOC team or your on-calls will go crazy with the volume of alerts you would be generating. Another common mistake I have seen people make is stuffing a lot of dimensions onto their metrics. It is highly advised that your metrics have low cardinality on the metadata you attach to them, so avoid attaching user-specific IDs, order IDs or entity IDs to a metric. Metrics will always be providing system summaries in some form, and they help you answer questions like: how many transactions failed, how many logins succeeded, how many orders are being processed, how many payments are being processed. They answer "how many" kinds of questions; that is another simple way to decide whether something should be a metric on your system. Common tools available to capture metrics: VictoriaMetrics, which the previous talk did a deep dive on; InfluxDB, which is fairly popular; TimescaleDB, one of the recent additions of the last two to three years; and Graphite, one of the old granddaddies of time-series databases, although I'm personally not a fan of Graphite: its data model is very flat and not very intuitive to people who come from a SQL-rich background.
Then OpenTSDB, again a fairly old time-series database that runs on top of HBase. Scuba is a very nice architecture put forward by Facebook; there is a very nice paper, highly recommended if you want to go deep into the algorithms that go into optimizing the read and write paths of a metric store. Apache Druid, which people use for metric storage, though it is largely an event storage and analytics system. And Grafana for visualization. One common pattern you will see across these metric stores is that they all use log-structured merge trees, because metric stores by nature are supposed to be extremely scalable on writes as compared to reads: you are writing at a much faster rate than you are reading. The earlier talk mentioned touching close to 900,000 metrics per second, so obviously the read rates will be of a much smaller scale than the write rates. Coming to health checks: as I said, is my service running (in the container world, liveness probes), and can I send work to it (is my application ready to respond: readiness probes)? What are the different ways to collect health checks? In a lot of peer-to-peer distributed databases you will have heard of gossip protocols, very popular in Cassandra and Riak: whenever a node enters the cluster, it broadcasts its availability and health over gossip. Service registries: when a service comes up and is available to serve traffic, it goes and registers itself; when the service is going through a downtime, it deregisters itself from the service registry, and that is how it propagates its health to the rest of the services in the system.
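The liveness-versus-readiness distinction above fits in a few lines. A minimal sketch; the dependency flags are hypothetical stand-ins for real checks like a database ping:

```python
def liveness() -> bool:
    """Is the process up at all? If this handler can even run, yes."""
    return True

def readiness(db_ok: bool, queue_ok: bool) -> bool:
    """Can we accept MORE work? Only if our dependencies are reachable."""
    return db_ok and queue_ok

# A live pod whose database is down should be taken out of rotation
# (readiness fails) without being restarted (liveness still passes).
print(liveness(), readiness(db_ok=True, queue_ok=False))
```

That asymmetry is the whole point: a failing readiness probe stops traffic, a failing liveness probe triggers a restart.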
And the older, very reliable mechanisms as well: doing health checks via your ELBs and at your proxies like NGINX; again, very common and in use for many, many years. Coming to the second pillar, logging. The way to understand logging is that it helps you deep-dive at a much smaller scope, not on the time dimension but on an entity dimension: a request, a customer or a transaction. Logging helps you answer the "why" kind of question: why couldn't a customer place an order, why did a transaction fail, why did a checkout on an e-commerce application fail, why couldn't I add a product. The scope is much smaller: you are not aggregating across a large number of requests, you are dissecting a much smaller set of them. Logging has to be centralized; here come the log collection and aggregation technologies: Fluentd, Logstash and tons of other agents are available out there, with Fluentd being very popular in the cloud-native ecosystem. Logs have to be searchable; here come your indexing technologies, Elasticsearch and whatnot. In fact, index-free logging is also becoming quite popular these days: Loki, the last tool you see listed, is picking up fairly fast; it has come under the Grafana ecosystem, and they are doing a really kick-ass job of promoting the index-free approach. And logs have to be correlatable by a common key: even in a fairly large distributed system where your request touches multiple services, you have to tie the ends together and arrive at a common trace of the request, so a request ID is a very commonly used parameter; a lot of commonly available load balancers let you inject a UUID as the request ID as soon as a request enters your infrastructure. Standard tools these days: the ELK stack, paid technologies like Splunk and Sumo Logic, and Loki, as I mentioned; the new trend of index-free logging is picking up.
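A well-structured, correlatable log line of the kind described, one JSON object per event carrying a request ID injected at the edge, might look like this. The field names are assumptions for illustration, not a standard:

```python
import datetime
import json
import uuid

def log_line(level, service, message, request_id, **extra):
    """Render one structured log event as a single JSON line."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "request_id": request_id,  # same ID in every service the request touches
        "message": message,
        **extra,
    }
    return json.dumps(record)

rid = str(uuid.uuid4())  # normally injected by the load balancer at the edge
line = log_line("ERROR", "checkout", "payment gateway timeout", rid, order_id="o-42")
print(line)
```

Because every service logs the same `request_id`, a single search for that ID in the central store reconstructs the request's whole path.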
The anatomy of a log is very simple: timestamp, level, service, commit ID, build version number, region, customer ID, trace ID and so on. Having well-structured logs helps you get the maximum value out of your logging system from day one, and many popular libraries let you define your appenders and your logging format; it is a very common practice across languages and libraries. The third pillar: tracing. If you are working in a software stack with multiple services, where each service is responsible for only a small component of your request processing, you may want to visualize and trace what happened in each part of that processing; that is where tracing enters the observability space. It became popular after Google published its Dapper paper, almost 10 to 11 years ago. Twitter also came up with an open tracing system called Zipkin; then the community of developers got together and published the OpenTracing standard, and Jaeger, which again is very popular. And with a lot of application performance monitoring tools like New Relic, Datadog or AppDynamics, tracing comes out of the box. It is a very useful tool for zooming in on which particular component of your request processing is slowing down and how much time it is taking, and that is where the whole concept of spans and traces comes into the picture. If you look at the observability spectrum: you have health checks and some form of metrics that allow you to catch known unknowns, and then comes the debugging and exploration space, where the other part of metrics helps you query and understand what went wrong; you are trying to discover the unknown unknowns. Tracing and logging mainly help you in the debugging and exploration part.
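The spans-and-traces idea reduces to: every unit of work records its duration and its parent, all under one trace ID. A minimal sketch of that structure, not the OpenTracing or Jaeger API:

```python
import time
import uuid

spans = []  # collected (trace_id, name, parent, duration) tuples

class Span:
    def __init__(self, name, trace_id, parent=None):
        self.name, self.trace_id, self.parent = name, trace_id, parent

    def __enter__(self):
        self.start = time.monotonic()
        return self

    def __exit__(self, *exc):
        spans.append((self.trace_id, self.name, self.parent,
                      time.monotonic() - self.start))

trace_id = str(uuid.uuid4())
with Span("checkout", trace_id) as root:
    with Span("charge-card", trace_id, parent=root.name):
        time.sleep(0.01)  # the slow component shows up as the long child span

print(spans)
```

A tracing UI simply draws these tuples as nested bars on a timeline, which is why the slow component is visible at a glance.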
Your health checks and some form of basic health metrics help you with monitoring and resiliency. So we have discussed metrics, we have discussed logs, we have discussed tracing, but coming back to the problem we started with: what do you want to alert on? Here come service level objectives, one of the terms popularized a lot by Google's SRE teams. Folks, bear with me here: if the audience is fairly mature this might seem repetitive, but for the newer folks the SLO is a very nice concept, something I promote very heavily in my teams. An SLO is nothing but a very simple, quantifiable and measurable goal for a service, and that goal should be linked to the user experience, to a delight factor. As engineers we take a lot of pride in building scalable systems, but at the end of the day we have to remember there is a user trying to derive value from the system we have built. SLOs help you tie the technical health of your systems to the user experience and the delight factor. SLOs should be something you define before you start writing code: work backwards, define an SLO first, then start writing the code of your service. And have as few SLOs as possible, ones that are representative of your system behavior. You can get deeper insight into SLOs by reading through this article or watching this talk on YouTube. Let's do a simple exercise on defining SLOs. I'll open it up for comments from the audience at this point. Assume you are responsible for writing the cart service, the service that manages the cart on an e-commerce portal: what could be the potential SLOs you would define for such a service?
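The "quantifiable and measurable" part becomes concrete with an error budget: an availability SLO over a fixed window leaves a fixed budget of bad minutes. The 99.9% target here is an example for illustration, not a recommendation for any particular service:

```python
slo_target = 0.999             # e.g. 99.9% of the window the cart service is healthy
window_minutes = 30 * 24 * 60  # a 30-day rolling window

error_budget = (1 - slo_target) * window_minutes
print(round(error_budget, 1))  # ~43.2 minutes of allowed downtime per month
```

The budget is what makes the SLO actionable: alert when you are burning it faster than the window allows, and stop shipping risky changes when it is exhausted.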
that you would define for such a service? You can pour in your answers in the comments, or, if somebody wants to speak, they can unmute. What do you prefer?

Yeah, so as I said earlier in the previous talk as well, we have around 10 people watching on the YouTube livestream, and more people here who joined for the second talk. Thank you for such an insightful talk. While this was more of a 101, as you correctly said, you sometimes need to start from scratch: there are always people coming into the practice new, and they have to learn the first-principles way. That's something we keep talking about: just using fancy tools is not enough; you need to understand how to use that data the right way to arrive at your decisions. So with that in mind, to finish, I'll just take a few minutes and...

Sorry, I'm so sorry, I thought you were opening it up for questions. I'd like to make this a collaborative thing rather than a monologue, and let the audience come in.

Awesome, sure, we can do that format too. So folks, feel free to have a conversation with Piyush and interact with him: unmute yourself as he directs. Please go ahead, Piyush.

Yeah, so folks, let's say you are writing or managing a service which does one of these three functionalities. It could be a cart service; it could be an authentication and authorization service, say you have built something like Auth0; or maybe you have built a communication gateway sending out SMSes, emails, push notifications. So what would be
the potential SLOs you may want to design for these kinds of applications? Open for thoughts, folks: feel free to send comments, or unmute and speak.

Let me pitch in while folks are still deciding. I'll go for the second one, authentication. One important SLO there would be: out of some number of authentication requests, how many are getting through successfully, and what is the latency? Those are the base metrics we need. And how many failures are happening: they could happen due to wrong logic, due to unavailability of the service, or due to an edge case in the authorization logic. More often than not, authentication is simple; authorization is where it becomes much more complicated to plan out the whole workflow and execute. How we navigate those edge cases should feed the SLOs for those two services.

Right, absolutely, that makes a lot of sense. The idea is that SLOs are representative of what your service does. Going back to that earlier example: a CPU spike can cause high latency, which can cause a failure in my authentication operation. There could be a thousand reasons that cause the behavior to deviate from the expected norm, but keeping a tight watch on the expected behavior helps you capture multiple failure modes in a single metric. So having alerts on the SLOs and SLIs makes a lot more sense for a complex application; it makes the job a lot easier. You can always use the deeper-level metrics for analysis and troubleshooting when the SLO is not being met.
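To make the SLO arithmetic from this exchange concrete, here is a rough sketch of availability and latency SLIs and the error-budget idea. The 99.9% target, the 300 ms threshold, and the request counts are invented for illustration; they are not Razorpay's or anyone's real numbers.

```python
# Toy SLI/SLO calculation for an authentication-style service.
# All targets and numbers below are illustrative, not real SLOs.
from dataclasses import dataclass

@dataclass
class Request:
    success: bool
    latency_ms: float

def availability_sli(requests):
    """Fraction of requests that succeeded."""
    return sum(r.success for r in requests) / len(requests)

def latency_sli(requests, threshold_ms=300.0):
    """Fraction of requests answered within the latency threshold."""
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)

def error_budget_remaining(sli, slo=0.999):
    """Share of the error budget still unspent (1.0 = untouched, < 0 = blown)."""
    budget = 1.0 - slo   # allowed failure rate
    burned = 1.0 - sli   # observed failure rate
    return 1.0 - burned / budget

# Example window: 1000 requests, 2 failures.
reqs = [Request(True, 120.0)] * 998 + [Request(False, 900.0)] * 2
print(availability_sli(reqs))                           # 0.998
print(error_budget_remaining(availability_sli(reqs)))   # roughly -1.0
```

With a 99.9% objective, two failures in a thousand requests means the observed failure rate is twice the allowed one, so the budget is overspent; alerting on that burn is exactly the "alert on the SLO, not on CPU spikes" idea from the talk.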
Thinking SLO-first helps a lot in designing your architecture, your alerting, and your monitoring in a much simpler and more essential way.

Yeah, it should be business-metric driven, right? SLOs are mostly business metrics, and then you drill down to the deeper system-level stuff.

Right. Similarly, for a cart service the SLOs would be things like the number of products successfully added to the cart, the number of quantity changes I can make on a cart, the number of successful checkouts. Those are the SLOs around which I should design my metrics and, essentially, my alerts. And for a communication system it could be the number of successful SMSes going out, the number of incoming requests, the number of messages successfully submitted to FCM or maybe SendGrid, rather than worrying about lower-level nuances like email payload sizes.

And lastly, why should we spend time on observability at all? Obviously there are the technical and operational aspects: you can build scalable systems, you can do better capacity and load planning, and it leads to a lower mean time to repair. But beyond that, having run large teams for a very long time, one thing I've realized is that having observability as a first-class citizen in the team creates a much more data-driven culture. The conversations get driven around questions like: what metrics have you added? Do we have sufficient alerts on the SLOs? The whole culture becomes more data-driven; your teams talk in terms of data rather than subjective terms like "is your service healthy?" or "can your service handle my scale?". Your outcomes become measurable and you remove subjectivity,
in the sense that, say, I have a bunch of architects in my group and I give them quarterly goals: here is a service operating at 700 milliseconds latency at 10,000 RPM; your objective for the next quarter is to ensure that at 10,000 RPM the latency comes down to 500 ms. They have a clear, measurable goal with no subjectivity. Someone cannot come back to me and say "I changed the whole data structure, I changed the whole algorithm, I optimized the code": if the KPI has not moved, something has gone wrong; either the hypothesis was wrong or the implementation was wrong.

It also brings a lot more accountability across teams. I have had so many product managers come to me saying their experiments are not working because the implementation is unstable, it is buggy. The way I solve it is that when the tech team rolls out a feature, I have the product managers give me three KPIs which define that the implementation is successful. If those KPIs are met, the next analysis is entirely on the product hypothesis: was the hypothesis correct, was the data analyzed correctly? The tech team has delivered the feature in a stable way, so the product team can go and conduct their experiments successfully. It brings a lot more accountability because you're talking in data; there's no subjectivity or emotion involved, just pure data-driven conversations.

Also, given how deep in the stack DevOps and observability are, it's very easy to lose track of the fact that you are actually doing all of this for a product. It's not about a single server; it's about what impact it's having. So having SLOs helps us prioritize the product thinking.

For me, as an engineering leader and a manager, the accountability aspect is something observability helps with a lot. The engineering groups commit to
agreeing on SLA or SLO numbers with each other. Let's say you're a consumer of the authentication service: I can clearly go and tell that team, "I foresee traffic of at least 5,000 RPM coming in the next quarter; can you guarantee me a 99th-percentile SLA of 50 milliseconds, so that I can commit a 100-millisecond SLA to the consumers of my service?" This level of accountability becomes much easier when observability is part of your culture and part of your implementation.

Talking about standards: given the plethora of metrics tools and metrics-collection technologies coming into the picture, the developer community came together and started defining standards that technologies can abide by. OpenMetrics is one such standard, which came about three to four years ago, standardizing the structure of metrics and how you emit them so that you can replace the metrics backend seamlessly. OpenTelemetry is a more recent standard that is getting a lot of adoption; it's basically a combination of the OpenCensus and OpenTracing projects. They have defined very clear APIs and SDKs to cover the three pillars we discussed, with context propagation as a broad, generic term on top: adding more logging and annotations to tie signals together. The OpenTelemetry APIs and SDKs are open source and vendor-neutral, so if tomorrow you want to replace your metrics backend, say move away from proprietary vendors like New Relic or Datadog to open source like Prometheus, or from Prometheus to something else, an OpenTelemetry-based implementation will allow you to do that seamlessly. It also avoids vendor lock-in: you're not locked into a vendor and you don't have to commit to long-term usage of a particular one.
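As an aside on the vendor-neutrality point: the sketch below is a toy interface, not the actual OpenTelemetry API, meant only to show why coding against a neutral abstraction makes the backend swappable. The exporter classes and the metric name are hypothetical.

```python
# Toy illustration of the vendor-neutrality principle behind
# OpenTelemetry: application code depends on an abstract exporter,
# so the backend can be swapped without touching the application.
# This is NOT the real OTel SDK API; names here are made up.
from typing import Protocol

class MetricExporter(Protocol):
    def export(self, name: str, value: float) -> str: ...

class PrometheusStyleExporter:
    def export(self, name: str, value: float) -> str:
        # Prometheus-like text exposition: "metric_name value"
        return f"{name} {value}"

class StdoutExporter:
    def export(self, name: str, value: float) -> str:
        return f"[stdout] {name}={value}"

def record_latency(exporter: MetricExporter, ms: float) -> str:
    # The service only knows the interface, never the vendor.
    return exporter.export("auth_request_latency_ms", ms)

print(record_latency(PrometheusStyleExporter(), 42.0))
print(record_latency(StdoutExporter(), 42.0))
```

Swapping `PrometheusStyleExporter` for `StdoutExporter` (or any vendor's exporter) changes nothing in `record_latency`; that is the lock-in avoidance the talk describes.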
One of the recent trends in observability, beyond application metrics and monitoring, is data observability, something I have started looking into in the last few months. With the advent of large data systems, and companies like Snowflake making data analysis and analytics at very large scale seamless and commoditized, data observability has become a recent trend. The idea is that since systems are becoming more and more integrated and you're collecting a lot more data from multiple sources, the quality of the data, the correctness of the data, its availability and freshness have become a lot more important, because your data teams and business teams need reliable data: they have to be sure the numbers they are looking at are built on accurate data. A lot of startups have sprung up in this space in the last 12 to 15 months: Acceldata is an Indian company, and there are Monte Carlo and Soda; they're building some really nice platforms for observing your data very closely.

The five pillars of data observability are freshness, volume, schema, distribution, and lineage; there are some reference articles you may want to read up on. Freshness simply asks: is my data up to date, or is it two or three days old? Are the inferences and analyses I'm looking at recent, or a stale state? Volume could be: I expect to ingest at least 500 GB of data daily, and suddenly we see a dip of 200 GB. What happened: is the volume reduction organic, or did something break upstream? Schema matters because, since I'm connecting multiple systems these days, a slight change in a schema or a JSON format at one endpoint could cause my whole data pipeline to crash.
Ensuring the integrity of the schema across the different integration touch points is essential. Distribution means, again, how you're tracking the data pipeline: you have data coming in from multiple sources that relay and exchange data over different APIs and through different steps in the pipeline, and you want to ensure that the whole exchange along the pipeline is healthy. Lineage, again, is very critical for enterprise systems like what Capillary has: users' data changes over time, a user updates their profile over time; how do you draw the lineage graph of that data and arrive at the current state of a particular entity? These kinds of questions can be answered by a data observability platform. But again, it's a fairly new trend, I would say not more than a couple of years old, so there is still a lot more innovation to happen in this area; with the growth of data systems globally, though, I think this is going to be a very closely watched trend in the coming years.

That's all, folks: happy observing. I strongly recommend reading the references, even if you are a mature observability practitioner; I circle back to these articles every six to eight months, to be honest. I'll open it up for Q&A, and if you want to reach out to me, that's my Twitter handle; feel free to drop me a message. Any questions? More than happy to take them.

So, can I close this with something very unorthodox? That last bit about data observability, I'm pretty sure everyone would agree, should be its own talk. Can I implore you to do a talk for us specifically on this in the meetups going ahead? I would personally love to hear a lot more about it, because most of the talks we have had in this meetup so far have been geared towards understanding systems data.
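Circling back to the pillars described above, here is a toy sketch of what freshness, volume, and schema checks can look like. The 24-hour window, the 500 GB daily expectation, and the field names are all invented for illustration; real platforms like the ones mentioned do far more than this.

```python
# Toy freshness / volume / schema checks in the spirit of the
# data observability pillars; thresholds and fields are made up.
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at, max_age_hours=24):
    """Is the newest data recent enough, or a stale state?"""
    age = datetime.now(timezone.utc) - last_loaded_at
    return age <= timedelta(hours=max_age_hours)

def check_volume(todays_gb, expected_gb=500, tolerance=0.2):
    """Flag a dip or spike beyond +/- tolerance of the expected daily volume."""
    return abs(todays_gb - expected_gb) <= expected_gb * tolerance

def check_schema(record, required_fields=("user_id", "amount", "created_at")):
    """Did an upstream endpoint silently drop or rename a field?"""
    return all(f in record for f in required_fields)

fresh = check_freshness(datetime.now(timezone.utc) - timedelta(hours=2))
volume_ok = check_volume(300)   # a 200 GB dip on an expected 500 GB day
schema_ok = check_schema({"user_id": 1, "amount": 9.5, "created_at": "2023-03-18"})
print(fresh, volume_ok, schema_ok)   # True False True
```

The volume check fires on exactly the "sudden 200 GB dip" scenario from the talk, catching the problem at the data itself rather than from downstream symptoms.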
That is, systems-state observability; here we are talking about data-state observability: data drifts, schema drifts, detecting bugs by observing the nature of the data itself, not just the impact of changed data. Normally we see the impact of changed data and detect problems from secondary signals, but you are talking about detecting them from the drift itself, which is much closer to the surface and to the actual root cause. I think that is brilliant, so I would really love for you to do a full talk for us just on data observability going ahead. I'm pretty sure everyone who sees this talk, and everyone looking at this slide right now, would appreciate that to a great extent.

And sorry, again, for having interrupted you in the middle of the talk. It was a great talk, and thank you for taking time out of your busy weekend to present it to us. Even though the first part was a 101, a lot of folks here are just getting into observability and starting to understand the ecosystem, and talks like this are what got me started way back. I'm pretty sure they will provide the impetus for folks coming into the domain to start thinking in these terms: systems thinking, state-based thinking, observability and control-systems-based thinking. These are very important for the domain.

So yeah, with that, I'll wait for... okay, Shivam on YouTube is saying "great talk": thank you, Shivam, for taking the time to be with us on a Saturday morning; we appreciate you being here. Anyone else, if you have a question for Piyush, he's going to try and answer it: feel free to unmute yourself and ask the question to
him directly, right away. The videos of both talks will be properly edited by the Hasgeek team, who are helping us; thanks to the Hasgeek team for helping organize this whole thing on their platform. They have been a great help to this meetup since we transferred here from the other platform, and things are much smoother now. The talks will be uploaded separately to our YouTube channel, the speakers will share their slide decks or other files, and those will be shared with the audience. You can also ask questions later on: if you don't have a question right now and something comes up when you look back at this talk a couple of days later, or if you're someone watching this at a later point in time and not live, feel free to go to the Hasgeek talk page. There is a conversation section there; just ask your question, and I'll make sure it gets communicated to either of the speakers and try to get them to answer on the page directly, so your query gets resolved. You can ask any questions on the platform itself. So yeah, with that, if no more questions are incoming, we will close the official meetup. We still have some space for normal conversation and banter; I know that was one of the asks from Nabarun last time.