Okay, great. Thank you very much to Diptanu for joining this discussion, and thank you to everyone who is participating today. We know it's May Day, it's Labour Day, but let's have some fun; hopefully this will be a fun conversation. It's customary to mention that Diptanu is joining from a US time zone, where it's 11pm at night, so we hope this will be an entertaining evening for him too.

Let me quickly introduce what has happened and why we are all gathered here. Rootconf has, over the years, been a platform for SRE and for DevOps. We have engaged in what I can confidently say are very candid conversations on how you manage infrastructure, how you look at questions of running teams, how infrastructure is tied to business, and, that being the case, how we look at tech practice and how we look at teams. You can learn more about Rootconf by going to hasgeek.com/rootconf and checking out videos from past editions. I highly recommend that after this session you go and watch Diptanu's talk from Rootconf 2015; I think that's still one of the super-hit talks we've had over the years.

Over the period that Rootconf has been running, since 2012, we have also formed a sort of peer group of people who have been speaking and reviewing talks, and among the reviewers present today are Rishu and Kalyan Sundar. Thank you very much, folks, for joining in. This event is hosted on hasgeek.com, a platform to bring geeks from all genres together to have conversations like these, to share experiences and, more importantly, to learn from our peers. Hopefully some of this will result in us forming peer groups where we can have banter and chatter about SRE. So do keep visiting hasgeek.com for updates on events like these and on people like Rishu and Kalyan.
Having said that, I'll quickly introduce the format of today's event. We'll start with Diptanu making a short presentation on the current state of affairs and his perspective on it. Rishu and Kalyan will then respond for about 10 minutes, with Diptanu chipping in as well, and then we open this up for questions. So feel free to post your questions in the Q&A tab here and we will take them up during the question-and-answer session. For those of you watching on YouTube Live, we are moderating and watching questions there too; if you post your questions there, we will pick them up and bring them over here for the AMA session. With that, I'll hand over to Diptanu. Diptanu, please feel free to introduce yourself and take it from here.

Thanks, Zainab. Hey, everyone. So I'm Diptanu. I work at Facebook; at this point I've worked on many different things, but primarily on large-scale distributed systems and network systems, and these days I'm working on AI infrastructure, specifically deep learning systems for speech and audio. The last time I gave a talk with HasGeek was in 2016, and since then a lot has changed in the distributed systems and systems engineering space. So when Zainab talked about this event, I thought about doing a talk not just on SRE but on trends in distributed systems and trends in making systems reliable. I'm going to do a short presentation, and everyone on this call, please talk to me; let's make this as Q&A-focused and as interactive as we can.

All right. When we talk about reliable distributed systems, or distributed systems in general, I like to begin with the metrics of success.
In my opinion, the metrics of success are availability, latency and reliability. Availability, because we expect our systems to always be live; whenever we take our phones out, we want those systems to be up and running. We want these systems to be as quick, as low-latency, as possible, because perceived latency correlates with user experience. And lastly, reliability influences the bottom line of the businesses for which we build these systems. If you're building an e-commerce service and people can't add to the shopping cart, people are not spending money on your website. If you're building a chat application and people can't send text messages, they are probably not going to stay engaged or keep using the platform. So when it comes to figuring out how well we are doing as an organization or as a discipline, these are the things that come to my mind.

As for the tools at our disposal today, the top ones that come to my mind when I think about starting a new project or developing a new system: first of all, public clouds, and not just a public cloud in the market where we are operating, but across all the markets in the globe where our systems or services are going to be available. Then content delivery networks. Until maybe five or seven years back there were very few content delivery networks, and the idea of using a CDN to distribute digital assets like photos, music or videos was not very prevalent; because there were only a few CDNs, they were very pricey, and hence people were not thinking about such things. But with public clouds, content delivery networks are becoming democratized, and they are also becoming cheaper.
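The three success metrics listed above, availability, latency and reliability, are usually tracked as concrete numbers. A minimal sketch of how availability and tail latency might be computed from request records; the record shape here, `(status, latency_ms)` tuples, is invented for illustration:

```python
# Compute availability and p99 latency from request records.
# Each record is a made-up (http_status, latency_ms) tuple.

def availability(requests):
    """Fraction of requests that succeeded (non-5xx)."""
    if not requests:
        return 1.0
    ok = sum(1 for status, _ in requests if status < 500)
    return ok / len(requests)

def p99_latency(requests):
    """99th-percentile latency in ms, nearest-rank method."""
    latencies = sorted(ms for _, ms in requests)
    if not latencies:
        return 0.0
    # nearest rank: index = ceil(0.99 * n) - 1
    idx = max(0, -(-99 * len(latencies) // 100) - 1)
    return latencies[idx]

reqs = [(200, 12), (200, 15), (503, 900), (200, 14)] * 25  # 100 requests
print(availability(reqs))   # 0.75
print(p99_latency(reqs))    # 900
```

A real system would compute these over sliding windows and per region, but the definitions are the same.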
With that I am seeing a lot more people, a lot more applications, using these CDNs, which are very close to users.

Then I think about structured logging. Metrics were there even 10 or 15 years back; if you followed the blogs of Etsy back in the day, you would hear metrics and monitoring humming along, and I think metrics and monitoring grew together with the DevOps movement, circa 2008-2009, when people started talking about measuring everything that moves. But now the trend is moving from just measuring everything that moves to traces and structured logging. Systems like Honeycomb have made structured logging very popular, and tracing has been made popular by systems like Zipkin. A decade back we were still not talking about microservices as much as we do today, and as people adopted microservices, they realized there was a need to understand how each of these systems fans out to the others when an end-user request comes in; that made tracing popular. So today tracing is fairly democratized, and people who need tracing can have it. Systems like Envoy and Istio all support tracing, and the good part is that when the tools you use to move traffic in your data center support tracing, tracing comes for free.

Another nice thing that happened to us is eBPF, and I'm personally super excited about it. Operating systems were not as visible to systems engineers as they are today because of eBPF. Back in the day, even a few years back, I would look into sysfs or procfs to see what the kernel was telling me about the health of the system, but now I can load an eBPF program dynamically into a running kernel and probe the kernel's data structures.
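The structured-logging idea mentioned above, events as key/value fields rather than opaque strings, needs nothing more than the standard library. A minimal sketch; the logger name and field names are invented, and a real system would more likely use a library such as structlog or a vendor SDK:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "ts": time.time(),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # Merge structured fields passed via the `extra` argument.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Each event carries queryable fields, not a free-form sentence.
logger.info("cart_add", extra={"fields": {"user_id": 42, "latency_ms": 8}})
```

The payoff is that a log pipeline can later filter or aggregate on `user_id` or `latency_ms` directly, which is what makes tools like Honeycomb useful.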
I can find things like which application is making a syscall, or how the network interface is behaving; or, if someone is sending me a packet, I can write a custom eBPF program that short-circuits the chain a packet normally traverses from the network device to the application. Last year I wrote a paper on how I am personally using eBPF to build systems, and it was published in USENIX ;login:, so take a look if you're interested. I generally think eBPF is really exciting; it has changed the whole landscape of instrumentation of operating systems.

Cluster schedulers I don't need to talk much about; everyone uses Kubernetes these days. But even there, there is a change in the pattern of how people use cluster schedulers. Six or seven years back, when we were developing cluster schedulers, people would tell us they were rewriting their applications so they could run on cluster schedulers. Now we are going in the other direction a little bit. There were certain systems that were hard to run on cluster schedulers, like databases or other persistent systems, but now cluster schedulers are everywhere: things like Kubernetes operators, or at Facebook, Tupperware's task controllers, allow you to codify running things like a database or a system like Kafka on cluster schedulers, which was very hard to do even five or six years back.

And lastly, a lot of systems need data, and they need it in a form that is useful for training deep learning models. There are workflow systems now, like Dagoba from Netflix, and Lyft recently announced a workflow system called Flyte. With systems like that, you can easily write reliable data pipelines that take the data you already have, extract features out of it, and then let your training algorithms ingest that data.
As AI and machine learning become more and more prevalent in products exposed as web services, we need reliable data pipelines as well, because most of these machine learning or deep learning models actually improve how well our online systems are doing.

So let's talk about the availability and reliability patterns I'm seeing. As I said, public cloud is available in pretty much every major market, so when we develop a new service, we obviously think about how to make it available in every market, and one quick way to do that, provided there are enough resources, is to run the service in every market. If there is a tsunami in market A, people in market B can still access the services we provide. And not just that: with more regulation, and as laws about data change, it is very useful to have a global presence in that manner, because you can have policies that are specific to a given market.

With services in every region, there is a need to load-balance requests from end users to those services, and that's where global load balancing comes into the picture. Say I'm running services in Europe and the US, and for some reason the deployment in Europe goes bad; in that case, having a global load balancing system move all the requests to the US saves the day. Again, this used to be harder back in the day, but things like Route 53 now make it pretty easy to move traffic from one region to another.

Third is having Spanner-like distributed persistence. Until a couple of years back, when we talked about availability in our data systems, we would talk about systems like Cassandra, which replicate between data centers or between regions, and people thought that systems like MySQL or Postgres, relational databases without built-in support for reliable replication across the WAN, were going to fade away.
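The failover behaviour described above, Europe goes bad so traffic moves to the US, is at heart a health-checked choice among regions. A toy sketch; the region names and the health map are invented, and a real global load balancer would derive health from probes and error rates:

```python
# Toy global load balancer: route to the first healthy region
# in the user's preference order.

def pick_region(preference, healthy):
    """Return the first region in `preference` that is healthy.

    `healthy` maps region -> bool, as a real system might derive
    from health-check probes or per-region error-rate metrics.
    """
    for region in preference:
        if healthy.get(region, False):
            return region
    raise RuntimeError("no healthy region available")

# A European user prefers eu-west, falling back to us-east.
prefs = ["eu-west", "us-east"]
print(pick_region(prefs, {"eu-west": True, "us-east": True}))   # eu-west
print(pick_region(prefs, {"eu-west": False, "us-east": True}))  # us-east
```

Systems like Route 53 express the same idea as DNS routing policies with attached health checks.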
But those systems are also more user-friendly, more programmer-friendly. Now, with Spanner-like data systems, we are in a good spot where we get reliable data replication between regions, but also the programmer-friendliness of ACID semantics, relational semantics, in our databases. So for more use cases we can have a distributed data store without giving up the things we liked in non-distributed data stores.

Almost all of us have seen outages on major public clouds where a single configuration change brought down the entire system, and for that we need more phased rollout of code. When we roll out code or configuration, we don't want all our systems running in different regions to be impacted at the same time. We would like to see whether the system is behaving well in a given cluster, then in a given data center, then in a region, and so on: start deploying code and configuration in the smallest blast radius we have, and then gradually increase the blast radius.

And lastly, I like to think about federated control planes changing the game. Control planes are pretty difficult to build reliably; if you have worked on systems that use Paxos or Raft, and understand the challenges of maintaining them, then you see how far we have come from circa 2013 to 2020. Back when I was working on Mesos in 2013, we used to suggest running all your Mesos masters, the control plane of the cluster scheduler, in one data center, and distributing the agents across the other data centers if you had more than one. From there we have come to a place where we are running Kubernetes control planes in every single region, and users don't even know which control plane their request to launch a job or a container is going to.
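The blast-radius idea described above, deploy to one cluster, then a data center, then a region, can be sketched as a loop that widens scope only while health checks pass. The stage names, `deploy` and `healthy` callbacks are invented stand-ins for real deployment tooling:

```python
# Sketch of a phased rollout: widen the blast radius stage by
# stage, and stop the moment a health check fails.

STAGES = ["canary-cluster", "one-datacenter", "one-region", "global"]

def rollout(version, deploy, healthy):
    """Deploy `version` stage by stage.

    `deploy(stage, version)` pushes the change to one scope;
    `healthy(stage)` is a post-deploy check (error rates, latency).
    Returns the list of stages that passed their health check.
    """
    done = []
    for stage in STAGES:
        deploy(stage, version)
        if not healthy(stage):
            # Stop before widening the blast radius any further;
            # a real system would also trigger a rollback here.
            break
        done.append(stage)
    return done

# Example: the health check starts failing at the region stage.
log = []
completed = rollout(
    "v2",
    deploy=lambda stage, v: log.append((stage, v)),
    healthy=lambda stage: stage != "one-region",
)
print(completed)  # ['canary-cluster', 'one-datacenter']
```

Real pipelines add bake time between stages so slow-burning regressions surface before the radius grows.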
So we have come a long way from that perspective; with federated control planes, we can still talk to any of the control planes and control all the systems we have in other regions and data centers.

In terms of reliability, the trend I see as most prevalent is smart client-side load balancing. Even until a couple of years back, we talked about load balancing in terms of nginx and HAProxy and other fairly traditional systems: someone would write a policy for load balancing, and the load balancer would act on that policy. But now we are seeing systems where the clients sending the RPCs have become smarter. Some years back I read a paper called The Tail at Scale, which was all about smart load balancing, about reducing 99th-percentile latency. If you have a generic load balancer like nginx, reducing 99th-percentile latency becomes harder, because you don't control the semantics of an RPC at the client side. With client-side load balancing, you can do things like parallel requests. If you have two services running Java or Golang or any of those managed runtimes, and one of them decides to do garbage collection, you see a jump in 99th-percentile latency. But with hedged requests, you send out many requests at once, and whichever server responds fastest, you take that response and show it to the user; with that, 99th-percentile latency drops a lot. Those kinds of things were not widely available until recently, as things like Istio became popular, but now we have the tools to do them.
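The hedged-request trick described above, send the same request to several replicas and take the first answer, can be sketched with a thread pool. The replica functions here are invented stand-ins for real RPCs:

```python
import concurrent.futures
import time

def hedged_request(replicas, request):
    """Send `request` to every replica at once and return the
    first response that comes back, cancelling the rest.

    In production you would typically send the hedge only after
    a short delay (e.g. the p95 latency) to avoid doubling load.
    """
    with concurrent.futures.ThreadPoolExecutor(len(replicas)) as pool:
        futures = [pool.submit(r, request) for r in replicas]
        done, pending = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        for f in pending:
            f.cancel()
        return next(iter(done)).result()

# One replica is stuck in a GC pause; the other answers quickly.
slow = lambda req: (time.sleep(0.5), f"slow:{req}")[1]
fast = lambda req: f"fast:{req}"
print(hedged_request([slow, fast], "get_user"))  # fast:get_user
```

This is the mechanism The Tail at Scale describes for cutting tail latency when one replica stalls.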
Flow control over the network is another such thing. We used to talk about TCP-level flow control: if a producer is producing a lot of data and the consumer is not keeping up, the usual answer was, let TCP do its thing, and over a period of time the network will not let the producer send so much traffic. But now, with reactive systems and reactive load balancing, the consumers can give the producers application-level back pressure; the producer backs off, and we don't overwhelm the network or the application. Those kinds of things are more popular now, and more people have access to them, so when we build new systems, I would imagine we would build them using such technologies.

So that's pretty much what I had in terms of trends in building reliable distributed systems. Do we have questions or thoughts?

Sure, thank you. Rishu, Kalyan, over to you for responses.

Hi, I hope I'm audible. Great. So this was enlightening, thanks, Diptanu, and I think a lot of these systems and patterns we talk about are now becoming part of the industry as we go. I would, as always, point at one other aspect of it, which is the people who put these systems together, the behind-the-scenes people. I remember having a chat with Kalyan a while ago on this: every time we put together a system, there is a certain combination of parts we see the system covering, and there will always be an outlier that causes an outage, a problem the system will probably not be equipped for. When we talk about these very key aspects of availability and performance, how do we also ensure we have that feedback: if there is an incident, how do we track it, and how do we make these systems more and more adaptive?
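The application-level back pressure Diptanu described is, at its core, what a bounded buffer provides: when the consumer lags, the producer blocks and backs off. A minimal sketch with toy producer and consumer threads:

```python
import queue
import threading

# A bounded queue gives the consumer back pressure over the
# producer: when the buffer is full, `put` blocks, so the producer
# backs off instead of overwhelming the network or the application.

buf = queue.Queue(maxsize=4)
consumed = []

def producer():
    for i in range(20):
        buf.put(i)          # blocks whenever the consumer lags
    buf.put(None)           # sentinel: no more data

def consumer():
    while True:
        item = buf.get()
        if item is None:
            break
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed == list(range(20)))  # True
```

Reactive-streams systems generalize this across the network: instead of a blocking `put`, the consumer explicitly signals how many items it is ready to receive.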
Yes, machine learning is one way, and I think, yeah, we are truly heading towards what you would call a Skynet age, maybe more of a danger than that. But thoughts on how we actually run and maintain these systems? We say, okay, this is how a system has to be architected, put together, thought about; more insights on that. And the second part I would raise: when we talk about geolocation-independent systems, where we say it's super awesome that the client doesn't even know where the request is being handled from, do we take the same metrics, or do we have a model where even the teams working on these systems can use that model and correlate it to these metrics? Even in the times we are living in, that's a very pertinent question for all of us.

Yeah, these are very good questions, so let me answer the last question first, because that's the one I remember best. To me, every system boundary needs to be well instrumented. I'll go to an extreme here: on EC2, for example, if you're running an application and the application is using the file system, chances are pretty high that it is writing to Elastic Block Storage, which is a networked file system, and with anything over the network, I would expect the latency to vary, even over the course of a short period of time. So at a minimum, every IO request in that case needs to be instrumented. Now you might think: I'm using a POSIX file system API, how do I measure the performance of the POSIX file system? Back in the day, when POSIX file systems were created, 20 or 30 years back, there was no monitoring or metrics or instrumentation, but today you can use eBPF on Linux, or DTrace on what used to be Sun Microsystems, on OpenSolaris, or on a BSD kernel, and measure how well your POSIX APIs are doing.
That's an extreme example, but say a microservice is making RPC requests to another service: no matter where those RPC requests are going, that boundary needs to be heavily instrumented. If I'm on a team running a microservice that depends on five other services, the first thing on my dashboard would be how well my dependencies are doing, and during an outage, that's the first thing I would look at: what kind of errors and what kind of latency am I seeing from the dependencies of my system. First of all, does that make sense?

I think that's fine, let's keep going on this. It's a good way to see how somebody in those shoes would want to look at metrics, dependencies and so on. But the question goes a little deeper, in the sense that when we have people operating these services, there is obviously going to be an outlier and an outage. How do we build models for teams to work? Just like remote systems, you also have remote teams, and that's a very different kind of challenge: availability, time zones, knowledge asymmetry. These things come to the fore, and unfortunately they usually manifest themselves when things are down in the dumps. So how do we enable and empower teams to come up with a model that ties in these key system metrics and says, hey, this is how we can build teams that function better, like the site reliability engineer?
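The per-dependency instrumentation Diptanu describes can be as simple as wrapping each outbound call to record latency and errors. The in-memory metric store here is an invented stand-in for a real metrics system such as Prometheus or StatsD:

```python
import time
from collections import defaultdict

# In-memory stand-in for a real metrics backend.
metrics = defaultdict(list)   # dependency -> list of latencies (ms)
errors = defaultdict(int)     # dependency -> error count

def instrumented(dependency_name):
    """Decorator recording latency and error counts per dependency."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            except Exception:
                errors[dependency_name] += 1
                raise
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                metrics[dependency_name].append(elapsed_ms)
        return inner
    return wrap

@instrumented("user-service")
def get_user(uid):
    # Stand-in for a real RPC to a dependency.
    return {"id": uid}

get_user(7)
print(len(metrics["user-service"]))  # 1
```

With every boundary wrapped this way, the "how are my dependencies doing" dashboard is just a query over these counters.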
Right, right. Yeah, I think the protocol needs to be laid down. If you have built a team and you care about reliability, that means thinking about what happens when the system fails or a dependency fails: what do we, as operators of the system, do when something is perceived to be slow or perceived to be unavailable? What are the protocols we have? If I simplify it further, it basically means having a runbook that people follow, so that if you hire someone tomorrow and they don't know much about what is going on, there is a protocol laid out in front of them. We also have to have escape hatches in the protocol: even if a person can solve the majority of the problems in a system, the escape hatch is, how does that team member get help? What do they do to communicate effectively how much they have looked into the problem, and how do we communicate in the first place that there is a problem? If someone tells us something is broken, we need to understand what is broken; that's the first step of the protocol. So we design our protocols in a way that streamlines incident response, that streamlines how we react faster. That's something mature teams do to make it easier for people to work with the systems we are building.

Sounds good, sounds good. I'll let Kalyan have a quick go; I've been taking all the time here.

Thanks, Rishu. Rishu covered most of the conversations we both had. So, this whole conversation started, and I'll give some context behind it, with the question of whether working with distributed systems is really affected during a remote-work scenario. That's the context we started with.
Our systems are anyway remote; we never needed to be close to them, we always access them remotely, and, leaving diplomacy aside, we always treated them as cattle: if one system goes down, or we can't reach it, we make sure availability is not affected. But is the same level of distribution possible when teams work remotely? That's what cropped up for us. Yes, processes will help us to some extent, and we had some more conversations on why the processes don't come up; everybody has a process. So some of these things came to mind, and I'll just list them here; I think we can go around seeing how they are implemented, or what the starting points are for people to implement them. One thing that came to our mind is: if a process is being brought in, how do we measure that the process is working? Do we measure it the way we measure a system's uptime or a system's availability? Is there a way to measure it at all?

In my experience, whether it is distributed or in person, the heuristic for measuring how well we operate our systems is not how many nines we have; the nines develop over a period of time as we make our systems more reliable. In terms of how effective our process is, how well we operate our system, the heuristic for measuring success, to me at least, is mean time to react. There are various mean-time metrics. One is mean time to debug, which means: before we even get to things like mean time to recover, how quickly have we figured out the steps we can take to alleviate the user problem.
Often, what happens, I think, is that when we are working in a distributed manner, communication becomes crucial. If you and I were in the same room, versus on different continents, the main difference is communication, visual communication and so on. Working remote and distributed, visual communication goes out of the window unless we are doing video conferencing. During something like an outage, the key to working remotely is to communicate very well: to express what the problem is and what we are trying to solve in a very crisp manner, so that whoever is solving the problem, and everyone on the team, understands the current state, and from there we figure out a protocol for how we debug. And after the incident is done, someone has to record how long we took to react to the first alert, how long we took to understand what was happening, and how much time we took to alleviate the user problem. If you have a timeline of the things that happened during an incident, that's the first step towards measuring whether we were good or not. Once you have a good timeline, then during an incident review you can discuss how well you did as a team, as a collective, in resolving the outage. So two things: first the incident timeline, and second the incident review, which, in my opinion, can be used to measure this.
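The timeline-to-metrics step described above might look like this in practice. The timestamps and field names are invented; real incident tooling records something similar:

```python
from datetime import datetime

# An incident timeline: when the outage started, when we detected
# it, and when user impact ended. Values are illustrative.
incidents = [
    {"start": "2020-05-01 10:00", "detected": "2020-05-01 10:07",
     "resolved": "2020-05-01 10:52"},
    {"start": "2020-05-03 22:10", "detected": "2020-05-03 22:13",
     "resolved": "2020-05-03 23:00"},
]

def minutes_between(a, b):
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(b, fmt) - datetime.strptime(a, fmt)
    return delta.total_seconds() / 60

def mean(xs):
    return sum(xs) / len(xs)

mttd = mean([minutes_between(i["start"], i["detected"]) for i in incidents])
mttr = mean([minutes_between(i["start"], i["resolved"]) for i in incidents])
print(f"mean time to detect:  {mttd:.0f} min")   # 5 min
print(f"mean time to recover: {mttr:.0f} min")   # 51 min
```

The numbers themselves matter less than the trend across incident reviews: is the team detecting and recovering faster over time?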
That makes sense, that makes a lot of sense. So another thing that came up: what you said makes a lot of sense, primarily that communication is the key; secondly, you have to have a postmortem of how you do things and do a course correction if something is needed. This has to be a practice, and in many places, if it's not a practice already, bringing it in will always meet some resistance to change, primarily because people think it might not help. So some amount of evangelizing has to happen. Have you faced that kind of evangelizing, or do you think the metrics speak for themselves and you just make sure the process is running?

Fortunately, I have always worked in places where these things were already in place, so I personally haven't set up a program like an incident review program; they were already there, and I was lucky to go in, observe and be a part of it. Whenever there was an outage, I would present the story of the outage. But if I had to do it myself, I would refer to what John Allspaw says about incident management: creating a safe environment for everyone, and making the point that incident review is not just about understanding how well we did as people, as humans. Incident reviews are also a place where we talk about how our systems are going to evolve in the future. I have always found that incidents that were very hard to tackle, or that we were not proud of, led us to evolve our system architecture and system design. We talk about making data-driven decisions when it comes to building or evolving systems; incident reviews are a good place to present the case for why our system needs to evolve. So making an environment like that, where the focus is on improving our systems and processes, is a good start. But, having said that, I have not set up such a program myself.

Yeah, so Kalyan, that's actually a very interesting question.
Just taking a minute on it: I've had some experience setting up similar processes, incident management processes, and yes, this is where the people aspect, compared to the system aspect, comes in. To give you an instance, there was a scenario where we had a system with multiple tests, and we said, okay, what do we do to collect data to show how bad it is? We said, let's start filing incidents for every single SLA miss that happens. All of a sudden, the number of incidents we were filing went up by almost 100%, because now everything was getting filed. Were we capturing all the data we said we would? Yes, we were. But at the same time, on the people side, that process was very taxing, because every time we file an incident, like you said, there's a mean time to respond, a mean time to resolve, there's an incident controller; it took a toll on the engineers who were on those incidents, because you would have multiple incidents at the same time on a distributed system. So when setting up these processes, the scalability of the model itself comes into the picture. Looking back at it, this was some years ago, I realize it was a very data-driven, very systems-oriented process; asking a bunch of systems to do that versus asking a bunch of people to do that is a very different ball game altogether. And I think resistance will always come until people see an uptick in their daily lives. For example, if I'm making a system better, but it means the SREs or the engineers on board now work 10 hours instead of, let's say, 8, then obviously it is not going to be a model that becomes very effective, because you will see it fall apart elsewhere. So it's a very interesting problem domain in itself: how do we create models for systems and for people.
I think we could have a chat on it some other time as well; I'm just putting my own experiences on it. Got it. I think Neelish has a question. Neelish, do you want to pitch in here? You can unmute yourself and speak your question or the comment you want to make.

As we wait, Neelish, feel free to cut me off. From my point, before this talk we had a very interesting discussion about whether distributed systems are really affected due to this whole remote-work scenario, and one of the positions I took is that we are generally not affected badly, because most of our work can be done remotely; most of the problems an SRE faces are pretty similar to any other problems. Some nice points were brought up about communication: we have to be exactly to the point when there is an outage, and the war room is now virtual, so you have to communicate properly so that people are not stepping on each other's toes. The second thing I find challenging is that we have to build a community, because, as you can see from this presentation, in five years the landscape has changed; the tech keeps changing, so you have to constantly have discussions on how things are handled and keep yourself updated with what is happening. When you work remote, it depends on a person's availability: some might get time to catch up, some might not. That used to get offset once you had a workplace and some time spent on intellectual conversations with a peer group. So I think we have to actually spend some time having a peer group, be it virtual or otherwise, so that we stay updated.

Yeah, absolutely. I mean, the social angle is pretty important.
angle is pretty important: talking about how systems are being managed, how we evolve our systems, and, you know, when someone has a good idea for a new project, having those conversations near the water cooler or during lunch. I think it's important to catch up with co-workers; I certainly do, and we talk about even what I'm hacking on in my free time. We don't have any questions yet on YouTube Live, and I'm not sure if Neelish was able to speak out what he had raised. I do see a question in Zoom, though; it's from Ranky: how do you keep your teams updated from a competency perspective? Do you have playbooks and runbooks available? I think it's partly covered, but it touches two aspects: one is tribal knowledge; how do you make sure there is no tribal knowledge, and that knowledge is distributed across your team? And will playbooks and runbooks alone take care of that? Dishanu, do you want to take that? Yeah, for sure. Again, I think that's a pretty good question, and as I was saying, part of laying down the protocol is having people know where the knowledge base is: the knowledge base of how the system works, how the system is built, how the system gets deployed, how you manage traffic to the system, and so on. Pretty much every knob in the system for managing and operating it needs a knowledge base, and that knowledge base should not live in people's heads, because then we end up in a situation where we depend on people, on the collective knowledge of a group, to resolve a production issue, and things like that. So from that perspective, I like to think about it as: if I develop a feature and there is a knob to manage it, I'm going to document the knob before I put it in production
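As an aside, the habit being described here, refusing to ship a knob without its documentation, can even be enforced mechanically. Here is a minimal sketch in Python; all names (`Knob`, `KnobRegistry`, `speech.batch_inference`) are hypothetical illustrations, not anything actually used by the speakers:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Knob:
    """An operational knob for a feature, with its runbook text attached."""
    name: str
    default: object
    description: str      # what the knob controls
    how_to_disable: str   # runbook note: how to turn the feature off


class KnobRegistry:
    """Refuses to register any knob that arrives without documentation."""

    def __init__(self):
        self._knobs = {}

    def register(self, knob: Knob) -> None:
        # Enforce "document the knob before it ships": empty docs are an error.
        if not knob.description or not knob.how_to_disable:
            raise ValueError(f"knob {knob.name!r} must be documented before shipping")
        self._knobs[knob.name] = knob

    def runbook(self) -> str:
        """Render one plain-text runbook line per knob, for coworkers."""
        return "\n".join(
            f"{k.name} (default={k.default}): {k.description} | disable: {k.how_to_disable}"
            for k in self._knobs.values()
        )


registry = KnobRegistry()
registry.register(Knob(
    name="speech.batch_inference",
    default=False,
    description="Enables batched inference on the speech service.",
    how_to_disable="Set speech.batch_inference=false in the service config and redeploy.",
))
print(registry.runbook())
```

The point of the sketch is only that the documentation lives next to the knob and is advertised automatically, instead of in someone's head.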
And when it has gone into production, I advertise it to my coworkers: I say this is a new feature that has gone into production, and here are the knobs for managing and controlling it; if you want to turn it on, this is how you turn it on; if you want to turn it off, this is how you turn it off; and if you want to change some configuration, these are the steps to change it. That's something I personally do, and I would imagine that whether we are remote or not, this is something we have to do to keep our systems up and running 24x7. I just saw a message: "Hi, my name is Achal." Yeah, I think having a good culture around this is the most important thing you can set up when you're starting a new team or trying to build all of this process out. One thing we've done pretty effectively is that whenever someone new joins the team, we expect them to dogfood the service the team is responsible for, and to discover all of the gaps in the documentation written so far, due to feature drift or whatever; then collectively, as a team, we try to figure out the answers to those gaps so we can plug them. So yeah, to what Kalyan was saying, having a good culture around making sure all of this is well documented is pretty important. Just to introduce myself: my name is Achal, I'm a software engineer up here in Seattle. I was at Uber for about four and a half years; I worked on their infrastructure and then on their machine learning platform, where we operated a bunch of machine learning infrastructure at scale. Very recently I changed jobs, and I'm now working at a startup in the data/ML space, building feature pipelines and a feature store for small-to-medium businesses, to onboard all of their feature transformations. So did
you work on Horovod, Michelangelo and those kinds of systems? Yeah, so my team owned Horovod; Michelangelo is a pretty large team, but my sub-team owned Horovod and a few other things in their deep learning infrastructure space. Cool, pretty cool stuff. We have roughly eight-odd minutes before we close, and no further questions yet from the audience watching live, I suppose. Yeah, if you want to do a wrap-up or a summary, this might be a good time, should you want to take it. Yes, so one of the things I want to bring up, along with the documentation front: yes, for a new joiner it's usually the practice to hand over the documentation and then see how far it's still valid in today's world; there might be feature drift and so on. So it becomes really difficult when you onboard a new joiner at this point, when everybody is still working from home. Actually, I think it's not so much difficult as it involves more effort, because again it boils down to communication: you can't have a one-on-one whiteboarding session, it becomes a virtual session, and sometimes I feel people lose that same emotional touch. Questions get refined; certain questions get pushed down when you have to write them out and ask a little later, rather than walking up to a person and talking to them. But I wanted to add that on that front. Yeah, that's fair. In my last month or so, when I actually had to onboard a new engineer onto our team, we faced similar issues: it's hard to get communication right because you're not meeting in person and you've never met this person face to face. I don't have a magic bullet here; you just have to try harder, make the effort to go the extra mile, to make sure you're creating an environment where questions can be asked and you can connect with folks. There's no magic bullet other than going the extra mile to make sure
that you over-communicate, and that you err on the side of soliciting more questions, comments and feedback from folks when you're onboarding new people, just to make sure you can answer these questions and folks can be productive sooner, rather than them holding on to their questions and coming back to them at a much later time. Makes sense. Yeah, so everything boils down to how we communicate. To put it from another perspective: the systems have evolved faster than how humans communicate in this situation. We've spent time making the distributed systems work well; now it's time for the human piece to catch up and keep the systems up and running, even with the whole team distributed, so there are no gaps. There have to be processes which fill that piece of it. There's one more question I got on Zoom; we could answer it if we have something on hand, or later, since I think the presentation will be shared, we could put it there. The question is: what blogs or go-to resources do you refer to from a reliability point of view? So, personally, I don't really read a lot of blogs these days. The Google SRE book was pretty good; it came out a couple of years back, and I think they have a new book now; they're pretty solid. Back in the day I used to read Etsy's Code as Craft; I don't know if it's still good or not. I mostly read papers from conferences and so on, and that's pretty much how I keep myself updated. I also read ACM Queue; it's a magazine, and I think it comes out every few months. I did read USENIX's ;login: as well once in a while, and LWN is a good resource if you want to learn about systems engineering and generally about what's current. On the data science front, I think there are a bunch of Medium blogs, and Distill is pretty good. That's pretty
much it. I would like to hear what everyone else is reading. Actually, if you follow the right people on Twitter, you can tailor your Twitter experience to be heavily focused on systems and distributed systems; that's basically what I've ended up doing. That's true, actually. For distributed systems, honestly, NSDI is a good conference to follow; most of the interesting work that comes out lands in one of those conferences, and OSDI is another good one. That's a good amount of resources, I feel. I think we're done; if we have no questions, we could wrap up. Sure, I'm just checking once more whether there are any questions on YouTube Live. No, I think everyone is waiting for your slides, Dishanu; the only questions on YouTube Live are whether we are going to get the slides, so I suppose there is some gold mine hidden in there. Dishanu, we will have to make the slides public, and we will get in touch about that. Thank you very much; I think this was an interesting conversation, and we should definitely catch up privately on how we can continue to have these conversations. These are important issues of our time: while tech is tech, the fundamental issues really boil down to culture and organization. Thank you, Achal, also for joining the chat; it was really nice having input from you, and we hope you have a good night's sleep tonight. For those of you watching us: these events are hosted on our platform, so do sign in, join up, and tell us what you would like to hear more about. We are listening, and we are happy to put things together for you. Thank you, Rishu and Kalyan, for moderating this, and have a good day, the rest of you. Bye. Thanks, everyone. Thanks a lot, guys.