Okay, we'll start with the last talk for the day. So, hello everyone. This is the last talk, so I'm the one keeping you from going home. Actually, it's dinner? Oh, it's dinner. Okay. So the talk is about how you improve fault detection and how you capture real-time analytics, with telemetry, in any application that you write as a software engineer. So how many of you know what telemetry really is? Okay, so I'll get to what telemetry is. But before that, I'm Siddish. I work with Microsoft Research as a research fellow, with the Technology for Emerging Markets group. Prior to joining Microsoft Research, I was with Microsoft Engineering, where we worked on this product called telemetry capture, or Application Insights. And what we did was improve the reliability of one product, Office, which most of you know, from 99.99% to 99.999% using this tool. So I'll take you through the story and how this tool really helps you. And if you have a startup or if you're a business, you understand what one minute of downtime could mean. So this is how you can avoid mistakes. So consider that it all starts with the printing press. That's where we first have documented evidence, where documents started coming in. There was some amount of data that the printing press had to collect back in 1450. But now, if you see, the amount of data that we're looking at is humongous, and we have the same amount of time. So what do we do with such a problem? We have the same time, but we have lots of data which we have to analyze. So we try to skim through it. We use machine learning. We use different kinds of techniques to find out what this data really means.
So this is what I'll be covering in the talk: a brief overview, and then the current software engineering and debugging issues that really take place. What are the mistakes that many people or many startups actually make when they're writing software? How do you fix such mistakes, and why is fixing them really, really important for you? So if you see, the major issue that we find with debugging and performance monitoring is that it's very hard to find out where the problem is. If you've hosted an application on the cloud or if you run your own servers, there are times when your server just goes down, or your customers complain to you saying, I can't access your website or I can't access your web application. But then it becomes a huge problem for you to find out where exactly the problem is. Now, it's generally a very long process in most companies and most startups. You go back to the code and first check what the problem is. You try to debug every single statement to find out where the problem is or why it has occurred. Then you try to narrow it down to figure out if it's a server problem, a hardware problem, or a software problem. And then you try to find out if it is a library problem. This generally happens with dependencies, especially with the very modular ecosystems we use today, like Node.js or Python, where there are lots of libraries we just download. Sometimes these libraries get updated, and the update might break the rest of the functionality of your application. And many times, because they're open source, there is probably documentation out there for that change, but you may not be aware of it until you actually go back to it. So these are the main reasons why many startups and many companies have regular issues with their software solutions.
And the main challenges are that we have large code bases, we plug in too many libraries and too many dependencies, and we have moved most of software engineering practice away from the waterfall model, where you first sit and design, then you code, then you show it to your customer, you test it, and then you deploy it. So we moved from such a cycle to doing two-week sprints, as we call them. Because of this, we are actually doing all these steps on the fly and we're doing them really, really fast. And whenever we do something really fast, there is a high chance that we mess it up. Continuous delivery is the term used where you don't have to manually go and deploy code on the server, which is how it was done until around a year or two ago. Right now there are systems where you push your code, they compile it, they generate your binaries, and then they deploy it on your required services. So there is a lot of middleware that you're using to get one website or one application up and running, and there could be a problem anywhere in this entire pipeline. So what are the common mistakes that many people make when they're engineering these solutions? They make assumptions about how the user is going to interact with the system. So if they say, I'm going to place a button here, the user may or may not click that button. How do you know that the user will really click that button? How sure are you that the user understands the design? So these are some of the questions. Added to that, many startups don't do test-driven development, so you have absolutely no unit tests. I mean frankly, let's admit it, how many of us have written unit tests religiously? Oh, you've written unit tests religiously. Okay. Similarly, there are lots of companies which still do manual deployments to their servers rather than using continuous integration and deployment. And they start thinking about XIT.
That's what we call it. So XIT refers to scalability, configurability, extensibility, any of these "-ity" words. They start thinking about this all the way from the start. Even when their application is very, very small, they start breaking it down into microservices, which you probably don't need at the scale at which you're operating. This is generally engineering overkill, and it becomes very hard to maintain over time. So let me give you a classic case of over-engineering. This is a case that has happened, where a project manager comes up and says, hey, let's build a CMS which is configurable, which is very good for extensibility, so my customer can add whatever form he feels like, and the database should accept those fields, make a table in that specific fashion, and take his data in that specific way. And the developers who are working on this project think it's a very challenging idea. They think it's amazing. They think that the world needs something like dynamic form generation, dynamic table creation, and things like that, and they start building solutions for that. So after a lot of design patterns, after using a lot of these fancy software engineering practices, they end up providing a solution to the customer. But what the customer really needed at that point was just one developer who could sit with them and configure it the way they needed it. So you didn't have to build a generic CMS to do what the customer had asked for, but this is what you felt would be very good engineering practice. So there is obviously a debate. Have you given a very good engineering solution, or have you wasted a lot of man-hours of programming which could have been put to use somewhere else? I will leave the decision on that debate to you, but this, I believe, is a classic case of over-engineering. Now what exactly is telemetry?
So telemetry is the process where you detect, triage, and diagnose all of the operations that take place in your entire application ecosystem or your entire company, in one step and all in one single place. So you have the data about how your servers are working, how your application is working, which parts of your application are taking how much response time, and things like that. All the analytics that you need are there in one single place. So why should you as a company use telemetry? Because you're probably using multiple tiers and components. There are lots of components that you use from third parties, and there are components that you've built yourself. You're probably hosting your applications in multiple data centers, so it's geographically distributed, and you want to know in case one particular service has gone down. There are also different clients with different browsers, and there are different mobile phones. You want to understand who your clients really are and how they're interacting with the system, and you want to understand the behavior of your clients in much more detail. So to overcome all of this chaos, you need to use telemetry. What it does is simplify the diagnostic process, which is shown here. The first thing that generally happens in a diagnostic process is that a customer comes and complains about the app. He says, hey, I'm unable to log in with Facebook or I'm unable to log in with Twitter. Can you fix this problem? And then you try to find out how many times this problem is occurring and where it is occurring. Is it very region specific? Is it because I have configured my Play Store listing to be only, say, for India or only for Singapore? Or is it some other reason? Is the error very unusual or is it recurring? These are the types of questions that you as an engineer or as a program manager would ask the rest of your team.
But these are exactly the questions for which you need data, and you need it really, really fast to fix the customer's problem, because you have to obsess about the customer. You need to understand that you will lose the customer if you don't fix the problem in a given amount of time. And going through this entire process of finding out where the error is, how it is happening, and when it is happening is very troublesome. So telemetry is very useful because it puts the entire workflow in just one box. So where should you use it? You use it in mobile apps, in desktop apps, in web applications, in servers and background services, and even for infrastructure-level things. In case you're running machine learning algorithms, in case you have TensorFlow clusters or Hadoop clusters, you probably want to find out how much GPU or how much CPU is being used. In case you're a storage company like Dropbox, for example, you want to find out how much storage is actually left in your data centers and plan well in advance to add new servers to the rack, and things like that. So what do you do? As a developer, you add a few lines of code into your application to capture the telemetry in between all of your code. It could be middleware, front-end clients, backend services, or servers. You add little bits of code here and there to capture all of this data. Then you push all of this data into one single cloud service, which you can host yourself or use as a SaaS. You run some machine learning on that, and you have a portal which gives you all the insights that you're looking for, like how much disk is being used, which application is doing well, which application is not doing well, what the response time is, and things like that. So why exactly is it important to developers?
So you get to see how the app is performing inside out, and you have the data for very specific questions, like what is my total page load time? What is my page load time on different clients? How are people on 3G doing compared with people on 4G? How many people are actually using my application from 2G networks? How many people are using it from Wi-Fi? How many people are using it from different countries? So all the types of analytics that you probably need, that is something a developer is very interested in. He's probably interested in how much time his API is taking, how long people wait on a particular screen, and things like that. Also, there are a lot of dependencies that you probably use. There's NoSQL, things like that. And if you have very complicated queries or triggers and all of this processing, you want to find out which trigger or which database entry is taking how much time. Can we optimize these procedures? Similarly, you can also find out if there are any failures, which is something you're more interested in. You can find out if the exceptions are caused on the client or on the middleware, and you can also find out why these exceptions were caused. Does the client not get access because he has a very flaky internet connection? Or because he's getting timed out by your server, which he doesn't understand? Or is it because something is going wrong at a specific time interval on your servers, because you're doing some maintenance which is affecting a lot of people? So you get to answer all of these questions in just a matter of minutes, from the moment your application goes out. It gives you very specific data, for example that a failure has affected 50 users, and that 75% of the users who have come onto your website in the past one hour have experienced the same problem.
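As an editorial illustration of the "affected 50 users, 75% of users in the past hour" style of figure, here is a minimal sketch of how such a number could be derived from raw request records. The `RequestRecord` type, the `failure_impact` function, and the choice to treat any status code of 400 or above as a failure are assumptions made for this example; they are not taken from Application Insights.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Tuple


@dataclass(frozen=True)
class RequestRecord:
    """One captured request (hypothetical shape, for illustration only)."""
    user_id: str
    timestamp: datetime
    status_code: int


def failure_impact(
    records: Iterable[RequestRecord],
    now: datetime,
    window: timedelta = timedelta(hours=1),
) -> Tuple[int, float]:
    """Return (affected user count, share of active users affected).

    A user counts as affected if at least one of their requests inside
    the window ended with a status code of 400 or above (assumption).
    """
    recent = [r for r in records if now - window <= r.timestamp <= now]
    active_users = {r.user_id for r in recent}
    affected_users = {r.user_id for r in recent if r.status_code >= 400}
    if not active_users:
        return 0, 0.0
    return len(affected_users), len(affected_users) / len(active_users)
```

Calling `failure_impact(records, datetime.now())` on the captured records would give the two numbers a dashboard could show side by side.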
So they've all probably hit a 503 page or a 404 page or something like that. So something has gone wrong. And why is it important for program managers, who manage this entire team? It's because it gives you a lot of ways to understand your customers. If I were Facebook, I would probably be interested in how much time my users are spending on my website. Where are they clicking, what are they doing, where should I place my ads? How should I make the experience more engaging? Is there a feature that users don't like, where they click and they just go back and don't want to go any further? Where are these bounces? So it's very good for getting feedback, and you can make data-driven decisions from it. It's not just that a customer comes to you and says, I don't like this button. You get to actually understand why that button is not being used in your application. Similarly, you can release the same feature in different implementations to different groups of people. Say, for example, you're a startup and you want to test out a comments feature. There are different ways in which people use comments. Some comments are like how we do it on Reddit. Some are a reply-based method like Twitter. Or you could have comments like on Facebook. So if I were a startup, I would probably push all three versions at once to different groups of people and gather data from them. This process is called flighting, and telemetry allows you to do all this flighting. It allows you to test different design approaches at the same time so that you can gain user feedback and take the best of each. You can take the best of this, the best of that, probably combine them and retest. So it helps you make very good decisions on what design is really useful and how the application has to be built. So why is it useful for a startup?
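As an editorial aside on the flighting idea described above: one common way to split users into groups is to hash a stable user ID, so the same user always sees the same implementation. The sketch below assumes that approach. The variant names and the `assign_flight` function are hypothetical, and the talk does not say how Microsoft's tooling actually assigns users.

```python
import hashlib
from typing import Sequence

# Hypothetical names for the three comment designs mentioned in the talk.
COMMENT_VARIANTS = ("threaded", "reply_based", "flat")


def assign_flight(
    user_id: str,
    variants: Sequence[str] = COMMENT_VARIANTS,
    experiment: str = "comments",
) -> str:
    """Deterministically map a user to one variant of an experiment.

    Hashing "experiment:user_id" keeps the assignment stable across
    sessions and independent between different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Each telemetry event would then carry the assigned variant as a property, so that engagement can be compared per group.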
Every crash could really, really cost you. And you could say the same thing for large companies. Say, for example, Amazon. The recent Amazon S3 outage lasted, I think, six hours or so, and they had a tremendous amount of stress on each one of them. That would have been solved if they had telemetry with them. Similarly, yes, you can use this data to make marketing decisions for your company. That's something I'll explain in the story that we have. Yes, sorry, you had a question? It doesn't work. Which one? The feedback thing. Because on YouTube, every time I get an ad, I click on skip and I still get ads. They should know that nobody works with them. All right. So I'll run you through a few of the use cases that we've seen for a few products that we launched at Microsoft. You can track how many people are downloading these apps. So initially the app was posted, then the excitement died down. They reinvested in marketing, and you can see it has stabilized. So you know that you can take these kinds of decisions with respect to your marketing. Similarly, you can track how many of these users are returning users of your application. Say, for example, in July 2017 or 2016, you had so many people joining, and this is how well you retained those users over an entire year. Similarly, you can see real-time metrics. There were 741,000 people online at a given point in time in Office 365, and at 6 a.m. there was a crash on the server. You could see that one crash on the server resulted in an increased page load time, an increased server response time, and a drastic reduction in the number of requests that were actually sent to the server. Similarly, you can also see the languages in which people are using it. So 29.14% of the total users use it in English. The rest of them use it in Chinese, German, and so on.
But what Microsoft, or any company for that matter, has been doing for a very long time is investing only in advertising in English, for example. Using data like this, you can say, I probably need to invest more in advertising in Chinese, because I get people from there. So for a startup, this is very crucial. It allows you to make decisions like that. Yes, you want me to hold that? Sorry? Yeah, the server goes down. It generally is load balanced, but I think this was a case of a very extreme outage. If that happened, so... It happens. I mean, you see, there were absolutely no failures, and then there were a few failures that showed up. This is probably because some developer wrote some very buggy code and then it got fixed, but I think something happened and it showed up again. So issues happen, engineers make mistakes, but this has been fixed before the customers actually come and complain to you, in a span of a few hours. This is, I think, just three hours to fix the code. So from the moment you figured out where the problem was, you fixed the code within three hours and you deployed it onto the system. In those three hours, you probably had very few hits to your website, and it was probably night in America, but a fix was still shipped out. So it matters. I mean, when you're operating at a global scale, a production outage could cause issues. So how do you really use telemetry in your application? There's this product from Microsoft called Application Insights. You can head over to the link, where a lot of things are shown, and these are all the languages that we support for writing telemetry.
So in case you have a Ruby server which is running a website, you can include JavaScript and Ruby together, and in case you have something else, like a WordPress website, you can track how people are using your WordPress website. It's quite simple. It's just a matter of a few lines to include this. You import telemetry from the library and you track a specific event. Say, for example, you're a game company. You want to track how many times people are pressing the up key, how many people are pressing different keys, how many people are probably cheating, things like that. It's very easy to do that with just a few lines of code. And for the client, it's actually really simple. Just like how you do it for Google Analytics, you copy the code that you get on the Azure portal, place it into your website, and everything magically works. But wait, what are the alternatives that we have? We have Snap, which is an open-source telemetry framework. The difference is that Application Insights is software as a service on the cloud, whereas Snap has to be hosted by yourself. So you use infrastructure as a service, you spin it up, and then you host this application all by yourself. And so in this case you need telemetry for your telemetry. You have to understand how your telemetry application is doing and what happens if this guy goes down. So that's a circular problem in itself. But yes, there are a few advantages and disadvantages here and there. Application Insights is purely Microsoft tech. Snap you can integrate with AWS, with Cassandra, with RabbitMQ. You have a query language for Application Insights, whereas for Snap you don't have a query language. You have to download the log files and look at them yourself.
But one of the major problems with Snap is that the telemetry that you collect is not customizable. As a developer, you cannot write code saying, hey, someone has pressed the left key, or someone has touched this particular button on your website. You cannot write that, because it has predefined packs, like for WordPress, for Joomla websites, for Medium websites, things like that. So you just use it by default and it gives you some metrics regarding the application. And I understand you have a few common questions in mind. What about the system load? Does including telemetry as a library hamper my performance? How long is the data available for me to make decisions from? How expensive is it? And will it slow down my web pages, because I'm adding a new library? So the answer to all of this is that it does not send requests to the server on the fly. Whenever there's a button click, if you have programmed it to find out how many people have clicked this button, it does not send a request immediately. What it does is cache the requests in your web browser, and at some later point in time it sends all the data to the server at once. It does this periodically, at an interval of every one or two minutes or so. So yes, you get data which is slightly outdated, but for an engineering system, I think that's more than enough. Similarly, there is a company called OSIsoft, which is an oil and gas company. It's within the Fortune 50. They use telemetry for most of the IoT devices that run on their oil fields. So they get to find out which sensor is going down or which sensor is giving faulty data. How do you fix such a problem even before a hazard has actually happened? So what do we take away from this?
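As an editorial illustration of the two ideas above, custom events such as "someone has pressed the left key" and buffering events instead of sending each one immediately, here is a minimal sketch. The `BatchingTelemetryClient` class, its `track_event` and `flush` methods, and the `send_batch` callback are hypothetical names invented for this example. It is not the Application Insights SDK, and it flushes only when an event arrives after the interval has elapsed rather than running a background timer.

```python
import time
from typing import Any, Callable, Dict, List, Optional

Event = Dict[str, Any]


class BatchingTelemetryClient:
    """Buffers custom events and hands them to `send_batch` in bulk."""

    def __init__(
        self,
        send_batch: Callable[[List[Event]], None],
        flush_interval_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send_batch = send_batch          # e.g. one HTTP POST per batch
        self._flush_interval_s = flush_interval_s
        self._clock = clock                    # injectable for testing
        self._buffer: List[Event] = []
        self._last_flush = clock()

    def track_event(self, name: str, properties: Optional[Dict[str, Any]] = None) -> None:
        """Record an event locally; flush only if the interval has elapsed."""
        self._buffer.append(
            {"name": name, "properties": dict(properties or {}), "time": self._clock()}
        )
        if self._clock() - self._last_flush >= self._flush_interval_s:
            self.flush()

    def flush(self) -> None:
        """Send everything buffered so far as one batch, if there is anything."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send_batch(batch)
        self._last_flush = self._clock()
```

A game client could call `client.track_event("key_pressed", {"key": "left"})` on every key press and pay for a network round trip only about once per interval, which is why the data on the portal is slightly behind real time.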
As a developer, telemetry, or trying to gather data about your customers in a voluntary fashion, is something that you have to consider, but it is generally not considered by most teams. Similarly, there is generally no single platform where most of the engineers or most of the products push all the data from their entire infrastructure. Also, it's always better to take data-driven decisions than to take decisions just because you feel like it, and telemetry allows you to do that. So Microsoft has been one of the largest contributors to open source. It is, I think, the topmost contributor to open source this year, or I think from last year, from 2016. There are a lot of projects that are available: Visual Studio Code, the .NET platform, Microsoft CNTK for machine learning, and TypeScript. Apart from that, we have a lot of things open from Microsoft Edge, and similarly from the Azure cloud, from Office, and also from Microsoft Research. So feel free to download them, contribute to them, and work on them. And to help you out, there are one-month free Azure passes in case you're interested in going ahead, exploring, and testing out the solutions. And in case you need an extension, ping me. So that should work. Thank you very much. I'm open to questions in case you have any. Sorry? Oh, you want the pass you other? Okay. Is that okay? So if anyone has a question, I'm open to answering it. Okay, great. Thank you so much.