We represent Akamai; most of you would probably have heard of Akamai. We are a technology company whose expertise is in getting content to our customers' customers, fast. We come here representing Akamai with respect to what it does for the analytics space. The first part of the presentation is going to be about the product: what our analytics product offers, what its capabilities are, and how we help our customers. We also realize that there is a huge tech crowd here, so the second part of the presentation is going to be about the technology which serves our analytics needs. I am going to be talking about the product, and my colleague is going to be talking about the technology side of things. If you have questions, please hold them until the end; we will most probably keep a good amount of time for Q&A.

Over the last couple of years, one thing we have all come to realize is the tremendous growth of the internet, both in terms of the number of people using the internet and in terms of what they consume on it. The amount of content that flows through the net is so huge that we need technology to enable such tremendous growth. As per one study, by 2015 more than 62% of online traffic is going to be video. We have already started to see it: I have started watching movies online. Just three or four years back in India, I would never have believed I would start watching movies online so soon; if I played even a YouTube video, it would rebuffer every once in a while. But now the bandwidths are so good that I can watch an online video in full, even an HD one, at home. Such is the growth. Looking at all such developments in the field, analysts predict that more than 62% of online traffic is going to be video. If that is going to be the case, we need tools which actually help us monitor how this ecosystem is faring in terms of enabling such huge growth in online video content. That is where media analytics comes into the picture. What we are going to go through is what media analytics is in general, and what Akamai Media Analytics is in particular.

So why media analytics? Anybody want to take a shot at that? I just gave you a hint; that's one important point. Anything else? Absolutely. There are two sides to media. One is that you are in the market because you want to make money at the end of the day. And if you want to make money, you need all the required resources to ensure that your customers are happy with what you are giving them. Those two are the prime factors behind media analytics: visibility into the business aspect of it, as well as the quality aspect of it, is of critical importance. Maximizing your monetization, like you said, and optimizing your product portfolios.
From a business angle, you want to ensure that the content you put online is what your customers want. You don't want to go on perceptions; you need good statistics behind you which tell you this is what your customers want, because you are going to put a lot of your money into getting such content online, and a lot of your money into ensuring that that content is what your customers want. Then there are the marketing efforts, like you said, managing your distribution strategies, and looking after how you are going to manage your costs overall.

Going further on why media analytics: as someone who is in the online content business, you would like to know who is consuming your media. What is your user base? Where are they from? What are their media consumption tendencies and patterns? There are different aspects to how a user consumes media. There are demographic aspects; we in India are probably more interested in certain content than most other countries. Such inferences are to be drawn, because that is what finally helps you decide what content you want to put on your front page if you are, say, a news website. There are professional and social tendencies as well. How are your users consuming your media? Like we already discussed, what do they watch? And once we know what they watch, we also want to know how long they watch, because user engagement is very important. Most online businesses really run around ads, and that is one reason why you would want to engage your users for as long as you can, because that is what in turn drives ad revenue.

To follow up on that: once you know what your user base is, you also want to know whether your user base is loyal. Once you build up a certain group of users, do they stick with you, are they repeat customers who come to your website every day? Or is your audience more of a churn: they come and go, you don't have a lot of repeat audience, but you have a lot of moving audience. That in turn helps you decide what content you want to put up, and if you want to do some sort of recommendation system (if you want to track what I watch and recommend similar videos to me in future), you want to know whether I am a repeat customer or a new customer.

And once you have handled the business aspect of it, then it comes to the quality aspect. What are the tipping points to consumption? I know that I want to engage my audience well enough, but what is it that can add a barrier to me engaging my audience? The quality aspects of things: like I mentioned, it could be the playback quality. Do my users rebuffer a lot? Does it take a lot of time for a video to start when they hit play? What quality of video are they experiencing, that is, the bitrate at which your video is being served? Because the quality on screen also matters to me as a user: if it is good enough on screen, I will keep watching; if it is jittery, I am just going to abandon the video and move on to something else. So this is where Akamai Media Analytics comes in.
We have now seen the key driving factors: audience engagement and the quality of streaming. Akamai Media Analytics has modules which exactly cater to the needs of analytics with respect to audience engagement and the quality aspects. Media analytics is all about awareness: you want to know what is happening with your content, who is watching your content, and so on. And once you have the metrics collected, once you have data on what users want and what the quality is, you want to draw inferences out of it, because the very meaning of analytics is deciding what you want to do next. Just having a set of key metrics is not going to help as much as being able to infer things out of it: if I have this data today, and I make changes A, B, and C, I am going to drive so much more revenue tomorrow, or improve my quality so much more tomorrow. You want such inferences to be drawn, and that is where the solution helps.

Akamai Media Analytics also stands independent. Akamai is a major technology enabler for streaming as such, but that is not to say that the analytics module is tied to Akamai's streaming network. Our analytics module can work regardless of which network a content provider is using for streaming.

This is how the media analytics solution at Akamai works. It is client-side analytics, in that all the metrics we collect, including the quality metrics, are from the end-user machines. Whenever you play a certain video, there is a plugin component that sits with the video player; it collects quality metrics and other related metrics as seen on the end-user machine. The data collected on the end-user machine is beaconed back to our analytics cloud, where a lot of processing, aggregation, and inferencing happens, and that is what we make available to our end users as part of the GUI application.

The first module we have is audience analytics. Whenever we talk of an analytics solution, we talk of reports, as some of the earlier talks today also touched upon. Reports are the basic need of an analytics solution, and there are different kinds of reports. You have canned reports: as an analytics provider, you know what most of your users need, so you ship those reports, and then you make way for new reports to be defined. Customizability of the analytics solution is very important, because you don't necessarily know everything your customers might need. All you have is a guess, and probably some statistics which prove that this is what they will definitely need; but what more? So audience analytics is a set of reports, standard and custom. It ships with a set of standard reports, and then you can go and build your own reports based on your needs. There are dashboards where multiple reports are put together as a quick summary for business executives, on the media consumption side as well as on the quality side. As we go further into the slides, we will take a few sample reports and dashboards and see what kinds of metrics make them up.

The second module is the quality of service monitor. This, like the name says, is mostly to do with quality monitoring; it primarily tracks quality metrics, along with some engagement metrics.
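As a concrete illustration of the client-side beaconing described above, here is a minimal sketch of what a player plugin could do. The endpoint, field names, and wire format here are all hypothetical assumptions; the talk does not describe Akamai's actual plugin API.

```python
import json
import time
import urllib.request

# Hypothetical collection endpoint -- the real beacon URL, field names,
# and wire format of Akamai's plugin are not given in this talk.
COLLECTOR_URL = "https://collector.example.com/beacon"

def send_beacon(session_id, event, metrics):
    """Package one player event with its quality metrics and POST it."""
    payload = {
        "session_id": session_id,       # one playback session on the end-user machine
        "event": event,                 # e.g. "play_start", "rebuffer_start", "abandon"
        "ts": int(time.time() * 1000),  # client timestamp in milliseconds
        "metrics": metrics,             # e.g. startup time, bitrate, buffer length
    }
    req = urllib.request.Request(
        COLLECTOR_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget; a real plugin would batch and retry

# Example: the player reports that startup took 2.3 seconds at 1500 kbps.
send_beacon("sess-42", "play_start", {"startup_ms": 2300, "bitrate_kbps": 1500})
```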
On the quality side, we collect metrics like the jitter that users end up seeing, the rebuffering activity they experience, and whether they are able to access your content without any hiccups. The key difference between these two modules is that the quality monitoring you do has to be as close to the content being watched as possible. It has to be in real time, because if there is a quality issue now, there is no point in analyzing that issue after a day: the user has come, experienced bad quality, and dropped off, and you have actually lost an opportunity to engage that user. So you want to see your quality performance as soon as possible, as close to real time as possible. That is where the differentiation is: the QoS monitor works in real time, and the metrics are all visible to you within a minute of them happening. Audience analytics, on the other side, is more about historical reporting: we collect a lot of metrics and make them available for the user to go and analyze. Again, because we are close to the end user, we are distribution-network independent, and we are cross-platform: like I said, we have a plugin which beacons data to us, and we have plugins across all the major platforms, Flash-based players, Android, iPhone, HTML5, and so on.

Here is an example dashboard from audience analytics: the business summary dashboard. It summarizes a lot of business-oriented metrics for you. When I talk of business-oriented metrics, I mean, for example, the viewer trend: how many viewers you have and how many of them are unique; the geography your users are from; your prime traffic drivers, that is, which sites are driving users to you; or which content is being played the most, so that you can figure out what category of content you want to put on your website, because you want to grow your audience. The top part of the dashboard is the KPIs, the key performance indicators. When you look at the business aspects, you want to know: how am I growing my viewers on a day-by-day basis? What is the average play duration per viewer? What is the play percentage? You might have thousands of videos on your site, but if users come, play for a few seconds or a minute, and go off, you want to know, because you don't want to hold on to so much content if users are not watching it. So what is the play percentage, how much of your content are your users actually playing? All such details are available at a glance on the dashboard.

Likewise, on the quality side, you have the quality summary dashboard, where we talk about the KQIs, the key quality indicators: on average, how long do your users spend rebuffering? What bitrate do they experience? What is the average startup time before your content starts playing? And then there are various individual metric-related reports, like startup time trends and rebuffer trends. We'll run through some of these reports as a quick snapshot of what the product does. Like you see here, this is a viewer-based report where I can track how many of my viewers are unique and how many are repeats.
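Before going through the individual reports, here is a hedged sketch of how the KQIs just mentioned (average startup time, rebuffer time per minute of playback, average bitrate) could be derived from per-session records. The record layout and the numbers are invented for illustration; the talk only names the metrics themselves.

```python
# Per-session records as a hypothetical beacon aggregator might emit them.
sessions = [
    {"startup_ms": 1800, "rebuffer_ms": 0,    "play_ms": 360000, "bitrate_kbps": 1500},
    {"startup_ms": 5200, "rebuffer_ms": 9000, "play_ms": 42000,  "bitrate_kbps": 700},
    {"startup_ms": 2400, "rebuffer_ms": 1500, "play_ms": 600000, "bitrate_kbps": 1500},
]

n = len(sessions)

# Average startup time, in seconds.
avg_startup_s = sum(s["startup_ms"] for s in sessions) / n / 1000

# Rebuffer time per minute of playback, averaged over sessions, in seconds.
rebuffer_per_min_s = sum(
    s["rebuffer_ms"] / (s["play_ms"] / 60000) for s in sessions if s["play_ms"]
) / n / 1000

# Average bitrate experienced, in kbps.
avg_bitrate = sum(s["bitrate_kbps"] for s in sessions) / n

print(f"avg startup: {avg_startup_s:.1f}s, "
      f"rebuffer: {rebuffer_per_min_s:.2f}s/min, "
      f"avg bitrate: {avg_bitrate:.0f} kbps")
```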
Again, this is the viewer trend on an hourly basis, because there are times of the day when you get more users than at others. This is a report, for example, which gives you the key traffic drivers to your site, the top titles, the top content categories that your users are watching.

Then ads: your ad strategy plays a key role in how you place your content on the site. You want to know, by placing ads where you do, what you are going to lose or what you are going to gain. Maybe there are people who like watching ads of a certain category, and certain categories of ads don't really attract user attention; you don't want to put up such ads, because you are going to lose an engagement opportunity. As you can see in this report, pre-roll ads, the ads that come before your video starts playing, are the most prominent cases of user abandonment.

Then startup time. As you can see here, there is user abandonment as startup time increases: from the moment I hit play, if it takes longer for the video to start playing, like in this case, if it grows beyond five seconds, the user abandons and moves on to something else, and I have lost the engagement, unfortunately.

In this case, the rebuffer metric: this report is actually plotting play duration against rebuffer time per minute. As the rebuffer time per minute gets high, when it goes beyond five seconds, the play duration that users play the content for drops, which means that if your quality is low, if the rebuffering is high, your users abandon and move on, and you need to do something about your buffering.

Then we have the bitrate trend. This is important for packaging. If you see here, the 1,500 kbps stream is what your users are mostly watching, so you don't necessarily want to spend on encoding your streams at bitrates or resolutions your users don't care about. Again, there is a rebuffer-related one, and an availability report: the availability of the content that users tried to play.

Moving on to quality. The five primary metrics of quality are the audience, the availability, the startup time, the rebuffering, and the bitrate. This is a real-time system: the screen keeps refreshing every minute, and you see the latest metrics. What is the rebuffering in the last minute across all my users? For example, in this case, there are 88,000-odd viewers on my site at this minute. I'll skip through some of this, because we're probably running out of time. If you see here, rebuffering is high at this point in time. Now, what I want to know is: the quality has gone down, but has it really affected my audience? If you see, there is not much change in the audience trend, which means there was a momentary problem with quality, but thankfully I have not lost my engagement opportunity. And we have a concept of notifications, which can proactively tell you that something is wrong with the system. In this case, for example, rebuffering is high for users in Switzerland, so I want to go and see what in Switzerland is causing my quality to go down.

So, at the end of the day, having built the product and had it running for about 100-plus customers, what we have realized is that analytics needs are unique, so the system needs to be flexible enough. The canned reports we ship are just a start.
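As a small illustration of the startup-time report discussed above, here is a sketch that buckets sessions by how long the video took to start and computes the abandonment rate per bucket. The session data and field names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sessions: startup time in seconds, and whether the viewer
# abandoned before meaningful playback.
sessions = [
    {"startup_s": 1.2, "abandoned": False},
    {"startup_s": 3.8, "abandoned": False},
    {"startup_s": 5.6, "abandoned": True},
    {"startup_s": 7.1, "abandoned": True},
    {"startup_s": 2.0, "abandoned": False},
]

buckets = defaultdict(lambda: [0, 0])  # bucket -> [abandoned, total]
for s in sessions:
    b = min(int(s["startup_s"]), 9)    # one-second buckets, capped at 9+
    buckets[b][1] += 1
    if s["abandoned"]:
        buckets[b][0] += 1

for b in sorted(buckets):
    lost, total = buckets[b]
    print(f"{b}-{b+1}s startup: {100 * lost / total:.0f}% abandonment")
```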
Beyond those, your users can build reports themselves: custom dimensions and metrics can be added, and new reports drawn out of them.

Let me now talk about the technology aspects of it: the analytics platform and its internals, which allow us to realize some of the features that Prasad mentioned in his talk. Considering that we have very little time, I'll just give a very high-level overview of the platform, and we can open up questions after this so that we can have a deeper discussion on the topics the audience is interested in.

The analytics platform basically consists of these components: the data collection layer, the data processing layer, and the data storage layer. This analytics pipeline is highly programmable via user actions at the portal interface. The main difference between the system we have built here and the systems the previous speakers have spoken about is that the needs of our customers, the types of reports they want, are not known ahead of time. We have to let the customers extract whatever metrics they are interested in and then show them in a very intuitive manner. Therefore, we had to build an extremely flexible analytics pipeline.

As Prasad was mentioning, one important piece of this pipeline is the plugin that captures events at the client. This plugin captures the events and beacons them to our highly distributed and highly available data collection servers, which are deployed all over the internet. These servers can also ingest logs from the media servers that sit one hop away from the end client. These logs are then fed into the data processing system, which is horizontally scalable: as the reporting needs of our customers grow, we can throw in more boxes and serve more customers. We have also built an in-house distributed columnar database that can handle highly analytical workloads; I'll be talking about this in a few slides.

I will skip over this slide quickly, because it is nothing but how we collect events from the client devices and beacon them to our servers. The logs go into the processing system, where we process them as a data flow of MapReduce operations. We can write these MapReduce operations either in C++ or in Python, and we have done a bunch of latency optimizations to reduce the latency of these reports and to schedule them faster. We built this on top of our own clusters.

The data storage layer presents the abstraction of data cubes to the portal, which uses them for reporting the numbers. A data cube is nothing but a set of dimensions and a set of metrics. By dimensions, I mean attributes like the asset name, the geography, and time. Time is a key dimension that is present in all our data sets, because we collect time-series data. These data cubes are realized in a distributed columnar DB. The main difference between a columnar DB and a row-oriented relational DB is that in a columnar DB the data is organized on disk along the columns, whereas in a row-wise layout we have the data for one row followed by the data for the next row on the disk. In the case of analytics workloads, our customers are interested in analyzing a handful of columns and a few metrics associated with those columns, even though they might collect data for a lot of dimensions. That is one of the characteristics of the analytics workloads we usually see.
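To make the data-cube and columnar ideas concrete, here is a toy sketch in Python (one of the two languages the pipeline supports, though this snippet is purely illustrative and not the actual code). Each column is stored as its own array, so a roll-up over two dimensions and one metric touches only those three columns; the dimension and metric names are examples, not the real schema.

```python
from collections import defaultdict

# Column-oriented storage: one array per column; row i spans all arrays.
geography = ["IN", "IN", "US", "CH", "US"]
hour      = [10,   10,   11,   11,   11]
plays     = [120,  80,   200,  40,   160]

def roll_up(dims, metric):
    """Group-by over the requested dimension columns, summing the metric."""
    out = defaultdict(int)
    for i in range(len(metric)):
        key = tuple(col[i] for col in dims)
        out[key] += metric[i]
    return dict(out)

# "Plays by geography and hour" reads only the three columns it needs;
# any other dimensions collected for these rows are never touched.
print(roll_up([geography, hour], plays))
# {('IN', 10): 200, ('US', 11): 360, ('CH', 11): 40}
```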
And columnar databases are best suited for that kind of workload. However, writing such columnar data involves updating a lot of indexes at write time, and that is a hard problem to solve. We achieve very high write throughput by distributing this workload across multiple machines. Apart from that, we build bitmap indexes on the data that we write, and we use compressed bitmaps to extract the dimensions of interest at runtime.

However, what we realized when we were building the QoS monitor was that the latency of reports is of prime importance to our customers. The previous system we had built could only accommodate a minimum latency of 15 minutes, which is clearly not something a real-time system can support. Therefore, we had to do certain optimizations to build the real-time system, wherein the data collection layer and the processing layer were coalesced onto a single machine, on the data collectors. The data goes through a quick transformation process and is then fed into the columnar database. In the columnar database, we had to do a bunch of optimizations around in-situ updates, to write the columns much faster. We also went the standard route of sharding the data across nodes, and the other interesting part is that once the data is sharded across nodes, we have to reconstruct the result of a query at runtime. This involves a hierarchical query execution layer, wherein a query that hits the data nodes is spread across multiple nodes by building up a tree at runtime. The evaluation of the data on each machine then happens locally, and the condensed information is bubbled up the tree to finally produce the result. We also used other optimization techniques, such as caching in the query layer, to achieve faster responses.

The system has scaled really well, and these are some numbers about how it has grown over the years. You'll see that around this time last year we were doing about 20 million records per day on the system, but currently we are doing more than 400 million records per day; this is for the audience analytics module. As far as the QoS module is concerned, we were doing about 1.5 million records per day in December, whereas now it is already at 3 million records per day. Another aspect of the system is that we have been tracking more than 460 million unique viewers via the state files that we keep in the system. We are really proud to have supported some very large events on the internet.

These are some of the key learnings that I wanted to highlight. The data model has a very big impact on the disk utilization of the system, and we keep improving the way the data is organized so that our disk utilization always gets better. I'll leave the others on the slide and focus on this one: we built an in-house columnar DB using a key-value store as the underlying data store. What this columnar DB does is organize the data such that each column is stored as a separate value in the key-value store. We also have bitmap indexes built for each of the columns in the data cube. A key aspect of this is that a bitmap index can be highly compressed using run-length encoding, and we can do a number of operations on these bitmap indexes without even uncompressing the data.
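Before coming back to the bitmap indexes, here is a hedged sketch of the hierarchical, scatter-gather style query execution just described: each shard aggregates its slice locally, and partial results are merged pairwise up a tree until the root holds the condensed answer. The shard contents and the merge-by-key scheme are illustrative assumptions, not the actual implementation.

```python
from collections import Counter

def local_aggregate(shard_rows):
    """Runs on each data node: count plays per geography for its shard."""
    return Counter(row["geo"] for row in shard_rows)

def merge(left, right):
    """Runs at each inner tree node: combine two partial aggregates."""
    return left + right

# Four shards of a sharded data cube (toy data).
shards = [
    [{"geo": "IN"}, {"geo": "IN"}, {"geo": "US"}],
    [{"geo": "US"}, {"geo": "CH"}],
    [{"geo": "IN"}, {"geo": "CH"}],
    [{"geo": "US"}],
]

# Leaf level: local evaluation on every data node.
partials = [local_aggregate(s) for s in shards]

# Bubble up: merge pairs level by level until the root holds the answer.
while len(partials) > 1:
    partials = [
        merge(partials[i], partials[i + 1]) if i + 1 < len(partials) else partials[i]
        for i in range(0, len(partials), 2)
    ]

print(partials[0])  # Counter({'IN': 3, 'US': 3, 'CH': 2})
```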
Coming back to the bitmap indexes: for example, if you are interested in figuring out all the rows that came from Bangalore, we have an index which says these are the rows in your DB that have data from Bangalore. We just use a quick AND operation over the bitmaps to figure out the matching rows, and then do a vectorized evaluation over the data.

As for the real-time path: when the client beacons to our data collection servers, the beacon is immediately transformed by another process that is watching these logs. In the previous approach we had taken, this beacon would be written onto the disk and then transferred over the internet to another machine that actually crunched these logs. What we did instead was run two services on the same box: the ingest server, which actually receives the logs, and another server that crunches this data as soon as it arrives on that machine.
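And to close, here is a minimal sketch of the bitmap lookup from the Bangalore example above: one bitmap per dimension value, a bitwise AND to intersect predicates, then a pass over only the matching rows. Plain Python integers stand in for the compressed bitmaps the real system uses; the columns and data are invented.

```python
def make_bitmap(column, value):
    """Bit i is set iff row i of the column equals the value."""
    bm = 0
    for i, v in enumerate(column):
        if v == value:
            bm |= 1 << i
    return bm

city    = ["Bangalore", "Zurich", "Bangalore", "Delhi",   "Bangalore"]
device  = ["android",   "ios",    "ios",       "android", "android"]
play_ms = [42000, 61000, 15000, 9000, 80000]

# Rows from Bangalore AND on Android: one AND over two precomputed bitmaps,
# no scan of the dimension columns at query time.
matches = make_bitmap(city, "Bangalore") & make_bitmap(device, "android")

# Vectorized-style evaluation over only the selected rows.
total = sum(play_ms[i] for i in range(len(play_ms)) if matches >> i & 1)
print(total)  # 122000 (rows 0 and 4)
```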