Hey, so my name is Sanjay, and I won't take much of your time. I lead the analytics team at APG, and I just want to give an overview of what we do. I know there is lunch coming up and we don't have much time, so bear with me for fifteen minutes at most. Thank you to Hasgeek for organizing this, and for the two lovely talks on this area.

To give an overview of what APG does: APG sits in the middle of the system, connecting enterprises with their users and apps. There are a few entities involved in these transactions: developers, ops, the API team that builds the APIs, the API platform business, and the business and IT managers. As these transactions happen, what these entities need is to understand what is happening, and that is where analytics comes into the picture. I will give a very short, very quick overview of what we do and how we do it. We have been building this system over the last one and a half to almost two years; it is not very mature yet, but we will eventually get there. Many of these points will be similar to what Shantanu was saying.

These are the statistics I put together as of now. We have about 200-plus customers, and the analytics platform runs on about 250-plus nodes, all of them in the cloud as of now. When I say customers, it really means the transactions that I showed in the earlier diagram. We are gathering about 200 GB of data each day from all these customers, and that is increasing almost every day; we expect to cross 750 to 800 GB, if not more, by the end of 2014. We also provide a free offering, with about 90% of the same features as the enterprise offering, and there we have almost 10,000-plus customers. The whole system spans five AWS regions: US East, US West, EU, South East APAC, and North East APAC. Many of the enterprise customers have a multi-region deployment of their system, so we have to collect data from all of those places. Our availability goal is 99.99 (four nines); we haven't reached it yet, but we are trying to get there. We also have an on-premise solution: the same system can be deployed on premise, because many of our customers do not want a cloud solution and want to deploy it in their own premises, so we have to cater to that as well.

What are the key goals? The data quality has to be absolutely right, because a lot of this data drives business decisions.
So the goals are dependable data, availability of the system, and scale, which are quite natural requirements in today's world; responsiveness to interactive queries, which I will come back to because it is important; the ability to customize whatever reports you are seeing; and visualizations that are obvious to interpret. As I showed in the diagram, business users are among the biggest consumers of this data, so if you make it complex to interpret, it is not of much use; it has to be very lucid. Smart visualization is definitely one of the goals.

Architectural highlights: the system is multi-tenanted and distributed. Zero data loss is the guarantee we provide; more precisely, we provide at-least-once semantics, which means we should never lose data, though duplicates are possible. It is a completely metadata-driven pipeline: as you can imagine, with 200-plus customers on the cloud sharing a multi-tenanted cluster of about 250 machines, whenever new customers are onboarded the wiring is driven entirely by metadata, which keeps operations simple. Another very important point: we have a fixed message payload, but every customer can extend on top of it. As a customer you can define what extra data you need, which makes it a little challenging in terms of how you report and how you slice and dice the data, because we have no control over what those extended dimensions and metrics are. Offline and near-real-time processing are both goals. And even though we have a UI, the whole thing is driven by REST APIs; the UI is just one of our clients, so users and customers can write their own applications to consume our data.

We have to guarantee that data ingestion happens with seconds of latency; it cannot be minutes. The aggregated data, the pre-computations we do, should be available within a couple of minutes; right now the SLA is roughly a two-minute guarantee, and these pre-computations feed all the dashboards in near real time. As Shantanu was saying, when you get the messages you basically have different pipelines: one for real-time processing and one for batch processing. I will come to the diagram, but typically we deploy Hadoop for batch processing, where we do all the offline analysis, and we are also working on Storm for near-real-time processing of various events. For example, a customer can come and define an event he is interested in. Remember, this is all very flexible, so we do not actually know what the customer is going to define: you can define a rule, you can define an event, and that event has to be detected in the data pipeline in near real time, with alerts raised or decisions taken based on it.

I won't go into much detail on the stack; it is the typical stack followed in probably every data collection and analytics pipeline. We have event collectors, which we also call data pushers, sitting across all five regions and transferring the messages. We have not moved to Kafka yet; we are evaluating it.
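To make the "fixed payload plus customer extensions" idea mentioned above concrete, here is a minimal sketch in Python. The field names (tenant_id, api_name, and so on) are illustrative assumptions, not APG's actual schema; the talk only tells us there is a fixed core record and a free-form key-value extension map for customer-defined dimensions and metrics.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

# Values a customer-defined dimension or metric might take.
Scalar = Union[str, int, float]

@dataclass
class ApiTransactionRecord:
    # Fixed payload: fields every customer's transaction carries (illustrative names).
    tenant_id: str
    api_name: str
    status_code: int
    response_time_ms: float
    timestamp_ms: int
    # Customer-defined extensions: extra dimensions/metrics we cannot know in
    # advance, carried as a free-form key-value map.
    extensions: Dict[str, Scalar] = field(default_factory=dict)

# Example: a retail customer adds two extra fields of their own.
record = ApiTransactionRecord(
    tenant_id="acme-retail",
    api_name="/v1/orders",
    status_code=200,
    response_time_ms=42.5,
    timestamp_ms=1404200000000,
    extensions={"store_region": "south", "cart_value": 1299.0},
)
```

Because the `extensions` map differs per customer, nothing downstream can assume its keys, which is exactly why the reporting and indexing story gets harder.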
Our message bus is built on Apache Qpid. If you know Qpid, it is basically an AMQP broker; Qpid makes up the message bus that we have, and it also provides a pub-sub model. Qpid, and AMQP in general, is extremely fast, but it has fewer features than Kafka provides, so we are evaluating whether to move to Kafka if required.

The ingestion layer is something we have written ourselves; it is a custom build. There are essentially three pipelines: one pipeline goes directly to Hadoop for any offline processing, and another goes to a database, a cluster of Postgres machines, where we compute all the aggregations as the messages arrive and serve the dashboards from there. Storm is something we are evaluating for the real-time use cases. Redshift is something we have adopted; it is an Amazon service, basically an OLAP database in columnar form. The main reason is the flexibility I mentioned: customers can define their own schemas, so you essentially do not know what to build your aggregates for, what customers will query, or what reports they want; you have no control over that, it is totally flexible. One of the things we initially faced is that as the data volume grows, and you do not know what to aggregate or pre-compute, your queries start suffering: neither a regular database nor any NoSQL store will handle those queries at a volume of, say, 50 million records per day. So we tried to figure out what kind of solutions were available, and it turned out that MPP databases, names like Netezza or Vertica, are the kind of databases that handle these use cases. Fortunately, since we were on Amazon, Amazon came up with an equivalent of Vertica and Netezza called Redshift, which is much cheaper and, being a columnar DB, much faster for analytical queries than what we had before. So we adopted Redshift.
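A rough sketch of how an ingestion layer like the one described might fan each incoming message out to the three pipelines (offline Hadoop, the Postgres aggregation store, and the real-time path). The sink functions here are hypothetical stand-ins, not the actual implementation, which the talk does not describe.

```python
from typing import Callable, Dict, List

# A sink is anything that accepts a decoded message. In the real system these
# would be an HDFS writer, a streaming insert into the Postgres cluster, and
# a publish onto the real-time (Storm) path.
Sink = Callable[[Dict], None]

class IngestionLayer:
    def __init__(self, sinks: List[Sink]):
        self.sinks = sinks

    def handle(self, message: Dict) -> None:
        # At-least-once semantics: retries and dead-lettering are omitted in
        # this sketch, but a failure on one sink must not drop the message.
        for sink in self.sinks:
            sink(message)

def to_hadoop(msg: Dict) -> None:
    print("batch  :", msg)   # stand-in for an HDFS append

def to_postgres(msg: Dict) -> None:
    print("stream :", msg)   # stand-in for an index-free raw-table insert

def to_realtime(msg: Dict) -> None:
    print("events :", msg)   # stand-in for publishing to the real-time path

ingest = IngestionLayer([to_hadoop, to_postgres, to_realtime])
ingest.handle({"tenant_id": "acme-retail", "status_code": 200})
```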
So generally this is our typical deployment picture. This path is still being evaluated, and we are coming up with some solutions there. These are the event collectors, which sit across all five regions in Amazon, and the data goes via the message bus (which I haven't actually drawn) to the ingestors. One path goes to Hadoop for any offline analysis, and from there it goes directly to Redshift, where we store it. All of that is a batch pipeline; even the load into Redshift is not real-time, it is batched, so periodically you dump and load data into Redshift. The other path is streaming into Postgres, and so far it has worked excellently for us: you stream the data in, and since we do not have an index on any of the raw data tables, the writes are almost as good as sequential writes. Once we have the data, we periodically, currently about every two to three minutes, compute all the aggregates, and those aggregates feed the dashboards I was talking about. For real-time analytics we are, as I said, evaluating Storm for the event detection mechanisms we want to build.

The offline use cases typically go through this pipeline. Customers can come and define rules: for example, they might want to see the trend of today's traffic in comparison with the average of the last six months, or a moving average, or various other metrics; there can be error metrics, latency metrics, and business metrics in terms of revenue, all of which require offline analysis. Apart from that, we also use this pipeline for our data science activity, for example detecting anomalies or doing principal component analysis or similar things that require machine learning; one of the libraries we are evaluating for this is Mahout, if any of you know about it.

To the question about what data we capture: today we do not capture the payloads of the transactions, we typically capture the headers. If I go back to this picture, as the transactions come in, we have logic that sits here, and this logic can use a lot of services, for example an authentication service, a quota service, a cache service; therefore we are able to identify what has happened in the request and response flow. As soon as that happens, this data is collected along with some of the context data of the transaction, and it flows into the analytics system as a key-value map, over protobuf, where the initial layer extracts it and sends it downstream. Did that answer your question? As for these APIs: if you think about it, this is a proxy, so whatever the backend system is, the logic belongs to the enterprise; we are just exposing the services. When somebody goes through this and talks to our service, that service in turn calls the actual APIs, so we have no restriction on what API transactions happen; it depends entirely on what services the enterprise provides. It can be retail, it can be banking, healthcare, anything.
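Going back to the "compute the aggregates every two to three minutes" step: here is a rough sketch of what such a periodic roll-up job could look like against the Postgres raw store, using psycopg2. The table and column names, and the aggregation SQL itself, are assumptions for illustration; the real schemas are not described in the talk.

```python
import time
import psycopg2

# Hypothetical roll-up: raw_transactions -> per-minute aggregates that feed
# the near-real-time dashboards. A real job would track the last processed
# window (or upsert) to avoid double-counting overlapping runs.
AGGREGATE_SQL = """
INSERT INTO traffic_minute_agg (tenant_id, minute, request_count, avg_latency_ms)
SELECT tenant_id,
       date_trunc('minute', event_time) AS minute,
       count(*),
       avg(response_time_ms)
FROM   raw_transactions
WHERE  event_time >= now() - interval '3 minutes'
GROUP BY tenant_id, date_trunc('minute', event_time);
"""

def run_aggregation_loop(dsn: str, period_seconds: int = 180) -> None:
    """Periodically roll up the index-free raw table into the aggregate table."""
    while True:
        with psycopg2.connect(dsn) as conn:      # commits on clean exit
            with conn.cursor() as cur:
                cur.execute(AGGREGATE_SQL)
        time.sleep(period_seconds)

# run_aggregation_loop("dbname=analytics user=etl host=pg-agg-1")
```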
Coming back to the offline processing we do: on the Redshift pipeline we periodically upload data to Redshift, and since Redshift is columnar, the IO is much lower for all analytical queries. We have something called custom reports. Because, as I said, the schema of the data a customer's API transactions send can be customized, it varies from customer to customer. Say we have a default set of fields; another customer can come and say "I want ten more fields", and those ten can be different from yet another customer's ten. So it is very difficult, in fact impossible, for us to know in advance what kind of queries or reports a customer will want. Therefore there is a use case where a customer can look at his data set and define a set of reports he is interested in, and those reports can contain data from the extended column set, the extended schema, that he has. That is also why it is not possible to index, because we do not know what to index; that is the reason we send the data to the columnar database and query it interactively when the reports are requested.

Then there are the rules customers can define on their systems. As I said, these rules are analyses of maybe the revenue or the operational details, comparing, say, the last six months with the current period, and whatever rule the customer defines, we evaluate it and notify the customer. One example might be backend latency: the backend system is owned by the customer, not by us, so if the latency of the backend has suddenly spiked compared to the average of last week, and the customer has defined such a rule, we process this data and send notifications. This particular path is not real-time; the other pipeline is the real-time one, which looks at a very small window of data. So these are some of the reports driven by the batch or offline system. It is interactive, but I am calling it offline because these reports cannot be fetched in real time or near real time; there is a lag, because it works from the raw data, so there is a lag in uploading the data to Redshift and then getting it out of the columnar database.

For the near-real-time use cases, we have a number of out-of-the-box dashboards, all handled by the pre-computations we do every two to three minutes, so the latency to get the dashboard is about three minutes. Event detection and alerting is something we do on the streaming model. We are also thinking of extending this. I showed the diagram in which APG sits in between and then you have the analytics system; one of the use cases we are thinking about is how the insights in this data can feed back into the API platform and control it. For example, there is a caching service there: what if you identify the bottleneck, what is happening, and tune the cache, or tune a throttling mechanism? That would probably have to be on the real-time platform. Or, for example, could you actually increase the number of machines, auto-scaling up based on data the analytics system gets in real time, evaluating maybe the last five minutes of data and scaling a set of servers immediately? Those are some of the use cases we are trying to handle in near real time.
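The backend-latency rule described above could be evaluated along these lines. This is a hedged sketch with assumed table names, a hard-coded spike factor, and a placeholder notification call; the talk does not describe the actual rules engine.

```python
import psycopg2

# Compare the last few minutes of backend latency against the last week's
# average (table and column names are assumptions for this example).
SPIKE_SQL = """
SELECT recent.avg_latency_ms, baseline.avg_latency_ms
FROM  (SELECT avg(response_time_ms) AS avg_latency_ms
       FROM   raw_transactions
       WHERE  tenant_id = %(tenant)s
         AND  event_time >= now() - interval '5 minutes') AS recent,
      (SELECT avg(response_time_ms) AS avg_latency_ms
       FROM   raw_transactions
       WHERE  tenant_id = %(tenant)s
         AND  event_time >= now() - interval '7 days') AS baseline;
"""

def check_backend_latency(dsn: str, tenant: str, factor: float = 2.0) -> None:
    # Notify the customer if recent latency exceeds the rule's factor times
    # the weekly baseline.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SPIKE_SQL, {"tenant": tenant})
        recent, weekly = cur.fetchone()
        if recent and weekly and recent > factor * weekly:
            notify(tenant, f"Backend latency {recent:.0f} ms is over "
                           f"{factor}x the weekly average of {weekly:.0f} ms")

def notify(tenant: str, message: str) -> None:
    print(f"[alert:{tenant}] {message}")  # placeholder for the real notifier
```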
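For the feedback-loop idea just mentioned (auto-scaling based on the last five minutes of traffic), a minimal sketch might look like the following. The talk only presents this as something being considered, not something built; the metric source, sizing heuristic, and Auto Scaling group name are all assumptions.

```python
import boto3

def desired_capacity(tps_last_5_min: float, tps_per_node: float = 400.0,
                     min_nodes: int = 2, max_nodes: int = 20) -> int:
    # Size the fleet from recent traffic and clamp to sane bounds
    # (the per-node capacity figure is made up for this sketch).
    nodes = int(tps_last_5_min / tps_per_node) + 1
    return max(min_nodes, min(max_nodes, nodes))

def autoscale_from_analytics(tps_last_5_min: float,
                             group_name: str = "apg-gateway-asg") -> None:
    target = desired_capacity(tps_last_5_min)
    autoscaling = boto3.client("autoscaling")
    # set_desired_capacity is a standard EC2 Auto Scaling API call;
    # the group name here is hypothetical.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=target,
        HonorCooldown=True,
    )

# autoscale_from_analytics(tps_last_5_min=1800.0)
```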
As of now, these are the dashboards, all fed by the near-real-time pipeline (I would not call it real-time), which give you all the operational details. For example, this is a set of graphs about the enterprise's backends, their errors and latencies, and how the backend performs; it is very important for them to make sure their system is up so that all the transactions that happen end up giving a proper result, and this dashboard shows those details. This is a normal dashboard which shows the general trend of traffic and how many developers are building apps; for any retailer, maybe Amazon or Flipkart, or maybe a banking application, there will be lots of developers writing apps, so this shows which apps and which developers' apps are transacting over a given period of time. All of these details arrive in near real time, within two to three minutes, and these are the operational details the enterprise customers are interested in.

I didn't go much into the architecture because that was not my goal; I just wanted to give an idea of what we do at APG. We have been doing this for all the customers getting onboarded with APG, and over the last years the number of customers has ramped up; as you can see, we have grown into five different regions, and in total I think we have recently reached close to 2000 TPS over the whole system, which amounts to about 170 million transactions a day, and all of that data flows into this system. So that's all I had; I just wanted to give an overview of what we do.

On PCA: that is still being worked out; we are trying to evaluate what works. The idea is basically that if a customer has, say, 100 dimensions, or 100 dimensions and metrics, which is the set of dimensions that most impacts him? That is where PCA comes in.

On the multi-region setup: as Shantanu was saying, right from the messaging system we send the data to the other cluster; we do not write it to the store and then copy. So basically there are two parallel active-active setups, each doing its own analysis, and if a disaster happens or a data center goes down, we switch. Yes, it does; there are lots of variations of that. Yes, there are definitely issues; we can talk about the details offline, and there are experts in these areas I can connect you with. Any more questions? Okay, thanks a lot.