Hello hello, check one two three four. Hello, anyone ready to listen to a story? So the mic is fine. Hi, good morning. Hello, check. Adam: bootstrapping a deep-NN-based sequence labelling model with minimal labelling. Solving the vehicle routing problem for optimising shipment delivery. Check, check. Okay, I will leave once I clear it. Hello, hey Laila. Hello, check. Hello, I am Shashank Jaiswal and I am here presenting Adam for Fifth Elephant. Clear? Can you speak now? A very good morning to everyone present here. I am Harshita, your hall manager for today. Today in auditorium two we have a talk, which is followed by a tutorial, and then we have a break. Later we have a talk again, which will be followed by a tutorial. Post lunch we have a tutorial, after which we have BOF sessions and flash talks. We have a few announcements. Kindly turn off your phones or put them on silent mode, because we do not want phones ringing and disturbing participants or speakers. We have Wi-Fi available under the name Hasgeek, and the password is G-E-E-K-S-R-U-S. There is a food court next to auditorium one, and we have flash talks in auditorium two, here, from 4.55 pm to 5.30 pm, with seven slots of five minutes each. The flash talks are presentations without laptops; they are given by the participants, so laptops are not allowed, except if you have a demo to show, and no hiring pitches or product pitches during the flash talks. If you are ready to give a flash talk, write your flash talk topic with your name and phone number on a slip of paper and hand it over to the hall manager, that is me. We also have a Redis AI tutorial in auditorium two starting at 1.55 pm, so make sure you have installed the software required for attending the tutorial. And we have BOF sessions running in auditorium one and auditorium two; please check the schedule given to you during registration. And that's pretty much it. So now we have the first talk of the day, by Shashank Jaiswal, on Adam: bootstrapping a deep-NN-based sequence labelling model with minimal labelling. Thank you. Hey everyone. I am Shashank Jaiswal, a data scientist at Clustr, and I'll be presenting Adam here at Fifth Elephant. Adam stands for Attribute Detection and Annotation Module, which is just a fancy name for named entity recognition. We pitched this name because we also had an Eve project coming up at our company, so you can understand why. Interestingly, we are also the first talk here at Fifth Elephant; coincidence, maybe. So, as I said earlier, Adam is an advanced named entity recognition module for domain-specific data, designed specifically to deal with the absence of labelled data. It leverages concepts from weak labelling, deep sequence models and active learning.
So to understand what Adam does and how it is built, let's first understand what named entity recognition is. In layman's terms, it is just picking out the important stuff from a sentence, like the name of a person, a position or an organisation, and it can mean something quite different in different domains. For example, in hospital records, symptoms, diagnoses or diseases can form named entities. Now imagine you are trying to develop an algorithm which can predict a diagnosis given some symptoms; you can imagine the importance of a tool like Adam in that domain. Similarly, product descriptions play a very important role in e-commerce and similar businesses, because they are loaded with attributes which are used for enhanced search and semantic capabilities in the product catalogs that you see: attributes like the brand name, the category of the product, and various other specs, nutrients for example in this case. So what will Adam do? Adam is, as I said earlier, an advanced named entity recognition module, and it will do exactly what the earlier examples showed, but for product titles like this one. I picked this example because you might be a little more interested in it. We at Clustr are associated with Tally, which, as many of you might know, is an accounting software. We get our data from millions of businesses, mostly small and medium enterprises in India and select other countries. One of the things Tally stores is the inventory required for bookkeeping, and each unit of data in that inventory is a stock item. So what will Adam do? You just pass these stock items to Adam and it will resolve them. We use the standard BIO tagging scheme to extract attributes from stock items. You can see that the tokens Resolute and Black are tagged as B-B and I-B, which means beginning and inside of the brand; the word Vodka is tagged as B-C, which means beginning of a category; and, though you cannot see it here, 180 ml was tagged as O, which means other, a not-relevant attribute. That is the whole objective of Adam; this is what Adam will do. Now, why is a module like named entity recognition important in the product titles domain, in a business domain? There are several use cases. One is a universal product catalog. You often see product catalogs on websites like Amazon or Flipkart, or at specific retailers like BigBasket. These catalogs have enhanced capabilities where you can filter your search on attributes: you can filter by the name of the brand, or by other specs, like whether you want your earphones to be wired or wireless. So you can understand the importance of these attributes in a catalog. Now imagine if you have a way to extract these attributes from the product title itself; you can actually automate the whole enhanced semantic search present in these catalogs. Similarly, aggregation and market analysis: lift insights, demand insights, popularity of a brand in a given location, penetration of a brand in a given market, competition for a certain category of product, price distribution, and so on. All these insights can be heavily impacted by a module like Adam.
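To make the BIO scheme from the stock-item example concrete, here is that same title written out as token and tag pairs; a purely illustrative Python sketch, with tag names following the talk (B-B/I-B for brand, B-C for category, O for everything else):

    # "Resolute Black Vodka 180 ml" in the BIO scheme described above.
    stock_item = ["Resolute", "Black", "Vodka", "180", "ml"]
    bio_tags   = ["B-B",      "I-B",   "B-C",   "O",   "O"]

    for token, tag in zip(stock_item, bio_tags):
        print(f"{token:10s} -> {tag}")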
The third use case is knowledge graphs. Knowledge graphs are formed using the named entities present in product titles or some other data source, and the relationships between them. All the nodes in the knowledge graph are formed from named entities, and the edges between those nodes represent some kind of relationship. So if you can extract these named entities from the various data sets you have, you can have a breathing, evolving knowledge graph which consumes the data automatically and grows. Clearly we were not the first to think of this; there have been many previous approaches. Traditionally, people have used the CRF algorithm with handcrafted features, or have used knowledge bases like DBpedia or Wikipedia, to do exactly the same thing. For example, one paper from Walmart Labs, by Ajinkya More, tried to extract attributes, specifically brand, from the product titles in Walmart's catalog. Here is an example of the handcrafted features they used: the word itself, the lemmatisation of that word, the neighbouring words, whether the word is alphanumeric or not, and other features like these, which were used to extract attributes from product titles. And you can see it actually worked really well for them; this is one of the examples, and they were able to extract the brand pretty nicely. But when we tried it on our data set, it was a disaster. The reason is that our data is extremely variable: it is acquired from millions of sources, and there is no check on how a stock item or a product title is written. So it couldn't perform very well on our data set, although it worked really well for very famous products like Maggi and so on. Then there are the off-the-shelf tools, the very famous ones being Stanford's and NLTK, which are just plug and play: you can directly download the resources and start playing with them. But the problem is they were trained on natural language, language with proper grammar in it. For example, one of the sentences here is "Barack Obama is the next president". It's a pretty old sentence, but it will do, and you can see it extracted the name of the person and the organisation pretty well. But these attributes were not what we needed; they are quite irrelevant to us. And you can see in the next example it did really poorly: everything was tagged as a noun and it couldn't differentiate between anything. That's why we couldn't use any of the off-the-shelf tools. Then comes the state of the art, the paper by Lample and co-authors, and many other names I couldn't pronounce. These people have worked on fairly complicated neural network architectures, including Bi-LSTM, CNN and CRF, to produce accuracies for such tasks close to 90 to 95%, which is amazing in this field; deep learning has taken a big leap over traditional algorithms. But the problem is they either require a large labelled data set, like the paper by Lample and co-authors, which showed they needed around a million sentences to reach convergence, and that was not the case with us, nor is it the case with most industrial data out there, specifically in the textual area; or the people who claim they have done it with a very small amount of data are very domain-specific.
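Going back to the handcrafted-features approach for a moment, here is a rough sketch of the kind of per-token feature function such a traditional CRF pipeline (for example sklearn-crfsuite) would consume; the feature names are hypothetical, and the lemma here is just a crude stand-in for a real lemmatiser:

    def token_features(tokens, i):
        """Features for the i-th token: the word, a crude lemma, neighbours, alphanumeric flag."""
        word = tokens[i]
        return {
            "word": word.lower(),
            "lemma": word.lower().rstrip("s"),   # stand-in for a real lemmatiser
            "is_alnum": word.isalnum(),
            "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
            "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        }

    tokens = "Maggi Masala Noodles 70 g".split()
    features = [token_features(tokens, i) for i in range(len(tokens))]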
Those works target very narrow segments. For example, the OpenTag paper showed results only on a specific segment like dog food, and you can understand how narrow that segment is. The problem is we didn't have that information; our data is not already segmented so that we can work on it. So we had to think of something else. There are also some additional challenges that were not taken care of in any previous paper. As I said earlier, we get our data from multiple sources, so there is a lot of variance in our data set. For example, this is a properly written stock item which is very easy to understand. This one, not so much: the abbreviation here actually stands for Nivea, and talc is short for talcum. But okay, it's still workable. This one is horrible. It's not the BMW you are all thinking of; it actually stands for mineral water, which we could understand only because we found out which retailer holds this stock item. That's how we got to understand it. And then there are other problems: the data set is extremely noisy. The spelling of Sundrop and Heart here is terrible. These are the problems we had to face, and that's why no traditional approach could scale, and all the state-of-the-art approaches required a large amount of labelled data that we didn't have. And please note that BMW is written in red because we don't claim we can solve that one today. So here is what we need. We need a model that can predict tags relevant to our domain; for this session we will focus on only two attributes, brand and category, so we need something that can extract those. We need a model that can scale across various segments; we cannot have a model which is very good in one segment and terrible in another. Then, we are not claiming we will show some crazy unsupervised algorithm that can predict attributes without supervised learning. So yes, the fact is we require some labelled data to work with, but what we are claiming is that we will reach that maturity with the minimum amount of labelled data, and there is a significant gap between the amount of data we required to scale and the amount of data reported by many research papers. We also need an approach that can scale on streaming data. That is a specific case for Tally: we will not get all our data at once, we have a streaming batch system, so we needed an approach which can scale every time a new batch arrives. It shouldn't happen that I need to take a big dump of samples and get them manually labelled each time, which would be really costly. And the last point is pretty obvious: it is needed, obviously. Now let's take a deeper look at the architecture and the training process of Adam. I have presented Adam as a data science problem so far; now let me present Adam as a machine learning problem. You provide a sequence of input to Adam. Please focus on the word sequence; that will be helpful in later slides. You just pass the sequence to Adam and it will predict the tags corresponding to the individual tokens. This is pretty obvious, I think you can understand that going forward.
So here is the approach we took to train and design Adam. It is divided into three crucial parts. The first is weak label generation. We had nothing to start with, literally zero labelled data; all the data we had was unlabelled, but we had a small knowledge base. The knowledge base consists of dictionaries of attributes: we had some amount of brands, a few categories, and other attributes like flavours and colours. So what we did is take a lot of stock items and use a string matching algorithm against these dictionaries to automatically label them, and they then had to go through a quality check, after which we received something like this. The quality check was necessary because this approach produces a lot of ambiguity. For example, if Apple is present as a token, it can be tagged as both brand and category. There were many such ambiguities that this approach couldn't resolve, which is why there is a quality check in between which removes any result with such ambiguities. We passed around one million stock items through this and only 8,000 were able to pass the quality check; you can understand how weak our knowledge base was to get this result. So we started with around 8,000 weakly labelled data points to move forward with. Now let's talk about what goes inside Adam, its architecture. It comprises, again, three crucial layers: the first is the word embedding layer, the second is the Bi-LSTM layer, and the third is the CRF layer. I will explain these layers briefly in the next few slides, and this is what Adam is supposed to do. First, the word embedding layer. For people who are not familiar with the concept, a word embedding layer basically projects all the words in your vocabulary into an n-dimensional space, where each word is assigned a specific coordinate. The relative spacing and positioning between these words holds some semantic information in this n-dimensional space, and that is what the word embedding layer tries to preserve. There is a very famous set of pre-trained word embeddings called GloVe, from Stanford, which was trained on millions of sentences from books, news articles and whatever other data they could get. But we couldn't use that massive pre-trained embedding, the problem being that the domain we were working on is very different from the domain it was trained on, so most of the words in our system were out of vocabulary for it. Hence we had to create our own. The data set we used was all the stock items we had in our first iteration; we added product titles from publicly available catalogs like Amazon, which actually provides all its catalog items so you can play with them; and then we partnered with data providers like GS1. I won't dig deep into what GS1 is; just understand that they provided us with two or three million stock items from a specific segment, FMCG to be precise. From all that we gathered around 13 million titles, and then we used the skip-gram algorithm. Thankfully that algorithm is unsupervised, so we could use all the data points we had in an unlabelled manner. I won't dig deep into the skip-gram algorithm here.
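Going back to the weak-label generation step for a moment, here is a minimal sketch of the idea, assuming the knowledge base is just a set of attribute dictionaries. The real system used a more involved string-matching algorithm and full multi-token BIO spans, so treat the names and the single-token matching here as purely illustrative:

    KNOWLEDGE_BASE = {
        "B": {"resolute", "maggi", "patanjali"},   # brand dictionary
        "C": {"vodka", "noodles", "shampoo"},      # category dictionary
    }

    def weak_label(title):
        """Tag each token by dictionary lookup; reject titles with ambiguous tokens (quality check)."""
        tagged = []
        for token in title.split():
            matches = [attr for attr, words in KNOWLEDGE_BASE.items() if token.lower() in words]
            if len(matches) > 1:        # e.g. "apple" present as both brand and category
                return None             # fails the quality check, discard the whole title
            tagged.append((token, "B-" + matches[0] if matches else "O"))
        return tagged

    print(weak_label("Resolute Black Vodka 180 ml"))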
I have cited the skip-gram paper, and it is present in the links of the proposal, so you can go through it; it's a great read. Here is one interesting insight. I tried to project the 128 dimensions into two dimensions so that you can perceive it, but clearly we can't, because it's gibberish. But on the left side you can see that if you take the vector for the word Samsung and look at its nearest neighbours, they are Galaxy, S3, S5, S6, which most of you will know are variants of Samsung mobile phones. So you can see that the positions of these word embeddings do hold some semantic information, and we tried to leverage that. Second comes the Bi-LSTM. The Bi-LSTM is another word-level encoder, one which leverages sequential information to encode the tokens of the sentence. As I said earlier, you can imagine the tokens arriving in a sequential manner, where the first token arrives at time step T1, the second token at time step T2, and so on. At any time step T, the input is not only the embedding of the token at time T; some context from the past is also passed along at every time step, as you can see in the diagram. This is how it tries to predict an output based on the previous context that was passed along. Now, why was another word-level encoder necessary after the word embedding? You can see two sentences here; both are formed from the same words, but only one of them makes sense. If you look at the second one, it is properly written: the brand comes first, then the category, then the unit of measurement and the measurement itself, et cetera. But the first sentence is just a jumbled version of the second, and it makes much less sense. This is the information an LSTM can hold which the word embedding layer alone could not, and that's why the Bi-LSTM layer was used. Then comes the third layer, the conditional random field. It's the old traditional CRF algorithm, but now fitted on top of two heavy neural network layers. This is also a sequential algorithm with concepts similar to the Bi-LSTM layer, but it has one more advantage, and that's why we used it: it is the only layer that leverages the output label context. The sequence of output tags should also matter; the Bi-LSTM layer was not taking care of that, but this layer does. For example, an O tag should never be succeeded by an I tag; there has to be a B tag in between. This is how the CRF exploits that information: it tries to learn a transition matrix. All this transition matrix does is learn the cost of transitioning from one tag to another. For example, row zero, column one holds the transition cost from the B-B tag to the I-B tag, and similarly for everything else. Now, some of these costs should be really low; you would often see an I-B tag preceded by a B-B tag. But it should essentially never happen that an O tag is succeeded by an I-B tag; there would be a B-B tag in between, as I said earlier. So some of these values, some of these costs, would be really high.
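Putting the first two layers together, here is a minimal PyTorch sketch of the embedding-plus-Bi-LSTM encoder (PyTorch is mentioned later in the Q&A; the layer sizes and names are assumptions). It produces the per-token emission scores over which the CRF and its transition matrix then operate:

    import torch.nn as nn

    class BiLSTMEmitter(nn.Module):
        """Word embeddings + Bi-LSTM encoder producing per-token tag (emission) scores."""
        def __init__(self, vocab_size, num_tags, emb_dim=128, hidden_dim=100, pretrained=None):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            if pretrained is not None:                       # e.g. frozen skip-gram vectors
                self.embedding.weight.data.copy_(pretrained)
                self.embedding.weight.requires_grad = False
            self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.to_tags = nn.Linear(2 * hidden_dim, num_tags)

        def forward(self, token_ids):                        # (batch, seq_len) token indices
            emb = self.embedding(token_ids)                  # (batch, seq_len, emb_dim)
            encoded, _ = self.bilstm(emb)                    # (batch, seq_len, 2 * hidden_dim)
            return self.to_tags(encoded)                     # (batch, seq_len, num_tags) scores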
On the other side, transition costs like B-B followed by I-B would be really low, and this is how the CRF gives importance to the output label context. Then comes the loss function. I understand that for people who are not from a mathematics background this equation can seem really scary; trust me, for people who are from a mathematics background and don't know about this, it can be scary too. So instead of jumping straight into it, I'll first explain it with an example. At the bottom you can see a sequential input, and in the first column are all the tags that are possible. Each column holds the probability of a certain token being a certain tag: for example, the probability of the token Maggi being B-B is 0.80, and the probability of noodles being B-C is 0.71. The scores above the arrows are the probabilities of the transitions: for example, the probability of transitioning from the B-B tag to the B-C tag is 0.7. This is how you calculate the total energy required for a given stock item to attain a particular sequence of tags: it is just the summation of the logs of all these probability values. So for a particular sequence you get a particular energy score. Ideally, what should happen for the correct tag sequence? The magnitude of the energy should be really low, really close to zero. And this is the basic loss function of the module; this is what the model tries to minimise for every correct sequence, and this is basically what the loss function captures. The energy equation you saw is these summation terms; the function you see there is the one that generates the emission scores, basically the probability scores, and the transition scores; i is just the i-th time step, and lambda is a trainable parameter. Z(x) is a normalisation term: for any given input sequence there can be (number of tags) to the power n possible output sequences, and because the energy depends on the length of the sequence we need a way to normalise it, hence the normalisation value, the summation over all these possible sequences. This is a very standard process for any sequential algorithm; it is how you compute the likelihood, and that's what we used. Now here are some of the results. These are the results with only the Bi-LSTM as the final layer, without the CRF layer, and you can see it still predicts okay output. For example, in the first one it's not wrong to say that ghee is the category and the other token is the brand, but when we use the CRF layer the smoothness increases: cow ghee is the more appropriate category here, and similarly Patanjali Kesh Kanti comes out as the complete brand sequence, and so on. That's why we used the CRF layer at the end. Now, people who have worked in this domain can argue that if we had used a thicker Bi-LSTM layer, it would have given a similar result as with the CRF. I wouldn't argue with them; they are absolutely correct about it. But the problem is that if you use a thicker layer, you need more training data to reach the same result as with the CRF.
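For reference, the objective being described is the standard linear-chain CRF likelihood. Reconstructed from the explanation above (the exact parameterisation on the slide may differ), with emission scores from the Bi-LSTM and learned transition scores, it looks roughly like:

    \text{score}(x, y) = \sum_{i=1}^{n} \log P_{\text{emit}}(y_i \mid x_i) + \sum_{i=1}^{n-1} \log P_{\text{trans}}(y_{i+1} \mid y_i)

    \mathcal{L} = -\log p(y \mid x) = -\big(\text{score}(x, y) - \log Z(x)\big), \qquad Z(x) = \sum_{y'} \exp\big(\text{score}(x, y')\big)

where the sum in Z(x) runs over all (number of tags) to the power n possible tag sequences for the n tokens, which is exactly the normalisation term mentioned above.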
A thicker Bi-LSTM, though, would defeat the whole point I'm trying to make about minimal labelling cost. Hence we went with the CRF. After doing all of this, here are the baseline results. The accuracy of predicting brand was 32%, for category it was 36%, and for a complete sequence match it was 22%, which, I won't shy away from saying, is absolutely shit. Obviously we couldn't present that anywhere, but it gave us something to start with; it gave us something to start the whole active learning process. One limitation of our automated weak label generation process is that it is constrained by the completeness and quality of the knowledge base we have, and we had a poor one. From it we were able to get only 8,000 weakly labelled data points, and we estimated that the set required to achieve convergence was at least one to two lakh, or even more, for only the first iteration of data points we had. And not only that, we needed an architecture that can iteratively scale for streaming data, as I said earlier; an architecture that can scale every time new data comes in. So our labelling cost had to be minimised to achieve scalability. Here is what we did. This is what we were left with after the first two steps: a small amount of labelled data from weak label generation, an abundant amount of unlabelled data, and a deep neural network model trained on the former. Using all three, we developed an extrinsic sampling algorithm which samples a small number of data points from the unlabelled set, such that the sampled data is semantically very different from the labelled data we already had. When we manually label those points and use them to retrain the model, the increase in the model's accuracy is much better, because the new samples are not redundant with the labels already present. That is the aim of active learning: you choose your samples consciously in every iteration so that with the minimum amount of new labelled data you achieve large accuracy jumps. We did this for a few iterations, until the model reaches a certain maturity; that can be a threshold percentage, or you can simply see that after some time there isn't much growth in accuracy. Now, this extrinsic sampling block is something novel that we tried for the first time; it hasn't been tried before. So what did we do? We had a list of all the data points. The first aim was to project them into an n-dimensional space again, so we needed some product-title-to-embedding conversion, some algorithm to do that. For that, we used the Bi-LSTM layer from the architecture itself to project each product title into an n-dimensional space. Here, every black dot you see is one of the data points, and the coloured ones are the points which are already labelled. This is how we projected them into an n-dimensional space, 200 dimensions to be precise. The next step was a simple k-means clustering on this. Now, I understand people here will think of the curse of dimensionality: if you are doing k-means clustering in 200 dimensions, you can get poor clusters, because it's very hard to cluster in so many dimensions.
The thing is, we were not looking for very precise clusters; we were looking for loose clusters, because we just had to pick up samples from each cluster to get them manually labelled. This was an auxiliary model helping our deep sequence model. So that's what we did: a simple k-means clustering, which gave us a few clusters. The next step is simple: pick a few samples from each cluster, aggregating to a certain number, for example 1000 per iteration. The rule for picking was very straightforward: the number of samples taken from a cluster was proportional to the population of data points in that cluster and inversely proportional to the number of data points already labelled in it. So, for example, from the last cluster you would pick most of the data points to train your model. And this is the complete Adam training pipeline: the first block describes weak labelling, the second block describes the neural network model we were talking about, and the third block describes active learning. This whole process was iterated for at least four to five iterations, and then we had some fruitful results, which you can see here. In the results you can see that most of the products are very different from each other. The first one is a fruit jelly, which is a kind of candy. The second one, I don't know what that is, I think it's from pharmaceuticals. The third and fourth are from the cosmetics department, apparently. There were many unpopular products here, and you can see that the model could actually scale across that one enormous FMCG segment we were working on. And here are some not-so-positive results. We don't claim we reached 95% or 100% accuracy, because trust me, our data is horrible, and there were some very uncommon sequences present which the model couldn't scale to; it was very hard for the model to predict on them. Now here are some metrics; I can better explain them with the graph. The orange line is the improvement in complete sequence matching accuracy: it started from only 0.22 and reached around 0.77, and I haven't included the results from the last iteration, in which it actually reached 82%. The final accuracy for a complete sequence match was 82%. The individual accuracies for brand and category started from 0.40 and 0.39 respectively and reached 89% and 86%. These were the scores we were able to reach with only four iterations, and at every iteration we took only 1000 new samples. So with the 8,000 earlier weak labels and only 4,000 new samples, we were able to achieve these massive jumps in accuracy. The last line you see is the baseline traditional CRF model; we put it there to compare with our model, and you can see that it couldn't scale with the variance we had. So this is evidence that deep sequence models can actually generalise much better, because in the first layer, the embedding layer, you don't create any handcrafted features; the features are extracted from the context in which the words appear. That's why it was able to scale across so many different products. So here is our contribution. We proposed a novel entity extraction model for domain-specific data.
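For concreteness, here is a minimal sketch of the cluster-based extrinsic sampling rule from the pipeline above, assuming a 200-dimensional Bi-LSTM embedding per stock item is already available and that `is_labelled` is a boolean array; the function name, the number of clusters and the budget are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    def sample_for_labelling(embeddings, is_labelled, n_clusters=50, budget=1000, seed=0):
        """Pick `budget` unlabelled points: more from big clusters, fewer from well-labelled ones."""
        rng = np.random.default_rng(seed)
        clusters = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
        weights = np.array([
            (clusters == c).sum() / (1.0 + (is_labelled & (clusters == c)).sum())
            for c in range(n_clusters)
        ])
        quota = np.round(budget * weights / weights.sum()).astype(int)
        picks = []
        for c in range(n_clusters):
            pool = np.where((clusters == c) & (~is_labelled))[0]
            picks.extend(rng.choice(pool, size=min(quota[c], len(pool)), replace=False))
        return picks   # indices of stock items to send for manual labelling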
Entity extraction itself is nothing that hasn't been done before, but doing it to this extent, with so much variance in the data, was a first. Our model is built upon state-of-the-art deep learning and CRF models, and it further leverages weak labelling and active learning; there were three crucial components that came together to achieve these results. And the extrinsic sampling technique described earlier is novel; this was the first time something like it was tried. Those were our main contributions. Here are the key takeaways. Deep neural networks are the buzz right now, you see them everywhere; it's almost a solved problem for autonomous driving, there are products out there. But it still hasn't quite made the cut for NLP, because of the problem with labelled data: the data we have at industry grade is so much poorer compared to the data people experiment on in labs and for research. So that's what we tried to show here: how you can build an industrial-grade information extraction module. That is our key takeaway. Another takeaway is that active learning is a fairly new area of research, and I believe it can be a very important concept in the industrial domain and in real-world scenarios, because, trust me, the data you see in the real world is far worse than what you see in research labs. Now here is some future scope for Adam. Currently we are working on extracting higher-order attributes. For example, in the first example, currently Adam would pick Good knight Advanced as the brand name, but at a higher level we know that Good knight is the base brand and Advanced is one of its variants; that is what we are trying to predict, to further break down the extracted information. Similarly for categories: in the second example, you can see that cookies is the base category, and the rest are flavours or ingredients, I don't know. And then we are trying to scale across different segments like FMCD; we are currently working on the FMCG department and will apply a similar approach there too, and then there are other segments like automobiles, hardware, maybe pharma. These are the papers and inspirations we drew from, and I would specially thank my teammates Abhishek, Deepak and Ashish for helping me achieve these results; Ashish and Deepak were great mentors, and Abhishek was a nice teammate to have. I would also like to thank the DataOps team, who did all the manual labelling work for us, so a big thanks to them. That's it, thank you. Any questions? Yeah, yeah. spaCy also has an NER model. Sure. Our benchmark is spaCy, and what I've found is that they have moved from Bi-LSTM to convolutional layers. So have you tried convolutional layers? Yeah, I understand your point, and there is one experiment going on with a convolutional layer. We are not using spaCy exactly, we are using PyTorch, but the algorithm is the same, and the results were very similar, so we didn't move to a CNN layer. See, the problem is that the CNN layer, again, cannot be trained in an unsupervised way.
The CNN layer there starts right at the character level, so the problem that occurs when you have very little labelled data is that it will massively overfit to that domain, and you cannot move forward with it. That's why our first layer, the embedding layer I was talking about, was trained in an unsupervised way and was a frozen layer when we trained the architecture. With a CNN, if you use a complicated structure in the first layer, again you need a large amount of labelled data, and it would massively overfit, as I was saying. That's why we didn't go with it, although we are working with a CNN layer for character embeddings as well, with similar algorithms; this simpler approach was the one which actually scaled better compared to them. I hope that answers your question. Anyone else? Yeah. For the whole active learning system you have developed, is there any framework for doing this, or did you build it yourself? We actually built it ourselves. We needed one framework, for the labelling part, and we created that in-house, but there are tools like SageMaker from Amazon, and it costs a lot. When we started this it was an experiment, we didn't know whether it would succeed, so we started with an in-house approach. SageMaker does have an active learning module: they have an annotation module where you can go and manually label, you can fit your algorithm, change the sampling algorithm there, fit it into their framework, fit your whole data, and it can read and write data sets from S3. So yes, you can customise that module for active learning. Another thing about that... excuse me, someone else has a question. I have a different question actually. No problem, we'll discuss it. It is about the k-means module; actually, I'm also working on active learning. Sure, we can discuss it, no problem. Anyone else? Yeah. It was really a very nice talk. Thank you. I specifically liked the active learning part of it. Just one question, maybe I missed it at the start: the accuracies that you were calculating, were those with respect to the weak labels that you created, or with the actual labels, the 8K samples? No, no. We had a separate hold-out set on which we calculated the accuracy; it was manually labelled, and we calculated all these accuracies on that. And those hold-out sets were pessimistic: all the brands and categories present in the hold-out set, we made sure they weren't present during training, so that we get actual real-world numbers. The reason I asked is that otherwise there would have been a very big covariate shift with respect to the real-world use case. Yeah, yeah, that's why we had a separate hold-out set; it was never used anywhere else and it was manually labelled. Cool, thanks. Anyone else? After the k-means algorithm that you used, do you have any idea how many clusters were not sampled at all, from which you did not take any items for manual labelling? Exactly, I didn't have any idea about that, because I wasn't bothered about it initially; I thought eventually they would get picked up.
But yes, there were a few clusters which were left out. Because I was picking samples from every cluster, any cluster of a significant size definitely had samples picked from it; the algorithm works that way. Only some small left-out clusters, with very little population in them, were skipped, because the sampling was proportional to population, so I wasn't worrying much about that initially. Because it would lead to the long tail of... Yeah, yeah, I understand your point. But see, the clusters that get left out, that you don't train on, have so little data in them that initially I wasn't worried. And remember, I was talking about iterative data coming in: eventually it may happen that those clusters grow in size, and then your active learning, your approach, will automatically take care of it. Anyone else? Okay, since we have time. The other thing is that in the k-means clustering approach you are using, the priority of each cluster is something that affects your labelling process. Have you tried any other approach? Yes, there were two approaches. One was intrinsic sampling, where we use the model itself and check the confidence score for any output: we ran it on all the unlabelled data we had and checked whether the confidence score generated by the model was very low, irrespective of what the output was. If it is very low, it means the model is unsure about that point. But the problem was that the scores we were getting were very unreliable, because at the start the model behaves very differently, so we couldn't use that. Currently I'm working on a hierarchical sampling approach for this. Okay, so we can discuss that. Sure, no problem. Anyone else? Okay. Thank you, thank you for your time. Thank you, Sandeep, for the wonderful presentation. Shashank. Oh, sorry, Shashank. And next I call upon Sandeep Kurana. He is a research scholar and a data scientist at ISB, and he will talk on the topic of core concepts in social network analysis. Welcome, Sandeep. So, welcome everyone. Am I audible at the back? Too loud, too low, or is it okay? Okay. Social networks: firstly, good thing, bad thing? They are not a new thing for us. In fact, many of us have been through the evolutionary cycle of being enamoured by them, the craze of joining, of adding friends, and many of us in recent times have even quit social networks. So where do you all stand? And one important thing: it is not that you have to be on a social network. There is literature coming up, a lot of popular books now, on the power of introverts, so you need not feel that if you are not on a social network you are getting left out. You should read the book by Susan Cain about the power of introverts, and also Deep Work by Cal Newport, both of whom say that to do something really deep and meaningful you may have to remove these distractions, this information overload or relationship overload, and that introverts do have their own power. So no matter which end of the spectrum you are on, or where in between, or where you stand in your own personal development or social-media presence-or-absence life cycle, that by itself will not be an indicator of your success or failure.
What is more important is to be yourself. That is what all these experts would say: as long as you are your natural self, whether you are on the introvert side or the extrovert side. There are a lot of myths around introverts, that they don't mix around, but the reality is that they have their own preferences and ways of meeting people, and they have their own few but deeper relationships. So both kinds of personality types are good; there is nothing wrong with either of them. But this talk is not even about that. This talk is about everyone in front of me, which means whether you are an introvert or an extrovert, heavily extroverted or mildly extroverted, on social media or not on social media, it is very difficult to escape networks. Even if you are not embedded in a network, it is very difficult not to be influenced by it. So this talk is not about your presence or absence on the network; this talk is about what happens under the hood, and about what you can do, using analysis, to make certain inferences whether or not you are on the network. How many of you have been introduced to the subject of social network analysis? Please raise your hand if you have. Good, then my talk is tailored appropriately, except I think I saw only two hands; the rest of you are pretty new to social network analysis. So I will try to motivate the talk first. In section one I will give you some interesting examples where you see the power of social network analysis. The world is getting more complex, networks are increasingly everywhere, and it is important that you understand the power, and also the danger, of social networks through the lens of social network analysis. So in section one I will do some motivation for the talk. In section two I will introduce you to the terminology, so that after this session I do not get a bad name because you attended but do not know the basic terms. I am pretty sure that in the conversations we are going to have after this session you will all use those terms, and that will set a common benchmark so that you can engage with others and build on that basic foundation. We will go a step further in the third section, where we will talk about the basic theoretical concepts. I am a researcher; I research social media, e-commerce and healthcare, those are my areas of interest, and in fact tomorrow I will be presenting my research work. This talk is a build-up to that, because I want people to be introduced to social network analysis so that tomorrow, when I present my own research, the foundation is laid to understand it. So in the third section we will talk about some of the concepts and some of the theory around them, and do not worry, it will be kept interesting and easy to relate to. I will not get into the maths, which is the easy part; that stuff is on the internet or in Coursera courses, and you can look it up there. It is more important to get the intuition of what a concept means; after that, the rest is either an inbuilt formula or something you can look up on the internet. In the last section I will share some resources which you can look at when you go from here. There is also a quick introduction to the tool itself.
There are many tools, but I will just quickly give you a glimpse of one or two, whatever time permits, and if you have any questions you can ask them during the talk if relevant, or else we can take them towards the end, along with any reactions. So, networks are everywhere. In fact, as you go through the presentation you will realise that in many places you had not noticed that a network exists; it is only when someone points out that a network is possible there that you realise, oh, that means there are networks here as well. So where do you see networks? You see them in our telephone contact books. How does that become a network? If you are Truecaller, or WhatsApp, and you have the Android permissions to take the contact list, then you can aggregate the contact lists of one million, ten million, crores of people and build a huge network. That is the telephone network. The telephone companies alone may not have access to it, but the app developers who take access to contact lists can easily build that network. In fact, the security agencies are pretty worried about the amount of network information an app developer like Truecaller has: it may not be based in India, but it has access to a lot of individual contact lists. Whoever is in my contact list is my contact, and the way a network is structured is that there are two people and then there is a connection. The connection between the two of us is that a person is important enough, whether by frequency of calling or by importance in an offline relationship, for me to store that person's phone number in my contact list. So that becomes a possible social network which one can easily build. In the olden days we used to have telephone directories, but these days the directories are all in individual contact lists, and to build a directory someone has to access them. The internet, again: if you think of the multiple routers, who is accessing through which router, and the dynamic connections, the internet itself is an interconnected network of devices. That is not exactly a social network, but a lot of network analysis research draws from research on the internet and telecom networks. Social media, of course; we will see a lot of examples of that. Social media could be Facebook, Twitter, Instagram, or, what else did I miss, YouTube, or any other network where you have relationships in the form of liking someone. A like is a type of relationship; following someone is a type of relationship; commenting on someone's profile is a type of relationship; sharing something is a type of relationship. So there can be different relationships, and the network is constructed on a very simple, fundamental, basic entity: there are two nodes, or two entities, and between those two entities there is a relationship. There can be multiple relationships, and then there is an aggregation of nodes, where once they all get connected based on their individual relationships, it forms a network. So social media, that is the topic for today, and that is a huge network. Then projects, and within projects a movie is also a project. Those of you who have studied organisation theory would know that within organisations there is something known as a hierarchical organisation, a functional organisation, or a matrix organisation.
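As a tiny illustration of the node-and-edge construction just described, here is a sketch with networkx (an assumed tool choice, not one the speaker has named); each stored contact or co-participation becomes an edge between two people:

    import networkx as nx

    G = nx.Graph()
    G.add_edge("Asha", "Bala", relation="in_contact_list")   # Bala is saved in Asha's phone
    G.add_edge("Asha", "Chitra", relation="follows")         # a social media relationship
    G.add_edge("Bala", "Chitra", relation="co_actor")        # worked on the same movie/project

    print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
    print(nx.degree_centrality(G))   # a first, simple measure of who is well connected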
What those mean is: in a hierarchical organisation there is a department, a sub-department, and it goes down further to the individual team and the individual. In a functional organisation there is a horizontal cut, so there is an HR function which cuts across the geographical regions, East India and HR, and that then becomes a matrix organisation: there is a functional side, HR, operations, finance, and there is a hierarchical side, east zone, such-and-such state, such-and-such city, and then there is a matrix where you have a dotted line between them. But what is of interest to us is an organisation known as a projectised organisation. The projectised organisation is like what we have in an IT company, or like we have in the movies. I am presenting tomorrow some of my earlier research around how the Bollywood network structure revolves around the movie. So what is the relationship there? If I and someone else work together in the same movie, then we have a relationship: a co-actor relationship, or co-producer, or co-participant. That causes bias; this entire curve is sort of shifted to the left. So we have two users. User one has three missed calls; user two also has three missed calls. The difference is that user one did pick up the fourth call, whereas user two has not. So what happens is that this user two, who sits in the bucket of people who have missed three calls, could actually have been anywhere down the line. You have a whole bunch of people sitting in a bucket they do not belong to, because their data was censored. User two is in the third bucket; that is what we know from the data we have, but he could have been anywhere, and that is censored from us. It is quite possible that this could have happened and he would have been in the bucket with five missed calls. So now you can see what censoring does, because you can get the intuition of how the curve would have shifted had we known the proper bucket. So that gives us our constraints: what kind of model do we need when dealing with a problem like this? We need something that fits reasonably well, that goes without saying. It has to handle censored data. It has to be differentiable at all values of t, because h(t) is built from the derivative of the survival function and has S(t) in its denominator, and if that is undefined at some point, h(t) becomes undefined, and we cannot have that. And we want this to be as interpretable as possible, the reason being that the outputs of this model are going to be handed over to the collections team, the people who are making the calls, and the more they understand, the better. So that brings us to the first part, the Weibull model. That is the one we went with, and that is the PDF. I am not going to read it out, but the important things to note here are the parameters, importantly rho, which is the shape: with Weibull, the hazard curve can either go down or go up. And the good thing is that the parameters actually do tell us something about the customers. These are what the hazard curves look like for different rho values. If you have rho greater than one, that means increasing hazard: the more missed calls, the more likely you are to connect. That does not happen in practice, and that is the blue curve, which goes up. Had that been the case, the way we would start calling people is more attempts first: if you have made four attempts to someone, you continue persisting with that person, because he is more likely to pick up now.
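The slide's PDF is not reproduced in the transcript, but in the standard (lifelines-style) parameterisation, with scale lambda and shape rho, the Weibull survival and hazard functions are:

    S(t) = \exp\!\big[-(t/\lambda)^{\rho}\big], \qquad h(t) = \frac{\rho}{\lambda}\,(t/\lambda)^{\rho - 1}

so rho greater than one gives an increasing hazard, rho equal to one a flat one, and rho less than one the decreasing hazard described next.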
If you have rho equal to one, you get the flat line: customers who are indifferent. It doesn't matter whom you call, in whatever order; it's completely fine. Whereas for rho less than one, which is what we have, you get the green curve, and what the green curve shows is what we know intuitively: more missed calls means less likely to pick up the next call. In a situation like this, we would call the person with the least number of attempts first. And when we fit this against the data that we have, we get this fit. The orange curve is the one we saw a couple of slides earlier; that is the censored data. And because Weibull handles censoring really well, this is the curve we get; as you can see, it has shifted to the right because censoring is factored in. So, was Weibull our only choice? Actually, no. We started with a really simple model called Kaplan-Meier. This was developed by Kaplan and Meier a long, long time ago, for the calculator era: if you sit down to compute the survival of your customers, you can actually pick up a calculator, punch the numbers in, and walk your way through the curve itself. But there are problems with it. The first is that it is less interpretable: if you compare the smooth Weibull curve to the jagged Kaplan-Meier one, I think Weibull is a lot more interpretable. The second is that it is nonparametric, and because it is nonparametric it requires a whole lot more data than Weibull, which is parametric. The other good thing about Weibull being parametric is that you can extrapolate. Let me just go back one slide: had we gone with Kaplan-Meier, the curve would have terminated at, let's say, 20 to 22 calls, whereas with Weibull you can see that it goes all the way down and we can extrapolate what would happen if the person misses 50 calls. The other issue is the flat segments you get in Kaplan-Meier: at those points the derivative is zero, and at the jumps it does not exist, so the h(t) you get from it breaks down. We can't have that; we do need that function. There are other alternatives: there are power curves, one upon x for example, and there are piecewise exponential models, which are nonparametric but also differentiable. But let's stick to simple. Moving on. Weibull has this assumption that everyone's behaviour is the same: you miss three calls, I miss three calls, Freelunch Freddy misses three calls, it's all the same. That is actually not the case. So we do need to factor in that, okay, fine, I have probably used one merchant, Freelunch Freddy is on three, you are on two, and this will probably change the probability of you picking up the call. So, here is us again. I tried using Simpl; my transaction was declined because I was busy preparing for this talk and my bill was due, and because I am busy, I miss three calls. I do open the email that my company sends me, whereas Chris does not; he has ProtonMail and he does not like to be disturbed, and he owes a whole bunch more money than I do. So who do we call first? Him missing three calls, me missing three calls, are they the same thing? Actually, no, he is a lot more shady than I am. So that brings us here: Cox proportional hazards. You get this really neat separation of concerns, where Weibull tells you, okay, fine, what is the probability of the fourth call being picked up.
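Going back to the Kaplan-Meier versus Weibull comparison for a moment, here is a minimal sketch of fitting both on right-censored call data with lifelines (the library named later in the talk); the column names and numbers are hypothetical:

    import pandas as pd
    from lifelines import KaplanMeierFitter, WeibullFitter

    df = pd.DataFrame({
        "missed_calls": [3, 3, 5, 7, 2, 9, 4],   # our "time" axis is the number of call attempts
        "picked_up":    [1, 0, 1, 0, 1, 0, 1],   # 0 = still not picked up, i.e. censored
    })

    kmf = KaplanMeierFitter().fit(df["missed_calls"], event_observed=df["picked_up"])
    wf  = WeibullFitter().fit(df["missed_calls"], event_observed=df["picked_up"])

    print(wf.rho_, wf.lambda_)                            # rho < 1 would mean a decreasing hazard
    print(wf.survival_function_at_times([10, 20, 50]))    # parametric, so it can extrapolate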
Cox proportional hazards, on the other hand, tells you: these are your inherent user characteristics, so what is the probability of you picking up the next call? The data that goes into Cox proportional hazards changes: you have a three-tuple, C_i, T_i and D_i. C_i is a one-by-p vector of user features; it is called C_i because in the literature your features are actually called covariates, so we left it at that. T_i is the duration. Because survival models come from biomedicine, T_i would normally be physical time; in our case we take time to be the number of missed calls, one missed call, two missed calls, three missed calls, along the x-axis. And D_i is whether death was observed or not. Again, survival models tend to be on the morbid side of things; in a cancer trial it could be the physical death of a patient, in our case it is whether you picked up the call, and I am pretty sure you won't die. Trust me, the movie is an urban myth. Okay. What puts the word proportional in Cox proportional hazards is the fact that my hazard, let's say h_i(t), divided by Chris's hazard, h_j(t), at some missed call, let's replace t with three missed calls, will be some constant. And because it is some constant, what we can say is that my hazard at some time t is going to be h_0(t), the baseline, times the exponential of the features times their weights: C_i are your features, the betas are their weights, c_0 beta_0 plus c_1 beta_1, dot dot dot, c_n beta_n. And this h_0 we actually do have; it comes from our Weibull model. So this leaves us with finding out what those beta values are, and we get them with maximum likelihood estimation. I am pretty sure all of you are familiar with maximum likelihood, but here is a little revision. The way you do maximum likelihood is you write down your probability model as a function of unknown parameters, in our case the beta values, and what you want is some value for these betas which maximises the probability of seeing the data that you have actually seen. Maybe it is 0.2, 0.3, 0.4, and these are the beta values which give the highest probability of seeing the data set you actually observed. And this optimisation you can do with, I don't know, gradient descent, expectation maximisation, Newton-Raphson, pick your favourite. For a single person, this is how you construct the likelihood. Say you are constructing mine: it will be the exponential of my features times the beta values, divided by the sum of that same exponential term over everyone still at risk. The likelihood for the entire data set is then just the product of all of these values. We don't really optimise this by hand, or by using scipy.optimize; we use a Python library called lifelines. Lifelines is a really neat library which is completely geared towards survival analysis, so we just plug the data into it and let it do its thing. I actually forgot to tell you about the features; these are obviously not the features that we use, for security reasons. What comes out of the Cox proportional model is this really interpretable result: amount has a beta coefficient of negative 1.39, and what this means is that as the amount goes up, the probability that you will pick up goes down.
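A minimal sketch of the Cox proportional hazards step with lifelines, using the illustrative features from the slide (amount, merchants seen, spending trend); the talk notes the real production features are different, and the random data here exists only so the snippet runs:

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(0)
    n = 500
    df = pd.DataFrame({
        "amount":         rng.exponential(1000, n),
        "merchants_seen": rng.integers(1, 10, n),
        "spending_trend": rng.normal(0, 1, n),
        "missed_calls":   rng.integers(1, 15, n),   # duration column: number of call attempts
        "picked_up":      rng.integers(0, 2, n),    # event column: 0 means censored
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="missed_calls", event_col="picked_up")
    cph.print_summary()                        # per-feature betas, hazard ratios, p-values
    risk = cph.predict_partial_hazard(df)      # exp(c_i . beta), the per-user multiplier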
And the sign on amount actually makes intuitive sense — because he owes more money, he's less likely to pick up, versus me: my amount is smaller, and I actually did pay my bill. Then merchants seen: it has a positive coefficient. The more merchants you use, the more likely you are to pick up, because you understand how awesome Simpl is and you want to continue using us. Now, spending trend has no bearing — it has a really high p-value, so you can actually ignore that feature itself. So, let's put the pieces together. We have Weibull, which tells you: three missed calls, what happens to the fourth? Cox proportional tells you: okay fine, given your user features, what's the probability you pick up? And we can just multiply the two together and get the priority score out. So, this is sort of the general recipe that we use in production. Every day we have an Airflow job which picks up all the users to whom we have made calls. From this data we know, okay fine, user one has missed three calls, four calls, whatever. And then we actually calculate the hazard for the next call — so if you have missed four calls, we calculate the hazard for the fifth call: what's the probability that you pick up your fifth? We do filter out the ghosts; that is something that I'll explain a little later in the talk. Now that the hazard is out, you know who's the person who's more likely to pick up. So what you can do is rank-order everyone who you're yet to call, push this into the calling system, and the people who have the highest hazards are the people who get the calls first from the collections agents. So, these are two users — completely hypothetical — and these are their hazard curves. As you can see, the person with the blue curve has a generally higher hazard compared to the person in orange. So what happens is, when the agents start calling, the person in blue will probably get a whole bunch more call attempts before they start calling the person in orange. You'd probably make, let's say, five calls to the user in blue, and then — okay fine, at this point we have to stop — let's start calling the user in orange. And this is the calibration curve we have for our Cox proportional model. The calibration curve is on the right. It's not perfectly calibrated; it's reasonably well calibrated. It's directionally correct, in the sense that it does agree with our assumption that more missed calls equals less likely to pick up, and that you can see from the graph on the left. So we have a whole bunch of calls on the left, as you can see from the bar chart, and as the hazard goes up, the call connect rate also goes up. But beyond a certain point it just becomes noise, right? On the left, what you have is Cox proportional hazards working reasonably well, but as you go towards the higher-hazard side it becomes really squiggly lines — I mean, it's just noise. So we have to smooth this out, and the way we smooth this out is we actually fit an isotonic model on the blue line itself. So after a certain threshold, the probability of you connecting actually becomes one, right? You have a reasonably high hazard — okay fine, we don't want specific numbers, we know that you're going to pick up your call.
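Putting those pieces together might look roughly like this. It is a sketch only, reusing the wf and cph objects from the snippets above; the isotonic step uses scikit-learn, which is my assumption — the talk only says "an isotonic model" — and all numbers are illustrative.

```python
# Sketch: priority score = Weibull baseline hazard for the *next* call,
# multiplied by the user's Cox partial hazard.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def priority_score(missed_calls, user_row):
    # user_row: one-row DataFrame with the same covariate columns used to fit cph
    baseline = wf.hazard_at_times([missed_calls + 1]).values[0]
    partial = cph.predict_partial_hazard(user_row).values[0]   # ~ exp(C_i . beta)
    return baseline * partial

# Rank-order everyone yet to be called, highest score first, and push to the dialer:
# queue = sorted(pending_users, key=lambda u: priority_score(u.missed, u.row), reverse=True)

# Calibration smoothing: observed connect rate vs. hazard gets noisy at the
# high end, so fit an isotonic (monotone, non-decreasing) curve over it.
hazard_bins  = np.array([0.05, 0.10, 0.20, 0.30, 0.50, 0.70, 0.90])
connect_rate = np.array([0.04, 0.09, 0.22, 0.28, 0.55, 0.50, 0.95])  # squiggly tail
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
smoothed = iso.fit_transform(hazard_bins, connect_rate)
```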
And these are the results of a simulation, right? If we were to call people at random, this is the connect rate we would get; whereas if we were to call people by score — that is, by the hazard scores — this is the connect rate we would get. Now, the y-axis has been taken out. That's because this is competitive information, and my co-founders are here — I really do not want to disclose all of that information. Right, so what we've spoken about so far is prioritization, call prioritization: okay, who gets the call first? But we still haven't gotten around to finding out who Freelunch Freddy is. That's what we'll do next, and that's what cure models are for. Okay, so in the group of people that we have to call, there will be those who will never pick up the call, no matter how many times we try. It's a possibility that Freelunch Freddy took out the SIM card and threw it away. So if we keep calling him, it's quite likely that we'll burn more money than he actually owes us. So cure models are the kind of models that let you answer the question: okay fine, there are people in your dataset who are never going to die — who are never going to pick up — so let's deal with them. I'm not really going to get into significant detail over here; the section is here simply for the sake of completeness. So, we've all seen Bayes' rule. What happens inside a cure model is the following: you are trying to find out the probability of someone being a ghost, given that they have missed n calls, comma, their feature vector. So let's say this is the feature vector for Freelunch Freddy. We can sort of work our way down — let's work through this. The prior probability of someone being a ghost is, let's say, one or two percent. The survival function will tell you that, okay fine, you made one call attempt to someone and they did not pick up — what are the odds that they will pick up the next? Maybe 30%, because whenever the collections agents start calling, nobody picks up their first call; you have to make two or three attempts before they actually pick up. Now, as you keep missing more and more calls, this will approach one. It's going to be, let's say, 0.1 upon 0.1 plus (1 minus 0.1) times a really small number. So this will approach one, and you are really, really likely to be a ghost. Let's consider another scenario. Let's just say some fine day Lazy Lakshman is being even more lazy. He's not picking up calls. We've made 20 calls to him; he's not picked up any. So how do you differentiate between Freelunch Freddy and Lazy Lakshman? In the case of Freelunch Freddy, the probability of him being a ghost would be two or three percent. We've already wasted 20 calls on Lazy Lakshman — what's the probability that he's going to pick up the next? They're kind of similar: maybe a two percent chance of picking up the next, versus a three or four percent chance of someone being a Freelunch Freddy. What do we do now? How long do we persist? Now, the curves that we do get — especially when you start plugging in and plotting the formula that was on the prior slide — look like this. It's actually a sigmoid, but the reason you don't see the S shape is because the survival function is not decaying fast enough. So the blue line could be Diligent Dave, who has now decided to become Freelunch Freddy — it's highly unlikely for him to miss five calls, so you would say, oh fine, probably he's becoming a ghost. Whereas the orange line could be either Broke Babu or Lazy Lakshman — it's okay for them to miss these many calls and not be a ghost. So at 20 missed calls, we don't know who's who. What do we do now?
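Here is a minimal sketch of that Bayes update. The prior and the toy survival function are made up; in the setup described here, the survival function would come from the fitted Weibull model.

```python
# Sketch of the cure-model Bayes update: P(ghost | n missed calls).
def p_ghost(n_missed, prior_ghost, survival_fn):
    """A ghost never picks up, so a ghost 'survives' n missed calls with
    probability 1; a non-ghost survives with probability survival_fn(n)."""
    s = survival_fn(n_missed)
    return prior_ghost / (prior_ghost + (1.0 - prior_ghost) * s)

toy_survival = lambda n: 0.7 ** n   # non-ghosts keep getting likelier to pick up

print(p_ghost(3,  prior_ghost=0.02, survival_fn=toy_survival))  # still smallish
print(p_ghost(20, prior_ghost=0.02, survival_fn=toy_survival))  # approaches 1
```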
So we can use something called the mean conditional lifetime, and what this tells us is: okay fine, you have already missed 20 calls — how many more calls do I have to make so that the next call will be picked up? If it's one or two calls, it's fine, we will persist in calling you. But let's say the mean conditional lifetime comes out to be 30 more calls. Is it worth it? Is it worth it to treat Lazy Lakshman any differently from Freelunch Freddy? Probably not, because it might happen that the economic cost of pursuing Lazy Lakshman is actually a whole lot more than the money that he actually owes us. So what we will do in that case is escalate him from level one to level two. So this is the complete pseudocode. The greyed-out parts have not been filled in; point (c) especially is where the magic happens. If the probability of someone being a ghost is above a certain threshold, we will move them from level one to level two. Or, if they are below the threshold but the economic cost of us pursuing them is really high, we will say, okay fine, it's no good for us to keep being polite to them — it's better for us to move them from level one to level two. And the rest is the same. So let's see how we put this into production. Previously we had spreadsheets where everything was stored. How many calls to make before we move someone from level one to level two — that was decided by the manager of the ops team based on his experience. So if he felt like, okay fine, let's say maybe five calls, then that five was a setting that applied globally to everyone. We know that's not quite right; we can be a little more clever than that, and we can tune the number of calls to make for someone by using survival models. And we let everything happen automatically. So we have systems that look at the number of calls being made to someone: okay fine, the day is progressing, your agents are making calls, someone has missed three calls, okay, now the hazard is really, really low, he has a really high mean conditional lifetime — let's just move them from level one to level two. That happens automatically, so that's good for us; it's more productivity for us. And when I was proposing this talk, one of the reviews that I got was: okay fine, you know, for the engineering people you should have this really elaborate diagram that explains how everything is done. So here it is: we have an SQL database, and we store our data there in tables.
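(A quick aside before the rest of the architecture: here is a rough, hypothetical sketch of the escalation decision in point (c). The thresholds, cost per call, and helper names are all made up for illustration; p_ghost and toy_survival are from the earlier sketch, and the mean conditional lifetime here is the standard discrete mean residual life.)

```python
# Hypothetical sketch of the level-one -> level-two escalation decision.
GHOST_THRESHOLD = 0.8    # assumed cut-off on P(ghost)
COST_PER_CALL = 15.0     # assumed fully loaded cost of one agent call

def mean_conditional_lifetime(n_missed, survival_fn=toy_survival, horizon=200):
    """Expected number of additional calls before pickup, given n_missed so far.
    Discrete mean residual life: sum over k >= 1 of S(n + k) / S(n)."""
    s_n = survival_fn(n_missed)
    return sum(survival_fn(n_missed + k) / s_n for k in range(1, horizon))

def escalate_to_level_two(missed_calls, amount_due):
    ghost_prob = p_ghost(missed_calls, prior_ghost=0.02, survival_fn=toy_survival)
    if ghost_prob > GHOST_THRESHOLD:
        return True                     # probably a ghost: stop being polite
    expected_cost = mean_conditional_lifetime(missed_calls) * COST_PER_CALL
    return expected_cost > amount_due   # not economical to keep calling

print(escalate_to_level_two(missed_calls=20, amount_due=500.0))
```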
We have a python script in airflow which does model training and then we push the results to the calling team that's our grand architecture so yeah if you're looking into putting survival models into production here's my two cents understand that every model that you pickle in store beat survival model beat isotonic regression what have you it is sort of this crystallization of assumptions right that that's predicated on what data it was trained on what was the quality of the data the volume of the data and when it was straight so in our case we need to keep retraining on the first of every month yes so questions by the way we have a boot outside we are offering really cool stickers so yeah we have questions hi where are you oh yeah there you are nice presentation my question is a bit generic and please excuse if I sound a little too skeptical basically you're trying to model human behavior is it right to say that and my sense is that when you do that there is always an element of uncertainty which our models cannot capture now in medicine it might be okay because we are not going to do anything about that survival rate it's just a good to know information but here you're actually making calls and you know following up with people so what is your whole take on this because there is only a to a certain extent you can model human behavior it's very unpredictable and the second thing is when you make 20 calls maybe the user gets irritated and stops taking calls even though he is a lazy election or broke babu and not a free lunch ready so what you do influences human behavior which is exactly what you're trying to model so I'd like your comment on that thank you okay so you actually have two questions you're asking me how is our application different from biomedicine and the second is if you make 20 calls what happens you're actually creating this feedback am I getting this right okay so in case of biomedicine the survival rate is not just a good to know if it was Kaplan Meyer let's say you would actually see the curve go down as people die the curve actually starts going down so let's say you introduce some change in the protocol in which the medicine is administered what you would see is that the curve becomes flat that means people are not dying so what that gives you is an insight that okay fine we change the protocol probably the amount of medicine that was administered to someone and you know people are living longer now so survival is not just a good to know it actually gives you insight about what it is that you're doing that is actually affecting people's survival the next question was what if you make 20 calls to someone you're likely to irritate them and they probably might not be owing to the fact that you've made 20 calls to them now the thing is we do not want to make 20 calls that's the whole game right so if someone has missed 20 calls it's actually a whole bunch of money wasted for us we could have stopped that let's say 5 calls and that 5 would have been your hazard and that would have come from survival model itself so we could have stopped after the elven calls could have stopped after 5, 6, 7 calls depending on the person's hazard and you could maybe have sent them a demand notice or call them from a separate burner phone or whatever and it would look different to them hello hey thanks again for the talk I had the question more about like you know the call start and how do you deal with new user knowing that you get data you know only twice a month how many months do you 
need to make sure the feature vector for the user is actually like reflects the behavior of the user correctly enough so the question is what do you do about new users and you're retraining every month the reality might have changed in one month so the one thing is the bulk of the calls that we make are for repeat users people who have used us for 2 or 3 months in the past they're continuing to use us but now they're delaying to settle the bills so proportion wise they are a whole lot more compared to the new users that we have so for now the model focuses on modeling the behavior of the repeat users because more data that we have for them the next question was actually for the next question oh yeah once a month the reality might have changed how much data do we need now the call backlog that we have so what we do is we keep persisting until the user pays either they will pay themselves they're like oh I have to pay my simple bill I forgot they called me 2 days ago and the system will be updated and that person will be removed from the calling queue but if they persist what happens is that we get this really giant backlog of data we have more people to call compared to the agents that we have who can make the call so what we do is we train the model on the past 6 months of repeat users and see how what comes up my question is are we using the same baseline and the proportion model for all the users are we making multiple models so in any case what is the justification for both the choices so the question was is the baseline and partial the same for everyone no the baseline will be the same let me oh let's let me see if I can get there quickly so H0 is your baseline that comes from your viable model and that's model over your entire population like let's say I pick up 10,000 users I pick up those who have missed 3 calls and then I see what's the probability that that will give you H0 that is common for everyone what Cox gives you is e to the whatever and that is something that is unique to you so your partial hazard and my partial hazard are going to be completely different yes those would be different for everyone so the question is how do we make clusters you don't have to make clusters why would you make clusters could we move on could we continue that discussion and take any more questions if there are any more questions on that side hi thanks for the talk sure I had two questions do you see any time of day effects in terms of when you call at particular times do you see a higher rate of pickups and I'll just get to the second one as well and okay yes time of day effects oh my god I hope not revealing too much here yes time of day does affect the probability of someone picking up the call so if you have someone who's an office goer and if you call them during their office time they're less likely to pick up verses let's say you call them either really early in the morning or later in the day when the office is over and let's say they're home right so those are the kind of things you actually can factor into Cox proportional model you can say okay fine we know that this person is he uses our app oh you can sense my hesitation can't you okay so let's say we can find out from the other okay fine the time that he's active is later in the day we can factor that in and see how that affects what was the second question my second question is what other apart from calling you have other avenues like emails and messages yes how do you use that in conjunction with calling and does that factor into 
the model yes I think I briefly did mention that right probably didn't stress it enough so we do factor in whether you receive the emails or no and how many of them do you open questions by the way please come see us outside we have cool stickers thank you a huge round of applause professor there you go so we've had we're coming to the end of the morning session we had two fantastic talks taking problems that people have been solving for a long time and applying it to current day context and trying to make it work in their context both the vehicle routing and that collection I think are fantastic examples of taking very traditional algorithms and traditional approaches that have been used in other industries and trying to bring them to what you're trying to solve and I think that's one of the aims as people who are trying to apply data science or machine learning to their industry so excellent round of talks and the first first session so we'll take a break now we are now there is a morning beverage break which means there is chai and coffee basically so go get your chai and coffee and be back here by 11.25 where we'll have a next set of three talks primarily around data pipelines thank you can we have a test run hi so welcome hello welcome to fifth elephant hello welcome to fifth elephant okay so here I'm going to talk about seven steps to build your own data pipeline okay hello yeah here this is nothing here from abdynamics it's okay it's little echoing there is a lot of echo the auditorium thank you hello can you hear me well is it clear enough yeah talking about ML platforms should I get closer yeah this will be primarily a systems talk the earlier two talks were focused on the mathematical and statistical problems this will be all about productionizing some of those statistical models hello hello okay am I audible at the back side is it clear okay all good right it's echoing okay okay how's it right now is it clear okay is it clear can you hear at the back side clear is it echoing all good okay welcome back just a few announcements before we start the session after the morning beverage break if you bought the fifth elephant t-shirt online you can collect them from the t-shirt counter outside near the help desk after 11 abdynamics has sponsored cupcakes for all participants so you can thank them at their boot if you are using the wi-fi and you're sharing your experience online then you can or on twitter I should say the event hashtag is hash fifth l or you want to follow what other people are saying hash fifth l and the twitter handle is at fifth l there is a redis ai tutorial which will start at odd e2 in at 155 so if you're keen to attend to that go look through the outline see if you're interested and install whatever is the required software to attend the tutorial also if you're not talking here but you're keen to talk we do have flash talks which will happen in odd e 2 tomorrow at 455 no on today sorry day 1 so today if odd e2 455 to 530 if you're really keen to talk it's roughly 5 minutes lot you can come and talk about anything you're doing in the demo you can use a laptop otherwise you can just come and share what your experience is and what you're trying to do and get feedback or connect with the audience if you're interested just write your name topic and what you're trying to do and give it to the hall manager hall managers in either of the oddies right ok first sessions were really around algorithms and how people are using them now to solve problems we shift gears and we 
go into data pipelines and one of the hardest problems as people start to build use machine learning or data sciences how do we get all that data inside our systems we're going to have three talks around that we're going to also have so these three talks are all linked on data pipeline all the three speakers will have their talks we'll also have a combined q&a with the three speakers after lunch right so if you have questions if you have questions regarding the talk then ask them after the speakers talk but if you're looking for some connections across the three talks and have some broader questions hold on to them we'll have a q&a with all the three speakers back after the lunch session right so to kick off we're going to see how do you start to build your data pipelines when you're just starting right so don't leave it for too late how do you kick start that process right when you're doing this starting your startup or a new project and kind of walk through seven steps or ten steps okay seven or ten seven or ten steps around that so over to kumar for that thank you see thanks guys hopefully this doesn't happen again so yes I am going to talk about seven steps to build your data pipeline I figured out sorry for the confusion I reduced the number of steps just so that it fits in the timeline and it's easier to grasp and primarily the way I've structured the talk is going to talk through some of our experiences and kind of our war stories how did we end up doing it and then before that I'll just explain who we are exactly so yeah so who we are we are a mobile game development company based out of Bangalore we are moon frog we are making mass market mobile games primarily for all Indians and Indian subcontinent we are approximately 5 million plus daily active users as of now so large scale in terms of product and 15 million plus weekly active users we make real-time cross-platform games for Indian subcontinent we have been doing it since late 2013 so this is like pre-geo post-geo so we know this market for real-time and all that stuff also we are highly optimized for India and Indian subcontinent in terms of our product the most important point is we are a profitable company as well what does it mean for data what does it mean we make mobile games, consumer products for this kind of scale what does it mean on data side is our current ingestion scale is 20 billion unique events per day and total size of data that we are ingesting uncompressed is 800 GB plus no now obviously this is not the scale we had on day 1 and this is not what we designed our initial or v1 of data pipeline to handle but I talked through some of our learning some of the journey some of the decisions we end up making and why and hopefully obviously like all of you will not have the same set of problems or exactly the same things that we had to face or we ended up doing but I hope that some of the learnings or some of these things will help you make the right decisions whenever you are at the decision forth point now what we wanted on day 1 so I have made sure that I have added for day 1 of your start up or day 1 of your product because we, me and my team we were in game development for quite some time even before doing moon frog so we kind of knew that how important data is and most of the people here in this room for sure know this like when you are launching your new product you want to know how users are using it what's happening in the product is it behaving the way you designed it to be so all that stuff so it's very 
important it was very important to us that we kind of are very clear what we want on day 1 not on day 30 not on day 365 or what on day 1 is absolutely important for product of business success so these are some of the points first we needed access to data of the product or the game launched immediately like as soon as possible like when you, when we launch a game and I'm pretty sure for you guys also who launch consumer project when you put out the product on google play store or website or wherever you get one user in the first minute or first second or whatever you get 100 users in the first day or something like that or 1000 maybe you don't get a million users on day 1 necessarily right so you need to actually needed to have ability to kind of query role level at user level as well they kind of know what did this user do what did he or she ended up doing clicking where did the drop off why and all that stuff as real time as possible because we were making consumer products games we had to kind of do something they okay go in hands of the user kind of see how this user is playing it okay what happened okay the loading time was too much or whatever all that stuff so as real time as possible so that we can fix it also as good second cost sensitive being a startup obviously we didn't have we were bootstrap startup so we didn't have all the money we had to worry about this infrastructure side wanted to start with free credits as most of the entrepreneurs want to ops light because ops heavy stuff will cause you too much pain in the early days and kind of distract you from building the actual thing which is the product then resource also like for all entrepreneurs here like you do not have engineers just sitting around to do build your data pipeline all the time so we realize that okay we actually will spare half a engineer like half the resource try to build something that we want which we require in the early days and then slowly as the product scales then we'll scale this team to maybe one engineer which is a hundred percent growth and then large the last step also we were very clear that the games the kind of products we do they have large scale up requirements they do not give you time to hold on do not play our games we are currently upgrading our research sorry like users will move on so we knew that okay we need to have scale up capability from day one but obviously you cannot scale up to whatever hundred billion or hundred million users just like that you don't want to over architect on day one so we figured out the ceiling we said shouldn't cause it till one million daily all right so here are the seven steps I'll just quickly read through them but we'll go into detail first is understand understand your requirements and constraints your business requirements and your operations tech and all those constraints second thing generic I'll explain why it was super important to us and hopefully there will be learnings here because we make mobile games we don't make one mobile game we have to make games continuously at some frequency so we don't want to do data pipeline again and again we don't want to figure out new data nomenclature and hold to apology again and again so we had to figure out or think from something generic on day one third produce data well in games like that data is the most important thing because you can't go to the user and ask him or her that okay why did you stop playing our game or why did you start playing this game why did you like this or why did you not like 
this or you are going to get is I stopped having fun or I had fun and which is more or less not much useful to product developers like this fourth design v1 of your data pipeline know that it will change and it's only v1 fifth open up enable many data interfaces primarily use the data make sure your company your people everyone is using the data as much as possible otherwise no matter how much you ingest if it's just sitting around it's just a waste of resources anyway then tune in repeat and last step upgrade to v2 so I'll go through these seven steps our journey primarily and explain leo in our context so first step is understand requirements and constraint we figured out there are three different verticals that we had to worry about first is business obviously so what are business requirements and constraints on day one like they wanted real-time ingestion they needed real-time ingestion so that they can query it as quickly as possible fast query speeds don't want to like I want to have this I have this question in my mind right now why did this user stop playing yesterday if I cannot get the result as quickly as possible then other things will get important and it will just slow down my iteration speed anyway third SQL query interface this was a constraint that we put a requirement that we figured out that all our people and we knew that whoever we are hiring SQL is the right interface to build on then row-level granularity as I said we needed user-level granularity we can look at some aggregate stuff okay how many people from our astra played yesterday and did this but then at the end of it you'd actually need to get down to that one user that one session look at what happened how the funnel actually looked like then rich events rich events means that okay primarily we wanted all as much data as possible available in that particular row itself somehow because SQL was the most important thing we wanted every person in the company to write SQL or more or less like that every product manager coming from anywhere has to write SQL data scientists developers maybe even QA so obviously like they are not going to write 6 level of inner joints and all that all the time some people will do but still you wanted data available in the first level itself tech side we wanted generic data design we knew that will make more products more games so we wanted data design to be generic forward and backward compatibility very important because we are upgrading our products continuously product is not going to look like how it looked on day one after a month itself so you want full compatibility simple architecture so that it's easier to manage and all that stuff too much complication and all that will be a kind of a distraction from the actual products itself forth was very important that's a constraint we put immutable data we said these are all data events analytics events so data will be immutable and that's upfront let's just talk about it it has to be immutable no nobody is allowed to change it you make a typo fix it in the product and it will start showing up better tomorrow ops hosted services as much as possible so that we don't have to worry about ops scale out capability and resilience to bad queries this is also very important we knew that everyone in the company we want them to use equal to kind of query themselves so there will be bad query someone will forget to put a limit and it's all right we have to have infrastructure which can handle to some extent now at bottom I have given example a 
okay what kind of queries we were talking about on day one just to give you a flavor of what it so how many users who played at least one game yesterday saw the bonus pop up for the first time now this is a very simple question in our business this is not a complicated question which will come and this itself if you convert into a sequel you will see that it will require a couple of inner joints some tables depending on what you do and then the actual analysis starts once you get the answer to this then immediately the PM or the data person is going to ask how does that compare to last week or yesterday or something like that or maybe last year sometimes so all that is very valid problem statement for us let's come to the actual the interesting part which is the data schema design now we kind of listed on some rules like that will follow where it has to be generic we knew that okay we will I will face a problem someone in the team is not gonna use the data to its full power so we figured out we wanted many data interfaces and again the rule being keep it simple transparent taxonomy your misuse is okay and use third party interfaces so direct sequel workbench evaluated several tools made of as redash and standard python scripts to business reports we realize some problems early on realize temporary need for extra data in some tables sometimes because we had a moving window sometimes people need the last month's data also last year's data so we created rendic automatic jobs which users can run to do on-demand tables and also sometimes more historical data is needed like last year IPL what was happening so that kind of stuff we figured out much later and we created a short ad hoc red chip clusters you kind of the clusters you create a smaller set and do your independent analysis and all these tools we ended up building so that our teams do not need data pipeline or data platform people continue so they can just keep doing it themselves then tune and repeat not all data is important always back up drop and rebuild very quickly and data will continue to increase so problems bottlenecks in ingestion we had to figure out solutions to insert versus copy we figured out that copy how to tune it properly how many queues to use what are the right variables to continuously like tune for our usage our ingestion speech it will be different for different people parallelization requirement for ingestion solved by tuning the queues and thick client the also leading to CPU spikes so optimization in file handling and redshift and load tuning also we had to figure out so we ended up for example breaking the bigger tables into smaller tables as well just to kind of give these are just views to kind of give users a kind of easier access now upgrade to the v2 so this is what the data pipeline approximately looks like in our application now is all the different kind of producers here different this decay is different like in the client itself or in different micro services of different languages and different types it comes to what we call we upgraded square root to badger that we call which is a whole pipeline which actually processes the data in different stages I will just talk about how this is built and this is completely in house and then data is still directly going to data warehouse as well as to data lake and both are still redshift and s3 also we have added a parallel pipeline for smaller set for different product alerts business alerts and real-time dashboarding and all that stuff which goes to mem sql 
and all those warehouse and your real-time data store and as well as the data lake all are directly accessible to our dashboard reporting bi and users themselves by doing direct sql queries now just explanation badger is a file ingestion client which is thin file ingestion client written in golang rotates files of nsq then this was the badger pipeline which is a high throughput ingestion back end we use all the pieces are written in golang we use nsq for high throughput message passing and archive files to s3 in csv and parquet format as well now and batched upload to redshift as per priority queues and mem sql for maintenance like that's very standard so this is our current scale as you can see games we have a daily pattern so it follows the traffic patterns of the products as well total events per day 20 billion plus total size of data per day is 800gb plus uncompressed average events per second is 200,000 plus peak events as you can see goes till 350k and rows maintained hot loading aggregate tables is 250 billion plus rest is obviously all backed up in data lake we can load it whenever we want in ad hoc tables or in the main cluster as well so this is what I had a high level walkthrough of what our journey has been again just to reiterate these were the seven steps kind of we ended up following and thanks for coming and listening hopefully if you have any questions I'll be around I can take a couple of questions now but outside feel free to catch me and happy to answer any questions awesome we have five minutes for questions so a few questions this is excellent presentation thank you so the pipeline technologies look great it's simple easy to understand but just want to figure out you know what are the provisions for monitoring the data quality or any kind of data leaks happening what are the matrix you use for monitoring the data flow or any mess primarily good question like we monitor as we don't have there is no differentiation between data for the pipeline so what we just are looking at in different phases of the pipeline how many rows we are processing and we look at that itself is also getting ingested in the pipeline we look at that live that okay are we how many rows we are discarding due to bad formatting or something of that sort and how many actually we are able to parse through the funnel and if that crosses a threshold then that's an alarm but generally never happens because we are not doing any data transformation what you are inserting is what you are getting so there is no transformation so generally that is 99.9% data coming through unless somebody has put in a special character which is breaking something but since we the data team controls the sdk as well and all the pieces of it that rarely has happened for us going from v1 to v2 I mean obviously would have built in resiliency and you would have experienced some failures as you move along from v1 to v2 what was the usually design response time in terms of you know when you identified an issue versus fixing it and redesigning it and then making changes while keeping the whole thing alive I understand it's a painful exercise but yeah so obviously but that's the whole point also then you get to talk about it because you have to go through the pain for us that was also very important our data pipeline actually cannot go down we treat it as like ultra most important live service in the whole company this cannot go down if it goes down you will certainly see 30% of the office like sitting idle on the tables they cannot 
queries anything so very important it was we realized that we have to do slow migration from v1 to v2 so obviously you don't want a full boolean switch suddenly that now we are on this amazing new system which can scale so slowly one by one one microservice at a time if we are changing something on the sdk side then we do a very slow roll out across all microservices make sure all your sanity is fine and actually for that only your data team is not your data platform team is not enough you need the data scientists as well or the product managers themselves we have just upgraded system and wait for those 2 weeks to 4 weeks to kind of see if the data is all fine and there is no deviations which is breaking their alertings reports and what not all the bi stuff so that has to be done some pieces if you are changing something inside one of the pieces in the pipeline but that is completely in your control you still do a AB kind of stuff you pass some traffic here and divert some of it but if you are 100% sure and it is a standard release or standard upgrade then you take those calls judiciously accordingly time frame like how from v1 to v2 I think the design to this would have taken couple of months maybe like development aside development is whatever it takes and then after that take couple of months to make sure that it gets smooth migration without any data losses and all that stuff and we are never breaking the day to day flow of today you cannot do anything because our pipeline is broken couple of months we can take one more question yeah there is one there hey this is Anand you talked about going with the approach of using a thick client which is a local storage before ingesting right maybe it is good for an optimization but what about a possibility of losing a data right because of like crashes it could be especially in case of microservices where they are ingesting right do you guys have those scenarios where possibly because some crashes like instance failures you lose data and still you go with these approaches you are okay with using this data so good point what we actually in the V1 itself even the thick client the thick client was not inside the microservice itself it was a separate microservice running where the microservices were running microservices had SDK inside which are just writing on a local file storage so that in the beginning itself that's a call we ended up taking microservice itself should not go down due to a bad like data pipelines component so they are just writing to a local file system not even over network or nothing of that so just a local file system so that that part is guaranteed the max it can happen is your collector or your this client is broken and what not and this folder or this particular disk is getting piled up that's the worst which can happen and you can fix that or you can have different other like archival policies to cover for that so that's how we ended up doing the thick client was thick because it was actually reading those files and doing all this stuff and directly actually uploading to redshift and s3 so kind of calling those over network data transfers as well so you end up using resources of the instance and that's why it was little slower much slower than what we have now okay I know many of you have more questions we will have a joint Q&A after lunch where Kumar will be back to answer more questions and more thoughts that you have and you can also catch him outside thank you a huge round of applause for the lovely talk okay so while 
the next speaker is setting up you will probably have some ideas about how to do data pipelines on your day one and you can clearly see there is a role coming out over time about how DevOps that used to be called in a traditional production environment maybe there is another role emerging of data ops or ML ops or AI ops how you want to frame it which is really focused on how you do these transitions from ML sorry from day one to day two to day hundred from design one to design two so to give us a perspective about how we are thinking about data ops in this new age of ML and AI we are going to have Nitin Gupta here from AppDynamics is going to talk about the age of AI ops and share his perspective on that over to Nitin hello hello yeah hi can you hear me at the back okay thanks welcome guys welcome folks my name is Nitin I am director engineering at AppDynamics and I am here to talk to you about AI ops which is basically an umbrella term for algorithmic IT operations and as an emerging technology in the field of applying ML and AI algorithms to a class of problems for monitoring and troubleshooting of large scale production environments so this is the agenda today I will probably introduce the problem first like what problems do normally large businesses who are going through a digital transformation journey are experiencing in terms of managing their large scale environments because right now the user experience is actually their business so this entails continuous uptime on the services themselves I will introduce the concept of AI ops like how the problems which we speak about how AI ops can solve that I will then focus on areas which are investment opportunities for AppDynamics and the work we are doing in this space what is our roadmap looking like in terms of areas which we are continuously investing in and then I will just finish with this overall vision for AI ops which is an uber vision from AppDynamics and Cisco together so I think not long ago most of the organizations just live within one data center and they just had few services running within this data center but over time this complexity has really become it's exploded there are applications which are running in hybrid cloud environments they are using various new technologies deploying through Kubernetes using dockers using serverless architectures and my rate of interconnected services using different kind of technologies within these cloud environments so if there is an issue which is happening on the end user side which is accessing the service on the mobile device how do you debug that problem it could be a code problem which could be thread contention inside the code running in some specific component it could be a configuration change which you made in the application or it could be the geographical location of the end user itself where he's not probably having enough connectivity this introduces a lot of problems in terms of being able to debug the problem I'll take a very concrete example so imagine a checkout transaction flow this represents the various services which are really powering this specific end to end transaction and then there is metrics you can see some of these calls per minute some of these calls are being reported the performance of the different APIs these are coming into the system so what you'll normally end up doing is manually configure the alerts or health rules for these different metrics across the various metrics which are coming in the system there are CPU related metrics, memory, network 
disk, errors, respond times so you'll end up basically creating health rules for these individual metrics and then you would also want to do it across the different services which are there there are multiple nodes which are powering these different services and then there could be multitude of transactions which you actually need to monitor so once you've kind of done first level of configuration you will end up spending a lot of time just tuning these configurations themselves because over time you will try to see that there will be a lot of alerts coming in the system which may not be the right alerts, they may be false positives and then so you want to kind of tune the thresholds to solve for those problems and imagine doing this for thousands of servers, hundreds of applications across millions of metrics and if you actually end up finding a problem wherein the system is able to baseline some of these metrics and generate actual alert which would potentially be average response time for this particular transaction is reporting or alerting to be very high or different from the normal behavior you will basically need to figure out where the problem is happening within the system so you'll start to analyze with the entry point which is the order service here start there, try to figure out the alerts and metrics which are kind of deviating for this particular service and then drill down into the myriad of services which you have and if you basically will figure out which dependent service is actually creating the problem there are obviously multiple nodes which are powering this particular application so you'll have to look at individual node performance if there are outliers specifically in the node performance and again imagine doing that for the complex application which we already spoke about in terms of just becomes very very difficult so application performance management cannot be a manual task as you can see the complexity itself with the complexity in the application and the scale which is present I mean you just human cannot fathom in terms of we are definitely at troubleshooting these problems we need to definitely adopt a mindset where we are able to the system is able to generate insights in real time and then be able to automatically take actions to solve those problems so the system should provide answers instead of having to manually investigate those problems the system should be able to automatically take actions instead of having the users analyze and then take those actions on their own and the system should be able to really predict whether the performance is degrading of an application and then make corrective actions before that the problem can actually happen so basically what is AI it's basically AI applied to IT operations with the advancement of big data technologies and AI we provide for very autonomous IT systems and I think this will only be successful if you are able to get all the data which is encompassing your system you need application-related data you need the end user experience data you need your infrastructure data you need your network data and you need your security data so you bring all of this data together and this could manifest itself into different types where all of these data elements events, logs, prime series data itself and then you apply specialized AI models built for solving for these problems these are some of the areas which are potentially avenue areas in terms of being able to detect automatically be able to detect patterns 
within the code, be able to correlate all the data across these different domains to do cross-domain correlation, be able to do automated anomaly detection, and, once the anomaly is detected, be able to automatically figure out the root cause of the problem itself and then provide recommendations or automatically solve the problem. Okay, so now that I have set the context in terms of what is possible, I'll take you through what the focus areas have been specifically for AppDynamics. These are the three problems we are focusing on: anomaly detection, root cause analysis and autonomous action. Just to compare against the previous world: we don't want the users to be manually configuring these health rules — the system should automatically figure out the anomalies within the system. We don't want the users to look at the flow map, as we showed in the previous slides, where there is a flow map of different services and the user has to drill down into different services to figure out the root cause — we want the system to automatically do the root cause analysis in terms of where the problem actually exists in the system. And the system should start with providing recommendations for certain problems, but then move over to taking those actions automatically once there is enough belief that those actions solve the actual problems being faced. So I'll just go into details on some of the anomaly detection models which we're using. I think one of the primary things we've figured out is that univariate analysis is definitely not going to help us here, meaning if you're looking at these metrics individually, that is not going to be very helpful — we need to do a multivariate analysis across all the metrics for a particular entity. Why do I say that? Take an example: calls per minute versus CPU utilization. When the calls per minute tend to go high, the CPU on a particular instance of the node will also increase. So if you have alerts set specifically for just the CPU, or if you're looking at CPU in isolation, you would obviously end up with a false positive. Whereas if you're able to see both of them together, and calls per minute are increasing along with the CPU, that is probably not an anomaly situation. So all of the metrics for a particular entity are always going to be correlated with each other, and we have to look at all of this data together. Now, when you want to do this analysis, there could be thousands of metrics for a particular entity, so there is a dimensionality problem — we need dimensionality reduction in this analysis. So what we do is: all the incoming variables are translated into new variables which are not related to each other, and those new variables come from the eigenvalues and the eigenvectors of the input covariance matrix. The covariance matrix is the square matrix which captures the covariance across the different metrics for this entity, and you calculate the eigenvalues and eigenvectors of this matrix. These provide us variables which are not correlated to each other and which also capture the variance within the data set itself. If these eigenvectors are then sorted by their relevance, which is the eigenvalue itself, you can eliminate the eigenvectors which have lower eigenvalues.
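This is essentially PCA. A minimal NumPy sketch of the idea as described — not AppDynamics' actual implementation, and the toy metrics, thresholds and variance cut-off are made up:

```python
# Minimal sketch: PCA-style multivariate anomaly detection over one entity's metrics.
import numpy as np

def fit_baseline(X, var_keep=0.99):
    """X: (n_samples, n_metrics) of 'normal' observations for one entity."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)                   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # symmetric -> eigh
    order = np.argsort(eigvals)[::-1]               # sort by eigenvalue, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_keep)) + 1
    return mean, eigvecs[:, :k], eigvals[:k]        # keep the top-k eigenvectors

def anomaly_score(x, mean, components, variances):
    """Mahalanobis-style distance of a new observation in the reduced space."""
    z = (x - mean) @ components                     # project onto eigenvectors
    return float(np.sum(z ** 2 / variances))

# Toy usage: calls/min and CPU normally rise together.
rng = np.random.default_rng(0)
calls = rng.normal(100, 20, 500)                            # calls per minute
cpu = 40 + 0.2 * (calls - 100) + rng.normal(0, 5, 500)      # correlated CPU %
mean, comps, variances = fit_baseline(np.column_stack([calls, cpu]))

print(anomaly_score(np.array([180, 56]), mean, comps, variances))  # joint rise: lower score
print(anomaly_score(np.array([100, 90]), mean, comps, variances))  # CPU alone: much higher
```

A new observation whose score exceeds some accepted threshold would be flagged as anomalous for that entity.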
So you have basically translated your incoming variables into a new set of variables, and you are able to eliminate some of them, which gives you the dimensionality reduction. I have an example here, a two-dimensional example: we just have two metrics in the system, and there are data points plotted in terms of how the metrics relate to each other. The eigenvector is the red line here, which captures the variation within this data set very well. So when a new observation comes in — the metric values for an entity — you calculate its distance from the distribution along these eigenvectors, and if this distance is larger than a certain accepted value, that is an anomaly within the system. Once an anomaly for an entity is calculated, you mark these individual entities as anomalous within the system. And if you have a weighted graph of how these entities are related, based on the impact relationships of these different objects — one service, say a parent service, is impacted by its dependent services — then the root cause analysis is primarily a graph traversal algorithm. You start from the start and drill down into the graph to find the actual problem, the entity which is anomalous and creating the problem. There could be multiple traversals within the graph, and then we rank these different suspects based on the weights of the path itself and the depth of the object within this impact graph. So this is depicting the data pipeline which actually runs this complete flow. You have the metric time series for all the entities which are defined in the system — it could be only the leaf-level entities which send you this data; these are the actual nodes which are sending this metric data. So you have the ingest layer; the ingest layer talks to the topology service to create the actual topology of the application. If a new entity comes into the system, there will be a new metric which this entity sends, so you create the topology for this system from that, and right now we store all of this graph data in Neo4j. Then, since the incoming metrics are really at the leaf level, you want to calculate the metrics at the parent entity level — if this is coming from all the nodes within a service, I want to calculate the actual aggregated metric values at the service level. So we have this entity metric aggregation service which calculates that, and all of this raw data and the aggregated data is fed into the model itself, which runs and calculates all the alarms. The time series data itself which is coming in is fed into a time series service, and we use OpenTSDB backed by HBase for storing all of our time series data, along with the baselines and anomalies which are being triggered by this application. We also do the training itself in real time: there is aggregation required across this incoming data, and the model is continuously getting trained in real time and getting refreshed on the actual component which is running the model for incoming data. All of these services are real-time services, because you have to act fast — you have to detect anomalies in the system for the consumers as early as possible. And the scale is obviously massive. I think right now, with our current production deployments, we are receiving close to 350 to 400 million metrics per
minute and we are able to really detect a lot of these anomalies in under 3 minutes and this is some this is now looking back to our previous world just picked up the same example this is the same checkout application and then there is the response time metric which is on the right hand side which is kind of deviating basically this particular transaction is actually critical and this is the actual metric which is deviating for that and this is the topology we had so there is the order service which was the start and then there is the payment service again having a problem and the system is able to really call out these two as suspected causes and ranks the payment service as the one which is actually creating the problem it's also able to point into the actual node which is having the high CPU percentage so all of this is so once the anomaly is detected the system, the graph reversal algorithm is able to detect all the suspected causes and be able to flag the ones which are which have a higher rank and this is the actual node which also shows up the actual metrics the node level metrics which are actually deviating and it just puts that in front of the end consumers and you can also then configure certain autonomous actions with the service itself here it's a very simple example in terms of where it's just a notification mechanism which is getting sent out but you can plug in any kind of autonomous action in terms of basically take action on this specific incident and yeah I think we've already deployed this with one of our large customers which is this is a bank in UK and they've had a lot of success with this I mean they're able to really reduce their false positive or the number of incidents which are getting raised and they're able to really save a lot of time actually time which is being manually by the operations folks so I think looking forward I think these are some of the areas which are really under development for us and I just want to call out some of these things to provoke certain thought so what is happening is if a service is actually identified so this is a business transaction average response time if you see an issue in that and we identify that the service is actually creating a problem what you can do is cluster the node performance of the service of the nodes which are powering the service and if you find if you find there are nodes which are showing which are outliers or which are not showing the same performance what you can do is then figure out if there are specific tags you basically cluster the tags for these different nodes and try to figure out if there are certain tags for these nodes which are different so when I say tags this could be this could be the OS which the node has it could be the version of the software which is running it could be the other infrastructure which is kind of deployed on this node it could be the memory the CPU what is different about this node why it is showing this outlier performance as compared to other nodes and if all the nodes are showing the same level of performance then the system can really try to analyze the code segments and bring the segments which are really running slowly on top which will definitely help with the analysis of the issue itself so yeah I mean just coming back to this based on ART we cluster the performance of the different nodes and as you can see I mean there is one node which is definitely showing up as an outlier within the system so we are able to cluster the performance of the nodes you can detect 
Once these node outliers are detected and you cluster the metadata of the misbehaving nodes, you can start to see certain patterns: there are different tags associated with the service, such as system properties or the version of the software running, and you can find anomalies in these tags compared to the nodes that are running fine. That would really help with the analysis of the issue itself. And if the node performance is the same, what you can do is analyze code segments across the transaction snapshots of your application. AppDynamics provides a way to collect transaction snapshots; a snapshot is essentially the whole stack trace of the execution of a transaction within a particular instance of the application. You can then identify code segments which appear very often within these transaction snapshots and also associate the time taken to execute those code segments. As you can see, the system has identified four specific methods which look anomalous to it, and it has bubbled up the code segment that appeared the maximum number of times and is the function that is really running slowly, or where the most time is being spent in that specific function call. By bubbling up these code segments, it can help with automatic analysis of the actual code that is impacting node performance. So this really represents our overall vision. As AppDynamics we have always had access to a lot of the data itself; we have business and application data, and through our integrations with Cisco and other things we are also able to collect network and security data. We feed all of that into this correlation engine we have invested in, which provides for unified querying of the data, and through the system-generated insights we arrive at the root cause of a particular problem. The system can then automatically take a lot of actions, and I have given some examples: you can do automated resource scaling, you can raise alerts or notifications, you can provision things within the network once you have identified that a network problem exists, and the system can automatically take actions to fix the network issue; if a security issue is identified, security enforcement can happen. Once all of these actions are taken, feedback can be provided back to the system on whether the issue has actually been fixed. That is what I had, so I am happy to take questions. I will also be around outside, and we have a session planned in the afternoon, so we do have time and can take a few questions.
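To make the code-segment idea a little more concrete, here is a rough sketch of the counting step (my own illustration, not the product's algorithm, and the snapshot format is invented): given transaction snapshots, each a stack trace with per-frame timings, count how often each method appears and how much time it accounts for, then surface the worst offenders.

```python
from collections import defaultdict

def rank_code_segments(snapshots, top_n=4):
    """Rank methods by how often they appear across snapshots and how much
    time they consume.

    snapshots: list of stack traces; each trace is a list of
               (method_name, self_time_ms) tuples. Hypothetical format.
    """
    occurrences = defaultdict(int)
    total_time = defaultdict(float)
    for trace in snapshots:
        seen = set()
        for method, self_time_ms in trace:
            total_time[method] += self_time_ms
            if method not in seen:        # count each method once per snapshot
                occurrences[method] += 1
                seen.add(method)
    ranked = sorted(
        total_time,
        key=lambda m: (occurrences[m], total_time[m]),
        reverse=True,
    )
    return [(m, occurrences[m], total_time[m]) for m in ranked[:top_n]]

# Example with three snapshots of a slow checkout transaction
snapshots = [
    [("CheckoutController.submit", 5), ("PaymentClient.charge", 900)],
    [("CheckoutController.submit", 4), ("PaymentClient.charge", 850)],
    [("CheckoutController.submit", 6), ("InventoryService.reserve", 30)],
]
print(rank_code_segments(snapshots))
```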
Okay, Nitin, I had one question: where does the human come into the loop? Earlier we were talking about a lot of decisions being taken by the IT person, the human trying to build some rules, and now we are moving towards an automated system. In this automated regime, where does the human come in, whether it is verifying something or taking on a labelling role again and providing input back? How does that online learning, or whatever it is, fit into this?

I think this is one of the things I spoke about initially: the system can first provide recommendations to solve the problem, because some of these changes can be very disruptive and every system is very different. So if the system initially provides recommendations, humans can execute those recommendations and provide that feedback back to the system, and when a similar problem appears again, the system can become autonomous and take those corrective actions on its own.

Questions from the audience? Yes, we have got one here.

You did talk about correlations, so is it possible to correlate business processes as well? I can see straightforward instances in what you were describing, but business processes are at another level: how do you correlate them back to the real technical metrics in terms of infrastructure, memory and so on, and how do you abstract it up to a business-process level?

What we are specifically looking at from a business perspective right now is integrating with some of the business metrics themselves, if we have mechanisms to collect business KPIs. For example, say revenue is an important KPI that you are monitoring, and that comes as an input to the system. One of the direct things we could do is prioritize your anomalies: there could be multiple anomalies within your system, and business metrics can really help with prioritization, because if there is no significant business impact tied to an anomaly you can probably de-prioritize it compared to an anomaly that is creating more business impact. That is one avenue we are looking at in terms of using business data to feed the system. These are two different things, though: the business metrics are collected separately and the application metrics feed the system separately. The application metrics are really helping you figure out the application issues, and the business metrics are really helping you prioritize those issues. That is the avenue we are looking at right now; happy to hear any other suggestions, but that is one area of investigation for us.

What open source technology do you use for entity aggregation, and what is the typical time window for aggregation? Because there could be multiple sources, and you have to wait for delayed events.

Primarily we use Kafka Streams for the streaming aggregation itself. Right now there are fixed thresholds in terms of the data coming in and the time period which we feel is adequate for us to say that the data is complete during a time window; currently we are looking at about two minutes. This was arrived at through years of experience about what the right window is, because if you wait too long it may be too late, and as a monitoring system we have to bubble up these issues as early as possible. So we are primarily looking at two minutes as our window. A lot of this is also tied to the windows in which our agents send data: they aggregate data on the end-user applications for a minute and then send it to us, so we give about one minute for the data to come in and be processed, and that is when we mark the data as complete and have it flow through the downstream systems.
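The production system uses Kafka Streams for this, but the windowing idea itself is simple enough to sketch in plain Python (an illustration only, with a made-up event format): events are bucketed into fixed two-minute tumbling windows, and a window is only emitted once an extra grace period has passed for late-arriving data.

```python
from collections import defaultdict

WINDOW_MS = 2 * 60 * 1000      # two-minute tumbling windows
GRACE_MS = 60 * 1000           # extra minute to wait for late-arriving data

class WindowedAggregator:
    """Aggregate per-metric averages over fixed time windows.

    Events are dicts like {"metric": "art", "ts": epoch_ms, "value": 123.0}
    (a hypothetical shape, not the actual agent payload).
    """
    def __init__(self):
        self.windows = defaultdict(lambda: defaultdict(list))

    def add(self, event):
        window_start = (event["ts"] // WINDOW_MS) * WINDOW_MS
        self.windows[window_start][event["metric"]].append(event["value"])

    def flush(self, now_ms):
        """Emit averages only for windows that are past their grace period."""
        ready = [w for w in self.windows if w + WINDOW_MS + GRACE_MS <= now_ms]
        results = {}
        for w in sorted(ready):
            metrics = self.windows.pop(w)
            results[w] = {m: sum(v) / len(v) for m, v in metrics.items()}
        return results

agg = WindowedAggregator()
agg.add({"metric": "art", "ts": 10_000, "value": 120.0})
agg.add({"metric": "art", "ts": 70_000, "value": 180.0})
print(agg.flush(now_ms=400_000))   # {0: {'art': 150.0}}
```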
We have a question there at the back.

Hi. Am I audible? You talked about a lot of services having dependencies on other services, and about figuring out which services are misbehaving and what their dependencies are. Do the dependencies of the services need to be provided explicitly by the customers, or do you auto-discover them?

This is auto-discovered. Once our agent framework is integrated with the application, and for a Java application this is just a drop-in jar, any outgoing calls are instrumented by us, and then we do the correlation on the server side in terms of how these different entities talk to each other. Any outgoing calls, whether RPC or HTTP, are instrumented by us so that on the receiving end we are able to detect that this was a transaction which started in a parent entity. I hope that answers your question.

Hold on, just one more. When you talk about instrumentation, there is also a latency factor that comes in when you instrument. How do you handle that?

You mean the instrumentation itself bringing in overhead? That is a primary goal for us. The system has been built over time, we have been in the market for the last 10 years, and one of the goals we shoot for is about 3% CPU overhead for the application, not more. There will always be a certain amount of overhead, and it also depends on the custom instrumentation you have done and the number of data points you collect, but with the large applications and large customers we have integrated with, I think we have been able to hit our goal of 2-3%.

Okay, we will stop here; a huge round of applause for Nitin. He will be back after lunch for the combined Q&A. Let us move on to the next session. We have talked about how to set up your data pipeline from day 1, version 1, and you have set it up; you have got a system to run it, to manage it, to see that it self-heals, hopefully, eventually; and now the data is there for you to start using for your analytics and your deep learning systems. One of the hard challenges is that you do not want all your analysts or all your data scientists constantly creating and recreating the features that go into this, again and again. So how do you take the data that is now available all the time and start to create something like a feature store, something accessible to the business, to start doing machine learning or deep learning on it? To talk about that, the anatomy of a production ML feature engineering platform, we have Venkat Pingali from Scribble Data, and he will take us through his journey and his experience. He has spoken here before, I think 2 years back at this same event, and I think he has a fresh perspective from his conversations with many people like you in the industry.

Can you guys hear me?
Yes. This is both a conversation and a thread that has been running for about 3 years, so much of what I am going to talk about is drawn from Scribble's own journey building a feature engineering platform and from talking to a lot of customers; I can see a couple of my customers here, and thank you for giving us the opportunity to learn. These are the set of questions that a fictitious customer of ours had, and they are questions a lot of you might be thinking about, because there has been a very significant shift in the last year, year and a half or so. Until recently the conversation was about whether we can build models, interesting models in which we have confidence, and so on. Now the conversation is shifting to: can I make them trustworthy, can I make them robust enough that I can run them every day and the business can actually depend on them? These are some of the questions my customers were looking at. What is Uber investing? Uber has 400 data scientists and a 25-person team only for their ML engineering platform; is that the kind of investment required, why is it so complex, why is it so difficult to build, and so on. The idea here is that we will give you a perspective on this cost and complexity and dive into feature engineering, one of the components of the platform; hopefully you will walk away with at least partial answers to whether or not you should build one, and what it could look like. A little bit about us: we are an ML engineering company based in Bangalore, and we have been in existence for about 3 years. Let me jump right into it. This is a slide from Achal's presentation; Achal is from Uber's Michelangelo team, and he gave a talk at this very venue about a year back. Michelangelo stands out in terms of its design as well as its sheer capability: 5,000 models crunching petabytes of data and serving an insane number of requests. What has happened in the last year? Much of this was built in house, but in the last year company after company, all the names we are familiar with and who are doing great work, have been coming out and discussing the architectures of their systems, the most recent being Railyard from Stripe. My strong recommendation is that these designs and the ideas underlying them reflect some of the cutting-edge thinking; we have personally benefited a great deal from them, and you should also look at them for inspiration. Now, we know something about Michelangelo, which is that it took 20-25 people over a period of many years to build. The question is: what did they actually build? We can talk in terms of technologies, but can we separate out the Hives and the Kafkas of the world and look at the core ideas? When we started comparing all the ML platforms that organizations have built over a period of time, we started arriving at our own understanding, but luckily for us Google had already articulated some of this. This is a picture from a paper at the NIPS conference in 2015. It was somewhat controversial, because essentially what this picture showed is that all the stuff we talk about, the blog material, all the nice charts, is actually a small fraction of the effort that goes into putting models into real production. What is the rest of the system doing, if it is not crunching statistics and it is not scikit-learn code? When we looked at these components very closely, we found that the reason all of them exist is that
there is a degree of uncertainty associated with this ML code, and all of these components are about achieving four objectives related to that same ML code. The first one is speed. This comes from the realization that nobody builds one model; we are going to build more models with every passing day, and if it takes a long time to build something, then the value of impacting the business is much lower than the cost that goes in. So one of the major purposes of all of these tools is to put more and more models into production. You may not be doing 5,000 models like Uber, but you are definitely doing 25 or 30; every organization we have seen that can conceptualize one model can easily conceptualize 10 models, because the scale, the complexity and the data meet all of those requirements. The second problem is correctness: it is not enough to put this ML code into production, because many of these models are making very risky decisions. We know of one entity whose ML model can effectively deny you a telecom connection or a loan, or mark you as a fraud. These are risky decisions that have a lot of impact and side effects in the world, and they affect real people, so you want to know that you can actually believe the ML model you are putting into production. A bunch of the tooling is around knowing that I have deployed the right one, that it is performing well, and that I can trust it a month from now, two months from now, five months from now. The third one is evolution. This reflects the idea that our understanding of what problem we are solving, and how we are solving it, is actually incomplete when we put the model into production; it will evolve as you see more corner cases and as you get more data, so you have to take the entire system along as you version these ML models over a period of time. And the last one is of course scale, which is the most talked about: all of these systems are computing over large volumes of data, so you need some sense of control and the ability to scale. But the challenge we have is that Uber's ML platform is very, very specific to Uber; how much of it is actually usable by us? The same is the case with Stripe, and these are all unicorns with very unique problems. So we were thinking about what we can learn from all of these and from our own experience, because we were building these ML engineering platforms ourselves. It turns out there was a fantastic presentation from Gojek at the Databricks conference in April which had one nice slide, and you can see that there is a broad convergence: there are three big problems being solved by ML platforms. The first has to do with the training of the models itself: all the scikit-learn work, tracking all the versions, tracking all the accuracy levels and so on, and doing this in a scalable fashion, especially if it involves deep learning. The second part has to do with taking the models, once they are trained and sitting in a database of some sort, putting them into production, making sure they are running all the time, and being able to monitor drift and whatever other aspects of the model are of interest. The third piece is feature engineering. What we are finding is that in organization after organization people have stopped working on new algorithms; the set of statistical methods that you use is shrinking and standardizing, and
more and more of the model behavior is determined by the underlying data sets. So there is an entire piece that is about generating, managing and tracking those data sets, and broadly we call that feature engineering. You can map this onto products already in the market: MLflow comes to mind, and there is a talk on MLflow later and the Databricks folks are also outside; there are Kubernetes and Kubeflow, and there is a BoF on that as well. The part that we felt did not receive much attention is the whole piece that involves the data preparation, the feature matrix generation, that goes into these models. So the rest of the talk will be only about this left-most piece, and we can talk about the other things later on; there is also a BoF later today on ML platforms, with some fantastic folks from Walmart, Swiggy and other places coming to talk about their ML platforms, so please join us for that. Now, what is feature engineering? This is how ACME's data actually looks: these are just transactions, somebody bought some product at a certain time. But what we really want is a large matrix with many columns, columns like what percentage of the transactions go to premium products or exotic products, and so on. The big change, the reason you need full-fledged infrastructure for this feature engineering, is that if you look at what has been happening, the first thing is that the set of features you are computing has been exploding. Earlier we would have 20 different columns, 20 attributes; now thousands is something people can easily imagine, and these are changing continuously depending on the number of models, the kinds of models and the methods you are using. The other thing that is happening is that new data is coming in and new customers are coming in every day, so whatever your feature matrix is, you need to recompute it at a fairly high frequency. And there are extreme cases: we have an automotive customer that is building hundreds of these models, and it is not even feasible for them to go and manually feature-engineer their system, so there is a lot of use of AutoML and automatic feature development. Until recently this whole activity was subsumed by the data scientists; it was a part of your scikit-learn code. With every passing day, as you put more models into production and as the cost of this activity grows, it is emerging as an activity in itself. Michelangelo, for example, today has a full-fledged team doing this. That may not be the case in most of your organizations, but you should see this as a fairly standalone activity, and you will understand why over the next few slides. Now, when you look at the feature engineering systems that have been built by these organizations, what we find is that there are two big drivers for what the architecture, or the flavor, of the system you build will be. One is the sheer scale at which you are operating: if you are operating at petabytes, that is an entirely different system and you have to engineer it to the maximum. The second is how closely tied your modeling strategies are to the feature matrix itself; in deep learning you may not be able to decouple these. Most of us fall into the top-left category: we are dealing with terabytes of data and standard ML modeling.
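Going back to the ACME example for a moment, the transaction-to-feature-matrix step can be sketched in a few lines of pandas. The table and the column names here are made up for the illustration; the point is only the shape of the transformation: raw purchase rows in, one row per customer with engineered columns out.

```python
import pandas as pd

# Hypothetical raw transactions, one row per purchase
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "product":     ["soap", "truffle_oil", "rice", "saffron", "atta"],
    "amount":      [40, 900, 60, 450, 55],
    "is_premium":  [False, True, False, True, False],
})

def build_feature_matrix(txns: pd.DataFrame) -> pd.DataFrame:
    """One row per customer, one column per engineered feature."""
    # Zero out non-premium spend so the premium share is a simple ratio of sums
    txns = txns.assign(premium_amount=txns["amount"].where(txns["is_premium"], 0))
    grouped = txns.groupby("customer_id")
    features = pd.DataFrame({
        "total_spend": grouped["amount"].sum(),
        "txn_count": grouped.size(),
    })
    features["premium_spend_share"] = (
        grouped["premium_amount"].sum() / features["total_spend"]
    )
    return features.reset_index()

print(build_feature_matrix(transactions))
```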
Now, what we find is that this is actually a large space, and there are many more variants coming. One immediate subspace we see even within ML: we had an IoT customer with a billion rows and five columns, and the way you process that is going to be very different from our retail customer with a million rows and 60 columns; the computational requirements as well as the interfaces you provide are going to be very different. And this is just the beginning: depending on whether it is on the edge or in the core, real time versus not real time, automated versus non-automated, you will see a bunch of new discussion around this over the next two to three years. Let me jump into what we have found over a period of time. Essentially, the whole feature engineering platform or service we are talking about takes the raw data, transactions, logs, event files, whatever it may be, and turns it into large matrices called feature matrices, and there may even be a repository of all of these, because oftentimes you want to know the features at a point in time: three weeks back, when I trained this model, what were the features? Those are the kinds of questions. When you look closely at this, you will find there are actually three sub-problems within feature engineering. The core is intuitive and we understand it: there has to be one piece that does all the processing of the raw data and generates the features, and it looks a lot like a DAG, because this code tends to be fairly complex. In one customer case, an IoT customer, we had 6,000 lines of pandas code, the code would run for about 24-48 hours, and it was simply not possible to unit test any of it, so it had to be broken up into small pieces and organized in a certain way; we will get into a little more of this. The second part is that, as the risk associated with this modeling grows, you want to know that it has been computed correctly; you want to be able to access and look at any data set that has been used to train any model, because if the questions do not come today they are going to come tomorrow, and they may come from your CEO, from regulatory authorities, from anywhere. So the second part has to do with whether we actually believe what has been computed. An equally important piece is the input data itself: this is computationally intensive, and it is also a very painful process. You first want to know that your data is trustworthy, down to the last customer; every customer's data sources are suspect in some way or other. In the case of our retail customers, gaps in the data, duplicates, errors, changes in formats, all of this is a given. We used to think that sensor data would be a little cleaner because it is generated by machines, but sensors have their own set of problems: a sensor getting stuck at a certain value, and even the manual process around the sensors is uncontrolled. For example, we had sensors coming from a mall where the staff would walk around and turn the sensors off; the sensor is not the problem, the process around the sensor is still problematic. In all cases, if there is one obvious takeaway I can recommend, it is: be paranoid at every step of the way; do not believe what has been given to you. Having a lot of checks and balances through the entire life cycle is what is required, and if you ask what it is that Uber and Railyard and everybody else do, it is a lot of error-checking code all over the place.
That is one way of thinking about what they do. Now we will zoom into each of these, and depending on how much time there is, I will give you some high-level questions to think about when building this. The important thing is that, unlike the Flipkarts and Ubers of the world, most organizations are already burdened; they do not have the ML engineers that are required. ML engineers are half data scientist with a lot of systems skill, and that is a very small intersection, because they have to understand enough about modeling to be able to build the right kind of systems. So every customer is facing resource limitations when they think about building all of this; churn in staff is almost a given, and the amount of compute they have to do every day is also growing. You have to think very carefully about building this system, because it is all code, all technical debt that you are undertaking. Now, a little about feature richness and what we want. This is how ACME's data looks over a period of time: the same two columns on day one and day two, and suddenly on day three the version changes because we drop an old column and get a new one. ACME had N pipelines, each of which was generating N data sets like this, and they are all going through versions. You can quickly see that the number of combinations, by the time a year of runs has passed, easily generated about 40,000 data sets. Imagine tracking them: who knows which version of which model was trained with which of those 40,000 data sets? So the first problem you will encounter is: how do I compute this reliably, and in a fashion where I get some sense of confidence? Essentially you are talking about a set of capabilities. We all know pandas; that is the core compute engine to generate the features themselves. What is missing is a bunch of things. The first is the ability to break up complex code: like I was saying, our IoT customer's feature engineering code was about 6,000 lines, and there is no way in hell you are going to test that kind of code in one go. So you need the ability to break up this complex multi-step process into individual modules, each of which can be tested, and which can all be stitched together. Then there is parameterization: it is common that you will recompute the features for some areas and not others, or override some parameters and not others; operating the pipeline is an activity in itself, so you need a set of capabilities around that. State management has to do with the fact that, in this complex multi-step process, data frames move from one module to another to another, and each module is enriching, modifying, dropping, doing something interesting, so state management is again missing. And of course I talked about pandas itself. Whatever your mechanism, you are talking about getting this set of four capabilities, and we can discuss what technology combination will get you these things; you also have to think about versions, about documentation, and lastly about resource management. The second part, which came to us slightly later in the journey, is that it is actually laborious writing all of this code, and our features keep growing, shrinking and changing over a period of time.
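A minimal sketch of the "break it into small, testable, parameterized steps" idea (my own illustration, not Scribble Data's actual framework): each step is a plain function from DataFrame to DataFrame, and a tiny runner stitches them together, so every step can be unit tested on its own and behavior can be changed through parameters rather than code edits.

```python
import pandas as pd

# Each step is a small, independently testable transformation.
def clean_amounts(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    df = df.dropna(subset=["amount"])
    return df[df["amount"] > 0]

def filter_region(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    # Parameterized: recompute for only some regions without touching code.
    regions = params.get("regions")
    return df if regions is None else df[df["region"].isin(regions)]

def add_premium_flag(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    threshold = params.get("premium_threshold", 500)
    return df.assign(is_premium=df["amount"] >= threshold)

PIPELINE = [clean_amounts, filter_region, add_premium_flag]

def run_pipeline(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    """State management in its simplest form: the DataFrame is the state
    handed from one module to the next."""
    for step in PIPELINE:
        df = step(df, params)
    return df

raw = pd.DataFrame({
    "amount": [120, None, 800, -5],
    "region": ["south", "south", "north", "east"],
})
print(run_pipeline(raw, {"regions": ["south", "north"], "premium_threshold": 500}))
```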
So over a period of time we have moved from pandas code like this, which defined some of the features, to something more like a specification language. For each customer we literally define a new specification language in which the data scientists or the engineers can express very compactly what they want, and one big result is that the number of errors reduced. We all know that in pandas a single line of code can do a lot of heavy lifting, so imagine 6,000 lines of code and figuring out how to make that robust; specifications are probably the way to go. You should think in terms of: how do I make it really easy for data scientists to say what needs to be computed, separate that from how it is computed, and keep the how very simple, so that it becomes robust over a period of time. Remember, the reason this all matters is that the production systems we are talking about have to run every single day, and there are real applications that depend on their outputs, so you need to invest in the kinds of mechanisms that give you the ROI over a period of time. Then we can talk about the different ways to specify the DSL. Gojek has an open-source project called Feast which has its own specification language; you may want to look at it. The languages from the other platforms are not very clear, but we know there is at least something like Gojek's, so you should expect that in the next 2-3 years there will be new open-source specification languages that work across organizations, simply because there are some standard patterns here; it is not completely custom to every single organization. The next one is an under-appreciated capability that you need. What we found in the case of our retail customer is that they actually had anthropologists who knew that when you bought a certain flavored Haldiram's bhujia, they could tell a lot about you: the individual, the family, the culture, the location and so on. There was a lot of tacit knowledge which the organization had; they wanted to ingest all of that information and be able to link it to features, because then the features become really rich. In one case they had 15,000 SKUs and they wanted to tag every one of those 15,000 SKUs with aroma, and do not ask me what the aroma of atta is, but they had ideas about the signal contained in it. We have found that in every one of these customer cases there is tacit knowledge, so you have to have a mechanism to ingest this third-party information into the platform; please plan for it. You can use a third-party integration or you can build one; these are not complicated, and in our platform, after the first couple of cases it was very obvious, so we make it available out of the box now; you get one whether or not you use it.
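The "say what to compute, keep the how simple" idea mentioned a little earlier can be sketched very compactly. This is a made-up mini-DSL for illustration, not Feast's or Scribble's actual language: the data scientist declares the features, and a small interpreter does the pandas work.

```python
import pandas as pd

# A declarative spec: what to compute, not how.
FEATURE_SPEC = [
    {"name": "total_spend",  "column": "amount", "agg": "sum"},
    {"name": "max_purchase", "column": "amount", "agg": "max"},
    {"name": "txn_count",    "column": "amount", "agg": "count"},
]

def compile_features(txns: pd.DataFrame, spec: list,
                     key: str = "customer_id") -> pd.DataFrame:
    """Interpret the spec against raw transactions. Keeping the 'how' this
    small is what makes the pipeline easier to keep robust."""
    grouped = txns.groupby(key)
    out = {}
    for feature in spec:
        out[feature["name"]] = grouped[feature["column"]].agg(feature["agg"])
    return pd.DataFrame(out).reset_index()

txns = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [100, 250, 80],
})
print(compile_features(txns, FEATURE_SPEC))
```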
This gets into the more interesting part: I had so many pipelines, with so many runs over so much time, and each one of them generates a lot of data. Where is all this data, what do you know about it, and how did we arrive at it? We are talking lineage, we are talking auditability and all of those kinds of things. We found that without these mechanisms we were spending almost 20-30% of our time just reconciling data: where did a number come from? Invariably, whenever you generate a number, there will be somebody in the organization who will disagree with it: you said our return rate was 2.5, but your number is showing 3. You have to be able to defend those numbers, and it is going to happen in every organization, because what we found is that organizational processes are very undisciplined: there is a lot of stuff happening in Excel, a lot of stuff happening in emails, and lots of conventional wisdom about what the facts are in a given organization. So you should expect that if there are metrics you are computing, there are battles ahead to prove that yours is actually the real metric, and you should have the mechanisms to do it. This gives you some of that thinking: it shows the important decisions the pipeline made before it arrived at the feature set, and exactly where. And by the way, there is a checksum: if you are training a model, you have to come back to me with the checksum so that I know which training data set it was, and from there we can go look at the logs, the audit logs and the dependencies, and get down to even the last record that was used to generate this. We found this to be incredibly useful; it shaved off almost 20-30% of our time, because a single question about a number can mean spending days getting to the underlying proof. Then of course there is lots of validation, which is one of the things we felt was missing in pandas: it does the manipulations of the data frame, but how do you know that the output data frame is what you expected? We are talking about even something as simple as counting the number of records; it starts from there, and then you look at distributions, for example if you are sampling customers, how does the sample distribution compare to the original distribution, and so on. These are all checks you need to build over a period of time, whether you do them in the pipeline code or afterwards in audit and other search interfaces; you can decide what the best way is. One of the things we provided is this: invariably, when a model is not performing well, the first question is whether it is the model or the data, and once the debugging of the model code is done, attention immediately turns to the data. What has changed between today and yesterday? Why did the model give a different answer today compared to yesterday? So there is a bunch of tooling around figuring out the drift of both the model and the underlying data sets; maybe the version of the code that generated the feature matrix itself has changed, and that could be the source of the problem. In our system, for example, you do not go and manipulate any of the code on the server; the code gets deployed directly from GitHub, and we track the git commit that generated that output, so you can literally link the output back to the code and see what generated it. In many cases all of this is nice to have, but what we have found is that it gives us immense confidence that we are actually computing correctly, and that allows us to have a lot of good conversations with data scientists. Data scientists are typically overburdened; they have a lot of models to develop and they keep moving from model to model, so it is hard for them to remember the choices they have made. Imagine somebody coming to a data scientist and saying: three months back you rejected the loan application of somebody, can you tell me why? That is a very long-drawn-out, very painful process, so all of this will help with some of those questions.
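A toy version of the checksum-and-lineage idea (an assumption-laden sketch, not the actual implementation; the record fields are invented for illustration): every generated feature matrix gets a content hash plus the git commit of the code that produced it, so a model can always be traced back to its exact training data.

```python
import hashlib
import json
import subprocess
import pandas as pd

def dataset_checksum(df: pd.DataFrame) -> str:
    """Stable content hash of a feature matrix."""
    canonical = df.sort_index(axis=1).to_csv(index=False).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def current_git_commit() -> str:
    # Assumes this runs inside a git checkout of the pipeline code.
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

def record_lineage(df: pd.DataFrame, pipeline: str, run_id: str,
                   registry_path: str) -> dict:
    """Append a lineage record; the registry is just a JSON-lines file here."""
    record = {
        "run_id": run_id,
        "pipeline": pipeline,
        "checksum": dataset_checksum(df),
        "git_commit": current_git_commit(),
        "rows": len(df),
    }
    with open(registry_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record

features = pd.DataFrame({"customer_id": [1, 2], "total_spend": [350, 80]})
print(record_lineage(features, pipeline="retail_features",
                     run_id="2019-07-25T10:00", registry_path="lineage.jsonl"))
```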
Again, like I was saying, you never build one model; you are always building multiple models, and there is no customer of ours who does not have a very aggressive data science roadmap. They are all hiring, everybody; we are happy to do the Good Samaritan thing, so if you are available please let us know, because we know fantastic customers where you can go and help solve some of their problems. So we are talking about how you avoid duplication of work. One of the capabilities you need as part of this platform, and within your organization, is to discover what exactly you are computing in your system as of today: there are many pipelines on many different systems, and what are they all achieving? Having a single place where I can say, this is the feature, it is computed by this pipeline, it is available here, and by the way this is its statistical distribution, means I can immediately say, I have features I can use for my model, or maybe I have to ask for new features, and that is a nice conversation to have. But it starts with knowing what is happening in your current system. What we have found over a period of time is that the first version of the model really opens up a set of questions about how you are going to nurture it over the next one or two years, and you have to think through all of the problems and mechanisms we have discussed. Now, input correctness is actually somewhat my favorite, because I had not appreciated how dirty and messy the data is; you get stuck here. The modeling does not even begin before you have an understanding of what is in your data and whether it is complete and correct, and in most cases people do not even know what is in the data, simply because the organization is very large. In one customer case there was a transaction table coming in with 60 or 80 columns, from old Siemens Nixdorf POS machines this company had deployed all over the place; nobody had documented what those columns were, and sometimes a value has 10 zeros, sometimes it has 2 zeros. It took almost 2 to 3 months just to know what is in the data, and your data science has not even begun. So one recommendation I would strongly urge on you: if you do not have a handle on your input data, forget about the rest of it. Distrusting data is really important as a principle; do not use it until you have verified it. And there are lots of stability issues: sometimes the data comes, sometimes it does not, sometimes it comes with duplicates or format changes, and we do not know whether that has correctness implications. There are some open-source libraries for validating data, but there is no obvious, simple system that applies in all contexts, so in every one of our cases we build a health check that validates the data before it goes through any of the pipelines. Remember that this is all happening after the Kafkas, after the data has reached the data lake or your database or whatever the storage mechanism is.
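A bare-bones version of such a health check (illustrative only; the thresholds and column names are assumptions): verify the batch is big enough, the required columns exist, and key columns are not suddenly full of nulls or duplicates, before anything flows downstream.

```python
import pandas as pd

REQUIRED_COLUMNS = ["txn_id", "txn_time", "amount"]

def health_check(df: pd.DataFrame, min_rows: int = 1000,
                 max_null_fraction: float = 0.01) -> list:
    """Return a list of problems; an empty list means the batch looks sane."""
    problems = []
    if len(df) < min_rows:
        problems.append(f"too few rows: {len(df)} < {min_rows}")
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # no point checking further
    null_frac = df[REQUIRED_COLUMNS].isna().mean()
    for col, frac in null_frac.items():
        if frac > max_null_fraction:
            problems.append(f"{col}: {frac:.1%} nulls")
    if df["txn_id"].duplicated().any():
        problems.append("duplicate txn_id values found")
    if not pd.api.types.is_numeric_dtype(df["amount"]):
        problems.append("amount is not numeric; format may have changed")
    return problems

batch = pd.DataFrame({
    "txn_id": [1, 2, 2],
    "txn_time": ["2019-07-25", "2019-07-25", None],
    "amount": [10.0, 20.0, 30.0],
})
print(health_check(batch, min_rows=2))
```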
Beyond those checks, there was a case where the models were getting messed up, and when we started looking at the data we found a significant change in a datetime column, or a totals column, some column in the database. So we went back to the organization and said, tell me about this column. Oh sir, you did not realize: in 2013 we had a developer change, and that person changed the semantics of that particular column; earlier it meant this, now it means that. And that information is embedded in all their Excel sheets, in the tacit knowledge and the business rules, while here we were, struggling for almost 2 or 3 weeks trying to make sense of the data. The moment we knew, we had to go back and tell the customer: we are sorry, your 100 GB of transaction data from 2013 and 2014 cannot be used, because the two are not comparable. And then we had to go off and do other things. I can see some smiles. I think that as data scientists, as the data community, as we progress, it will help if we move from the fanciest part of the whole data science stack, which is the models themselves, to understanding it at a systems level and having end-to-end control, because in most cases the models are useless if your data is messed up. Let me move on to some other stuff. This is the key thing to take away from here: the economics of feature engineering. The cost is upfront, it grows non-linearly with the number of features, and each feature is very expensive. With these in mind, you have to ask what additional value each feature brings. If a data scientist says, I want to come up with a new model and I want a new feature to be computed, you have to ask why: if I did not have this feature, what would the accuracy be, and with the feature, what would the accuracy be? Does the ROI on the model justify the kind of investment you have to put in to keep this thing up and running every day? It should not surprise you that only about one in 50 models goes to production, and I think we only start asking the hard questions when we think in terms of production. We know the famous Netflix case, where they were giving a million dollars for improving the recommendations; I believe it did not go to production, because the complexity of implementing the alternative algorithm was way higher than the value it could deliver. So you want to ask a lot of questions: how many models will I have, what kind of complexity will there be, will the features be simple or very complex, do I have mechanisms like Google BigQuery where I can compactly represent some of these features, or do I have to write a lot of Spark code to do all of this, and of course how much volume are we talking about? This goes back again to the economics of the features.
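The "what would the accuracy be without this feature" question is easy to operationalize as a quick ablation check. A rough sketch follows; the dataset and model here are placeholders standing in for a real feature matrix, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder dataset standing in for a real feature matrix.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
candidate_feature = "worst radius"   # the feature someone is asking us to keep computing

def mean_accuracy(features):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(model, features, y, cv=5).mean()

with_feature = mean_accuracy(X)
without_feature = mean_accuracy(X.drop(columns=[candidate_feature]))

print(f"accuracy with '{candidate_feature}':    {with_feature:.3f}")
print(f"accuracy without '{candidate_feature}': {without_feature:.3f}")
print(f"marginal lift: {with_feature - without_feature:+.4f}")
# If the lift does not justify the cost of computing the feature every day,
# that is a strong argument against adding it to the pipeline.
```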
In all cases, what we find is that beyond a certain scale you will need a feature engineering platform, simply because you cannot do it on your laptop anymore and you have to be able to reuse features across models. Every large organization you know of has built its own ML platform overall, and within that a feature engineering platform, so you want to think about what your roadmap is and at what point you build your own. And if you want to do it today, we are early in the development of this whole space, and there are really only three options: Gojek has an open-sourced tool, co-created with Google, called Feast, but it is very, very GCP-specific; you can build your own, but it takes a very significant amount of effort to stitch together all of these very different systems that were not designed with feature engineering as an end purpose; and lastly, there are wonderful companies like ours which can do it, but commercial products have their own set of limitations, costs and so on. I expect that over the next one to two years you will see a lot more conversations, a lot more open source, and a lot more companies of our kind. One of the reasons we developed this early is that we started out as data scientists doing a lot of modeling and then realized that we needed an ML platform of our own, so we built one, initially at my previous company. From there we said, no, this first version looks terrible, let us go back to the drawing board, think about it, and rebuild a version of it ourselves. So you know the standard costs and benefits of open source versus stitching things together versus your own implementation, or even a third party like us. My strong recommendation is that the ideas are not yet fully developed in this space; people are still feeling their way through, and much of that thinking is embedded in the product documentation from each of these companies. You may want to look at them and read between the lines: ask what their real objective is, how it would look if you changed the technology choice, how it would look if you scaled it down. There is some fantastic thinking going on. I will leave you with a quick takeaway: the key difference, in my opinion, between ML engineering, of which this is a part, and data engineering is that ML engineering is very sensitive to the modeling process itself, to scalability and to versioning, so the thinking you need to bring to ML platform design is different from that of traditional data engineering. We will have a lot more conversation later today in the BoF, and some of us are around as well, so do feel free to get in touch with us for any conversation around ML systems and productionization.

Okay, a huge round of applause. We can take one or two questions, one at best, and then we will be back after lunch for the combined Q&A.

A quick question: I saw that you moved from code to YAML or some kind of language, right? YAML, JSON?

Use any one, it does not really matter what the specification format is. I will just come down off the stage, I guess; that is easier.

Okay, one last question.

Thanks. You mentioned that we should do data validation after the data lake, which makes sense, because if we input dirty data the model will not learn anything. But I think this data validation would be even more important after we do feature engineering, because the feature engineering part is still a place where we can introduce a lot of bugs, and if we only do data validation before it, we might miss those errors.

Yes, you are right, it is not either-or: you do it before, you do it during, you do it afterwards. It pays to be paranoid.

Okay, thank you. We break for lunch and are back at 2:20. Just one lost-and-found announcement: there is an Mi phone, black in colour as all mobile phones are, and we have another one, an iPhone 5s. If anybody can unlock the phone they can take it, so come prove your ownership; both of them will be at the AV desk, come and collect them.

Hello, hello, audible? Go a few rows back. Yes. Okay, so I will just do the start only. What I am going to talk about is how you build complex data pipelines. Louder? Okay. What I want to talk about is how you build complex data pipelines which are evolutionary in nature, using contracts. Now, okay, is it fine? What I am going to talk about is how you build complex data pipelines which are evolutionary
in nature, using contracts. This one, right? This will happen. Okay, this is better, right? Is it audible? Okay Raja, it is fine, this is a better position. Also, one more thing: if I want to take a quick water break, is it possible to keep a bottle here? Sometimes when you are going fast you want to take a break; just give me a minute. It does not work from there, it does not work from here; you have to be behind it, and this clicker stops working sometimes with distance, it is a USB Bluetooth thing, you can just use yours. Stand up when the AV team is here so they can add some volume. I just have to make sure that breathing sound does not come through; this is better, this is a good position actually, check this when I come on stage. All good, right? Just press this button, you will see which mouse button to use. Hello, hello. Better? I think it is audible here. Okay, sure, sure, thanks a lot. Hello. So, three kinds of patterns: ingestion, storage and consumption patterns... millions of events per second, 14 TBs... streams, search. Thanks. Testing, can everyone hear me? Breathing? No? Excellent. This is a good angle.

Hello, we will start in about 10 minutes with the joint Q&A on data pipelines. Before that, the Mi phone: I think somebody is looking for it; come and unlock the phone to take it.

Okay, welcome back everyone, hope you had a good lunch. We are going to start the post-lunch session with a few announcements first. Flash talks are scheduled at 4:55 in Audi 2; if you want to talk, please come and give your names, and I do not think we have any names at the moment. For people who are interested in talking about something they are doing, it is a 5-minute slot with a maximum of 7 talks; no presentation required, no laptop required unless you are doing a demo. Just come and talk about something you are doing, get a chance to present it, get feedback from the community and start those conversations: 4:55, flash talks, in Audi 2. If you bought the Fifth Elephant t-shirt, there is a t-shirt counter where you can go and collect it. The Birds of a Feather sessions, which are pretty much like what we are running here but even more informal, a chance to interact, talk about your use cases and share learnings, will be starting in Audi 2 and Audi 3, so look at the schedule and see if you want to go to any of them. Please do not bring food and beverages into the auditorium. And if you are interested in the tutorial tomorrow, run by Rajdeep Dua of Salesforce on TransmogrifAI, which connects a little with the feature engineering talk we had in this session, it will be in Audi 2 from 11:10 tomorrow, so have a look at the schedule and get ready for that. Before we go to the two talks that are scheduled, we have this joint Q&A with the three speakers from this morning. We talked a lot about how you start your data collection process from day 1 (Kumar talked about that and about getting your data in), and Nitin talked about how you prepare and ensure those data systems keep running, with AIOps.
Venkat then talked about how you ensure the data you have collected is managed in a meaningful way and transformed in a meaningful way into a feature data set, and the anatomy of that process. So we have those three talks, really good ones, around the entire life cycle, or at least the initial life cycle, of data systems. I am going to open it up; we have about 15 minutes for a combined Q&A, so if you have questions that cut across these topics, now is the time. Okay, I think people are still hesitant; let me kick it off with a few questions... any questions? Okay, sure, we have got a few.

Hello, my question is for Kumar. You mentioned that you push your data directly into the data warehouse in V1 of your data pipeline, which is kind of not the norm from what I have seen elsewhere, where the data does not usually get pushed directly into the warehouse. Can you give us an insight into what pros and cons you found, because it did work for you before you moved to your V2?

The reason we chose that anti-pattern is that we wanted the capability for a product manager or a developer to query data in as close to real time as possible. If you remember, we did not have any real-time pipeline or real-time dashboarding available in V1, because we did not have time to build it. What we wanted was this: as soon as our product or game is downloaded by our first user, we want to see what that first user did in the first hour of gameplay. If I have given the game to a user on the street, or I am testing it in the field, I want to see the live data: what did they do, how many clicks did they make, did they swipe too much, did they touch a part of the screen they were not supposed to? To do these queries we thought in terms of SQL itself, and most of the time you will not know in advance how you want to see the data, or what the right clickstream is, such that you could build real-time charts for it. So having that raw data available sooner, and queryable in SQL, was really helpful, and that is the primary reason we pushed data directly into the data warehouse. Now, obviously, the caveat is that you cannot keep doing this indefinitely, because your ingestion will take a hit: any data warehouse you use behaves much better, or at least differently, if you can batch your ingestion at intervals or with some logic, and if you are inserting data there while letting users query that live data directly, they are going to be fighting for the same resources. You have to figure that out depending on your usage and ingestion patterns. We were able to, because in the beginning usage was low and there were only a couple of users doing this continuously. As we grew, we still allow it, but obviously we have to define the queues very strictly, apply boundary rules, query cut-off times and all that, so that ingestion does not get starved for any reason. Since we are able to do that for our use case, we have been running this successfully, but you should evaluate it for your own use cases.

Any more questions? Yes, one here. I have a question for Nitin. In one of your high-level diagrams I saw that you chose OpenTSDB as the time series database. What was the reason for choosing that specific time series DB, given that there are a few players in the market, some of them much more dominant, like maybe Druid or InfluxDB? That is the first question; the second is, how are you using the time series DB in this specific context?
Yes, sure. For us, the time series use cases, at least in the current part of the product, are not very advanced, because we do most of the aggregations in our streaming platform and all of the core logic of the product has to run in a streaming way. The time series databases themselves really power our UIs, where we show trends of metrics, and for most of those things it is a very lookup-heavy use case: I know I want to fetch data for a particular metric across a period of time, and for that, KV stores tend to behave well. We are, though, exploring Druid as our next-generation metric store, because there are a lot of supporting use cases we want to power, where we want to allow users to slice and dice their data; that calls for read-time aggregations, which is something that comes out of the box with a system like that.

Does that answer your question, or are you looking for more details? Hold on, just get the mic.

Yes, so other than the roll-ups, what other things do you put in the time series DB?

The time series database right now stores the individual metrics, it also stores the baseline information for those metrics, and it stores a lot of our anomaly detection output. Anything which needs to be available for consumption on our portal is something that lives in our time series database, because if you look at anomalies or baselines, there is a temporal angle to them; anomalies are raised at a particular point in time, so you want a temporal view on a lot of these data elements. That is the primary use case for our time series database; it is mainly powering our UI, while our processing itself is very real-time and happens on the stream.
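The lookup-heavy pattern described here (fetch one metric's points for a time range) maps naturally onto a key-value layout. Below is a toy sketch of that idea, not OpenTSDB's actual schema: keys combine the metric id and a time bucket, and a range read simply scans the buckets in the window.

```python
from collections import defaultdict

BUCKET_MS = 60 * 1000   # one bucket per minute, mirroring minute-level rollups

class TinyMetricStore:
    """Illustrative KV-style layout: key = (metric_id, minute bucket),
    value = list of (timestamp, value) points."""
    def __init__(self):
        self.kv = defaultdict(list)

    def write(self, metric_id, ts_ms, value):
        bucket = ts_ms // BUCKET_MS
        self.kv[(metric_id, bucket)].append((ts_ms, value))

    def read_range(self, metric_id, start_ms, end_ms):
        points = []
        for bucket in range(start_ms // BUCKET_MS, end_ms // BUCKET_MS + 1):
            points.extend(p for p in self.kv.get((metric_id, bucket), [])
                          if start_ms <= p[0] <= end_ms)
        return sorted(points)

store = TinyMetricStore()
store.write("checkout.art", 30_000, 120.0)
store.write("checkout.art", 95_000, 180.0)
print(store.read_range("checkout.art", 0, 120_000))
```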
Okay, we have one more question there.

My question is related to real-time streaming. You had all immutable data, but what about mutable data? How real-time can you be, and what would be the way you would go, for example if you are replicating MySQL data and you want less than a minute of latency into your data lake?

That depends on what you want. When I said immutable, it was primarily about the way we want to use the data, because if you allow data to be mutable at some point in time, you may have one use case in mind today, but trust me, tomorrow there will be a thousand, and your warehouse cannot support that mutability, updates or upserts, for all of your historical data. You add a column and want to backfill it with a default value for your 2016 data set: that is hard to do at scale. So we chose to be immutable, and that is true for our use case, because it is analytics event data going in. If it is some other kind of data, maybe payments or order processing that has to wait on some manual verification, you will have to support it differently. I would very strongly suggest that in your traditional data platform you design things so that the mutable data sets are very small, and if you can live without mutability, that is much easier to handle on the infra side. The upside is that you do not have to handle those scenarios: whether the MySQL replication will finish in time or not, whether everything has to be in the same region and the same zones, and if you have to go cross-region, all your millisecond latency goes out of the window anyway.

Okay, any more questions? Well, let me pose one question to Venkat while we wait. Okay, go ahead.

My question is for Venkat. A lot of times, when data scientists start building their ML models, and before they even get into feature engineering, there is a lot of trial and error they want to do: they want to see how things correlate, how things vary across categories and so on. Are you also exploring this area, where you directly query big data and have a visualization layer on top of it, so you can immediately see those results?

Typically, for those kinds of things, they already have mechanisms: for example, MLflow combined with a bunch of other tooling allows them to explore the data, even visualize it and so on, and then there are SageMaker and the other data science studios. Our kind of work comes in once things stabilize a little bit, and we are talking about sustaining it over a very long time. The other angle here is that one of the side effects of building a feature engineering system is that you actually generate enriched data sets that enable these kinds of exploration as well; so we feed into the exploration tools, but we do not substitute for many of the existing tools.

In the same vein, aren't we reinventing data modeling under a different term? How does this differ from data lineage tools like Trifacta?

Trifacta is solving a slightly different problem, which has to do with, say, you are Fidelity and you have 20 different divisions and you want to know whether a customer appears in two different divisions; then you will use those kinds of tools, and there is Tamr and others as well. The way to think about this is that ETL has not gone away: we still have the old Informatica and the newer-age versions of it, whether it is dbt or Alteryx, all of these are there, and then there are these further systems in the same direction. I think the need for slightly different tooling comes about because it is very closely linked to the data science modeling. For example, if you did not have the need to evolve this continuously, and if you did not need audit trails, whether on the code or the data, some of this would go away. Those are really solid products, they move petabytes of data; this cannot substitute for them, it has to solve a different problem.

Okay, one last question there and then we wrap up. Go ahead.

First of all, can I just extend his question a bit? We also have the same setup: we have a data lake, and then we have Zeppelin, which is more like Jupyter for big data; we run our PySpark jobs or Scala-written Spark jobs, and then whatever data we get, we use Zeppelin notebooks to visualize on top of it. That is what we use internally. And my question is: what would you suggest as a tech stack for a CDC pipeline, for moving data from a traditional RDBMS to, let's say, your big data systems?
think the others here may have some perspective as well — there are like 200 options for this too; it depends on the use case and the kind of data you have. Yeah, so in our case the sources are generally RDBMSes, so the issue doesn't come in while reading the data; it's more about how you manage the real-time updates on the big data side. There are a few options — Kudu is there, but it has certain limitations — so on the big data side, what would you suggest as a storage engine that can support that? Okay, let me pause that question, because I think it's a very broad question and we've gone down the tools track a bit. What I would urge everyone to take away from these three talks is the mental model, the anatomy or the cognitive model around how these things are done. All three speakers have tried to avoid tool-specific discussion because they want you to go back and think about how to take these lessons that they have learned the hard way and build them into your own tools. So continue the discussion outside, because it sounds more tool-specific than what the speakers were intending. We'll stop now — a huge round of applause for our three speakers. Yeah, thank you so much. Okay, we begin the next two talks now. The first two talks in the morning were around algorithms and how people are applying them; we had a good discussion on data pipelines, data engineering, feature engineering; and the next set of two talks is about the harder things — the successes and the failures we face around it. The first is around the software, the open source tools that we all love, and the second is going to be about how to handle bad data. So I'm going to welcome Peter Wang from Anaconda, and I'd especially like to thank Anaconda Inc for sponsoring Peter's travel to the Fifth Elephant. Peter is going to talk about his experience and journey in building Anaconda and some of the tools that a lot of us use — at least I personally use Anaconda a lot. So welcome, Peter, and over to you; a round of applause for him. Okay, great, thank you, thank you very much. It's great to be here. I just flew in this morning, so I might not have as much energy as I normally do, and for that I apologize, but I'll try to keep it lively. I'm Peter Wang, I'm the founder of Anaconda and I'm currently the CTO, Chief Technology Officer, and today I'm going to take you through the journey we went through at Anaconda, both as a company and through this process of growing the data science community, growing the open source community around data science. Oh, I have to turn this on — there we go. So my background — my formal background is not in computer science but actually in physics, and one of my colleagues from undergrad is actually here; we had a lot of fun back at Cornell, and I was mucking around a lot with computers. I've been coding for a long time, but after I graduated I went into software development as a career. I only took three CS courses in college — I had a lot of fun with them, they were easier than the physics courses by far — but essentially what happened is, after a number of years doing very complex software work, I ended up doing consulting around the scientific Python stack in the mid to late 2000s at a company called Enthought and built a lot of the early tools there. And then around the 2010 time frame I realized that the big data movement — or revolution, some people were calling it — that the big data wave was going to
drive a dramatic need for better data analysis tools than what was around at the time Python was starting to get organically adopted in a lot of different places and so I saw there was an opportunity to really push this scientific and numerical Python stuff that we had built so Travis, my co-founder, and I have been building a lot of these things we saw an opportunity to take that and really drive it in business so there it is so our company's mission from the very beginning has been to build better tools to help organizations harness the power of data science and data analytics and really put that power in the hands of people who are subject matter experts or domain experts so we make software that the data scientists use but also very importantly we provide software that businesses can use to empower those data scientists so many of you may be familiar with Anaconda primarily as the software distribution how many of you know Anaconda for that open source right so great so this kind of we call this a architecture diagram and this architecture diagram shows you the ecosystem of things that happen to make the Anaconda distribution what it is and it's a very unique piece of software actually in that it is cross language cross platform and multiple architecture so I mean you can go and run Jupyter notebooks to write R or Python you can use Jupyter notebooks on a Mac or on Linux or Windows you can use it on ARM or on IBM mainframes or on X86 so it crosses a lot of different matrices this makes our build farm incredibly complex but it means that for all of you who use Anaconda that you can write something and take it from one machine and run it on another machine or other set of machines without worrying too much about that and one of the key pieces of technology inside Anaconda that makes that possible all right is a tool called Conda which you may have experienced or used as a packaging manager and what it really is is an entire sort of framework for building open source libraries or building code in general a building in a way that's cross platform multi language potentially and most importantly built in a way that is sandboxable that sort of self-contained so it can be installed without having system administrator level privileges and this is actually one of the reasons why Anaconda is very popular in the windows world because many of our business users have windows machines at work they want to use so many of the Python libraries but their machines are locked down so Anaconda can install into user space and you can still install a lot of additional stuff but the fact that we do support multiple languages and we can put these together in these virtual environments that makes deployment easier it makes reproducible data science easier it actually is a much simpler thing to use than Docker and things like that and it's one of the reasons that our enterprise product is also very popular so our I'm going to stand right here so I don't have to keep doing this like punching my laptop so many people know us for our work in open source but then I always get the question well so how do you make money and people assume that we do consulting or training and things like that and we do a little bit of it but primarily the way we make money is we sell our commercial platform which is a completely different set of bits than the software that you download and use on your machines as Anaconda distribution and what the enterprise platform does is it uses underlying technologies like Kubernetes and Docker to 
allow data scientists to work in a corporate environment that's relatively locked down and still have easy one-click deployments of Jupyter notebooks, of dashboards of machine learning models APIs and allows the IT folks to govern and audit all of that so it's a very it's a very powerful product and it's used, you know, at a lot of really big companies it's also used in some smaller companies but when businesses are serious about managing the software supply chain that feeds into their machine learning that's when they call us up about Anaconda Enterprise but today's talk is not really going to be about Anaconda Enterprise it's about the overall journey that we went through and many of you may already know this and I apologize if this is a repeat but I think a lot of people don't know this and how many of you here are practicing data scientists? Show of hands. Great. How many of you have been practicing data science for more than five years? Not that many, right? So the journey takes us back to seven years ago my co-founder Travis and I we had some visions around 2011 2012 timeframe and what we realized was that the reason why the scientific Python tools were getting so heavily adopted in the 2000s as a displacement for MATLAB was because scientists wanted to write code like they wanted to express their ideas in code they didn't want to have to write a spec they give to a Java developer to code poorly they wanted to write the code themselves so they could then go and fix it when it was broken and they could also build it quickly without into production build a thing that they could use to drive some lab instruments or build a model that goes in trades on Wall Street so Python was getting picked up for these places we were getting called by everyone from the US Army and the military to the most prestigious hedge funds in New York and so what we realized across the board was domain experts, subject matter experts they wanted to manifest their ideas in code and Python was the weapon of choice for them and another point that we realized was that there were two things happening right at the turn of this decade which was that the big data wave was starting to really crescendo and reach its almost peak and that was going to completely blow up traditional data warehousing it was going to completely blow up traditional business intelligence tools people were struggling with excels like million row limit people were struggling with very expensive old tools like from Tipco and whatnot and the second of all cloud computing was starting to hit and it was mostly being adopted for web development but we saw the potential for data analytics the fact that people could go and rent a machine or a hundred or a thousand machines and get like a mini supercomputer on demand to crunch through a lot of vector math that was going to be a game changer on top of the big data thing so what we did actually in 2012 was a year after the jupiter notebook had been actually at the time it was called the ipython notebook but within the first year of it being released we created an online notebook service called wakari that you could just click on it go to wakari.io and you would have a running ipython notebook and it just worked how many of you actually used that or heard of it anybody here a few people good some old timers the problem is we were way ahead of the market nobody could spell ipython ipython hadn't renamed themselves jupiter yet it was way ahead of its time and it was also people would use it for free and no one 
would pay for it, so we had to move away from that product and figure out something else. But we were on to something, in terms of realizing that being able to turn on cloud resources and tie them to interactive computation environments on top of large data sets — this is now familiar to everyone, right? — is a very useful thing. And then one other thing we realized, as we were doing our consulting work using Python, was that there were these different silos in business. Most of the big corporate environments we went to — and I have a very US-centric view of this, of course, but maybe it resonates with you all over here as well — had business analysts who were used to using things like Excel and BI tools like RapidMiner; we had developers, your traditional CS majors who were programmers — Java, .NET, C++, etc.; and in the middle there was a small but growing tier of folks doing things like predictive analytics using SAS, or writing some MATLAB code as a quant at a hedge fund. But more and more, people in the middle were starting to use Python to do a lot of those things, and the goal in the middle is not so much to just produce an insight or produce a report, but rather to build out machine infrastructure to produce lots of insights and lots of reports and actually do things with them. So we saw that not only was this middle tier of folks going to be growing, but additionally, Python as a language spanned all three of these silos: if you take a Jupyter notebook that an analyst can read, and maybe click on a few things and change a few things and have it still run, that same notebook you can give to a C++ or Java programmer and they would still kind of know what to do with it — maybe they're going to complain about whitespace or something — but it's going to be a lot easier for them than trying to understand a pile of R or MATLAB. And so we saw that Python, as a language that could cross all these different silos, was itself intrinsically going to be another unifying, connecting, and very important phenomenon. So we started a company, and this is sort of what it looked like when we first started out: that's my truck with some used furniture in it, that's my co-founder Travis, and then Bryan Van de Ven, who made the first commit to conda — he's also the creator of Bokeh, and he and I sort of created Bokeh together. This is our first Strata conference — Travis is pointing to our logo at our first booth there — and our first office. The company was called Continuum Analytics at the time, it wasn't called Anaconda, and there's a fun story behind that actually. So we created this company to push the use of Python for data science, because we really like Python, and one of the things we did was create this thing called PyData. I registered the domain name; the very first PyData workshop — Google and O'Reilly helped us put it together, and it was at Google's headquarters in Mountain View. It was a small workshop, 60 people, but it was quite a workshop, because you had Wes McKinney teaching how to do time series in pandas, you had Francesc Alted talking about optimizing NumPy, the late John Hunter talking about advanced Matplotlib, Jake VanderPlas teaching machine learning with scikit-learn, Fernando Pérez teaching IPython and Matplotlib, Travis teaching advanced NumPy. This was an amazing, cool event, and it was only like $50 to
register for it. It only happened once, but based on its success we started doing a lot more PyData conferences and things like that, and now NumFOCUS runs PyData events around the world. And at that PyData a very important thing happened, which is that Guido stopped by — Guido was still working at Google — and he came over and was talking to us. This photo is maybe from a different conversation, I think they're talking about something else there, but we actually have it on video where we — the scientific Python community — approached Guido and said: packaging is a huge pain in our butt, can we do something about this, can you convince Python core dev to take packaging of C++ and Fortran code seriously? And Guido was like, you know what: number one, I don't care much about packaging, that's why it's a second-class concern in the language, and second of all, these use cases are so exotic that you should build something that works for you all. And so we did, and so we built Anaconda — something is not going right there with the slides — because what we realized was that if we wanted to ship Python for big data, if we wanted to ship all these advanced tools like just-in-time compilers and all these other crazy things, web frameworks for visualization, we were going to have an even bigger packaging problem than we already had just trying to ship NumPy and SciPy. So we started the company to try to build all these great capabilities in the Python space, but to do any of that stuff — okay, I'm going to stand behind the podium here and do it this way — in order to do any of these things, we had to be able to get the software into people's hands, and so we wrote a package manager called conda. And I remember the moment I turned to Travis at that conference and we decided to create our own distribution of Python — because it was going to be nuts to try to get people to install this stuff otherwise — and because this is Python for big data, we should call it Anaconda. That was the real moment of inception, in March 2012, and by September we had created the first version of the distribution. We added Windows support, which has been useful for some people, I guess, and we had all sorts of other fun things that we did. And starting from that point, you can see that was an inflection point: it bent the adoption curve of Python, because prior to that point — even though Matplotlib was probably 7 or 8 years old and SciPy was 12 years old — people were all the time coming onto mailing lists and complaining: I can't install this, this compiler is blowing up, this is not working. With Anaconda it became easy for everyone to install all this stuff, and a lot of that just went away. So I think we really helped drive the adoption of Python, and to show you what that looks like now: this is our weekly unique engagement on our Anaconda repository — it's over a million unique IP addresses every week; on a monthly basis it's about 2.5 to 3.5 million. And to give you some sense of scale: how many of you have heard of Tableau? Yes, everyone. RapidMiner? Maybe some people — right, the data mining tool. The entire Tableau user base is something on the order of half a million users; the entire RapidMiner user base is on the order of 440,000 people — those are numbers from those companies themselves. In our trailing twelve months, the unique count for Anaconda — our repository, our download sites and everything else — is 16 million. So this is an absolute movement that you all are part of, it's growing, and open source is
at the heart of that. Now, packaging was not the only problem we had back in 2012. Number one, people complained about Python being very slow, and in fact some people were looking at Julia with great envy — the 'Ju' in Jupyter is Julia: Julia, Python, R. We didn't want people to have Julia envy, so we used LLVM and a couple of other hacks and created a just-in-time compiler called Numba, which is now becoming quite popular as a way of very easily accelerating your Python code without having to step into C++ or anything like that. Also, people were doing a lot of R-versus-Python comparisons back then, and there wasn't a way to do Shiny for Python — people said, I would walk away from R except for that — so we created a library that would satisfy that particular need. And then there was the question of scale-out: there was multiprocessing and there were other things you could do with MapReduce, but it was clear the world was going nuts over Hadoop and MapReduce, and we needed something in the Python space. It took us a few years to get there, but we created a framework called Dask, which is a much nicer, idiomatic way of doing scale-up and scale-out in Python. And these projects continue — they keep adding features, getting more stable, and getting wider adoption. With Numba you now have very nice GDB breakpoints from inside your Python functions, you can build dictionaries — basically write Python code with dictionaries — and it all gets compiled to extremely fast, tight machine code; and of course Numba has always been an excellent way to easily target your code to GPUs as well as CPUs. Dask has been getting great interop with YARN and Kubernetes; we're also adding support for GPUs so we can load-balance across them, and more and more libraries in the Python space, like scikit-learn and pandas, are keeping Dask in mind as a way of doing scale-out and parallelism. And then on this side we've created the next-gen set of things for dashboarding in Python — you can use any number of viz libraries and get interactive widgets and very nice dashboards, either standalone or within Jupyter notebooks — and we've added large-data visualization support. Through our work with some of the US Army research funding and NASA and whatnot — wow, can everyone hear me okay back there? is this okay in the back? — we've been building some really nice large-data-set visualization capabilities. I don't know how many of you have heard of Datashader, but we can actually render billions and billions of points in an interactive Jupyter notebook on a laptop, and these are some of the plots we've done with those capabilities. And we're still creating new ones: just last year we started a new project called Intake, which is a lightweight, Pythonic data catalog — it's not as heavyweight as something like Tamr, it's not meant to be like Immuta or Dremio, but it's something very simple that lets you run a little server that fronts a bunch of data, connect to that server remotely, and drive some computation on it. Simple, lightweight, Pythonic, and it's starting to get some nice grassroots adoption; I'd encourage everyone here to take a look at it if you have a chance.
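As a rough illustration of the two libraries just mentioned — this is a minimal sketch of mine, not from the talk; the function, array sizes, and chunk sizes are invented for illustration — Numba JIT-compiles plain Python loops down to fast machine code, while Dask exposes a NumPy-like, chunked API for scaling up and out:

```python
import numpy as np
from numba import njit      # conda/pip install numba
import dask.array as da     # conda/pip install dask

@njit
def pairwise_abs_sum(x):
    # Ordinary Python loops; Numba compiles this to machine code on first call.
    total = 0.0
    for i in range(x.shape[0]):
        for j in range(i + 1, x.shape[0]):
            total += abs(x[i] - x[j])
    return total

print(pairwise_abs_sum(np.random.rand(2_000)))

# Dask: the same array API as NumPy, but lazy and chunked, so the work can be
# spread across cores or a cluster and can exceed available memory.
x = da.random.random((50_000, 1_000), chunks=(5_000, 1_000))
print(x.mean(axis=0).compute())    # .compute() triggers the parallel run
```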
But in doing all this stuff — and it seems like a fairly linear path — we learned quite a few lessons. The growth of the company from 2012 to now happened during a time of great churn and great innovation in the data science and machine learning space. We saw big data come and go; then we saw the rise of data science, and even as people were questioning the validity of the term data science, machine learning was rising; even as people were googling what machine learning meant, we now have AI — so I don't even know what comes next, quantum blockchains I guess. The last 5-7 years have been layers upon layers of stuff, and I break it down like this: the first three years of the company we really spent in an open-source, consulting, bootstrap sort of mode. The industry was peaking on big data, data science was emerging, and our biggest competitor was just our own potential — crashing into the ground and making a big crater. We could have failed to push Python for data science; people could just have gone and done other things. We were also competing a lot with other things like Hadoop and Spark and R — Python was not an established force, and many people would question, to my face, my sanity in pushing something weird like Python for this stuff. By the time 2015 or 2016 rolled around, for those in the know that question was settled, but we ourselves as a company were transforming into a product-based company. So we took a fairly robust Series A and then we had to rejigger from our services, consulting and training model into orienting the product engineering group around building a commercial, enterprise-grade product. And now we've brought in a seasoned veteran executive as CEO, we've hired a fairly veteran management team, and together we're growing the company and trying to figure out, in a space with so much noise, how we can build well-engineered products that do what our customers need, still manage to keep on top of the hype cycle, and also stay true to the covenants we've made with our open source community. And through all of this — these are business challenges, there are technical and landscape and marketing challenges — the biggest challenge has always been about people: whether it was about the community at large and the data science personas we're trying to serve, whether it was about individual library authors and innovators in the community, or whether it was about our own internal employee culture. Throughout the entire journey the hardest thing has always been about people; even though we're a very deeply technical company, with plenty of technical and business challenges, it was always about people. One of my favorite quotes is not from The Little Prince but from the author of The Little Prince, in a different book actually, and he says there is but one veritable problem, and that is the problem of human relations. There are a lot of technical challenges, but throughout the entire time that I've been building the business, throughout the entire time I've been trying to build the PyData community, this just constantly rings true for me. And it's kind of a touchy-feely point — I can imagine some people in the audience are just like, well, okay, whatever, that's nice, but there's a ton of stuff going on and saying that we need to be nice to each other doesn't help any of these things. But in fact, I think the challenge we face in this space is that we're not just building random web services — we're actually building machine learning systems, we're doing a lot of deep, data-driven analysis, and we have to communicate these to sometimes less-than-savvy audiences,
and so our challenges, even though they may appear to be technical — when we ask those engineering questions like what is the right answer, what is the right problem, what is the right thing to do, which technologies are churn and which are progress — those of you who really want to do good work are probably asking yourselves: how do I actually engineer from principle, as opposed to just throwing something at a wall until it sticks? And there's this great quote, sort of paraphrasing Isaac Newton: scientists progress by standing on each other's shoulders, and computer scientists progress by standing on each other's toes. There's a whole lot of toe-stomping going on in the industry — although I guess if we go to quantum toes then we won't overlap, but there will be less squealing. But really, our industry, the computing industry in general, lives in what I call the long shadow of the 70s. So many of the things we still use were invented in the 70s — of course we use some things invented since then, but not that much, actually. If you're not careful, sometimes you'll just download binaries off the web — machine learning libraries or open source libraries — and they're compiled with the Pentium target, which dates from the 90s; they're compiled with a very baseline instruction set architecture, not optimized for any modern chip. A lot of you are probably in shops that use a ton of Java — Java is a very, very old language at this point — and the things that are really popular, still, like Excel: it's basically the same thing as VisiCalc. And any of you who type dir or mkdir into a Windows DOS prompt — you're using essentially a clone of CP/M from 1974. And certainly many of us use Unix and these Linux primitives; the Unix model, all of that stuff, dates from the early 70s. I don't know how many of you drive a 1970s car, but you use 1970s computer technology for the most part. Now, on the one hand this is really great — it means we invented some really solid abstractions; on the other hand it might mean we're really stuck. So in the face of this, how do we distinguish progress from churn? One thing I can suggest: I've been programming for quite some time, but throughout my entire tenure as a developer I've found very few books that I could consistently and constantly recommend, and The Pragmatic Programmer is a book that I recommend without reservation, all the time, always. If I could recommend one programming book to my children it would be this one. It lays out some of the basic design principles for building robust software systems. Now, for data scientists there's an additional set of concerns, but as the artifacts we build in the data science community become more and more like software and less and less like a bunch of random data scripts, we need to consider these things and keep them in mind — it's just an absolutely fantastic book. Another book that augments this one really well is called The Clean Coder, and The Clean Coder doesn't talk so much about design and programming technique; rather, it talks quite a bit about the personal development you must undergo — who you are as a person and the way you must look at your work — if you want to be a professional developer. I think this carries over to data science as well, and some of these things again feel like very touchy-feely kinds of things, but they are absolutely critical if you actually want to build good teams that can go and answer those questions
of what is progress versus what is churn. One of the really important things, at the end of the day, is that it's about continuous learning and it's about humility. In fact, the things that I've seen — whether in the Python community and the PyData community or within my own company — the things that held us back the most were when the team failed to have trust in each other, when egos got in the way. The smartest developers, the people who built things like conda and solved extremely hard build problems, could sometimes be incredibly stubborn and would resist talking to other people who were building other amazing things, and it was very frustrating for me, as a manager, as a founder, to put a team of really smart people together and have them just be at loggerheads with each other — obviously very smart people, both obviously mission-aligned, but in those cases, when they had that kind of conflict, neither party really reached inside and figured out how to be vulnerable, how to build trust with the other party, how to actually express humility in a way that could allow the team to progress together. So there are two things that I found you really need in order to have that vulnerability. You must have self-confidence — not arrogance, but actual self-confidence, the confidence to say 'I know this and I don't know that' — and then, of course, the group: you can only say 'I don't know that' if you're not going to get kicked out of the group for saying it, for showing weakness. And so one of the things that is incredibly important as a manager, as a team lead, is to facilitate creating that kind of supportive structure where people can openly talk about what they did right, what they did wrong, what they know, what they don't know. Once you have that, then you can actually progress and move forward. So one of the core company values we have at Anaconda is that we hire for people who have both ability and humility. Now, humility doesn't mean false modesty; humility simply means recognizing that you are probably not the smartest person in the room on everything — you are probably pretty smart about a lot of things, and your job therefore is to figure out what other people are smart about and what other people are not smart about, so you all can be smarter together. That's what humility is, but it takes a group, it takes a village, to build that culture of humility. And we have had a few exceptionally painful lessons when people failed to have humility: it cost us a year or more in terms of a product release cycle, it cost us many good people leaving, because people in leadership positions didn't have humility. This is a very painful and difficult thing to talk about, because a lot of good people put their heart and soul into this stuff, but at the end of the day the failings were very simple — I put them up on one slide. And when I was talking with some of the organizers here about what to talk about at the Fifth Elephant, and here in India in particular, a concept that was brought up was that in Indian culture there tends to be a tendency to want to be very credentialed, a tendency to really want to put up a facade — a very projected, very strong persona. And the danger and the cost of that, in settings such as a company or a technical team, is that everyone is then blind to your systemic risks. For data teams in particular, for data scientists in particular,
that is exceptionally dangerous because you all represent the brain of your organizations and we all know, we've all played with data you can find whatever you want in data you can tell whatever story you want to and if it's all about posturing it's all about making yourself proving that you're smarter than the other person you're going to end up really hurting your organization so on that note so what comes next so if we have done all these great things to kind of get data science in python to where it is now coming up next one of the things that I think is really really important to drive home is just how much hardware innovation will be coming down the pipe how many of you use GPUs in your work on a kind of regular basis anyone? not that many people how many of you are interested in using GPUs in your work a few more people how many of you would like to make tons more money doing deep learning alright there you go you're going to start using GPUs then how many of you are interested in IoT so now you're going to start running your machine learning models in things that are not x86 probably so this is just a quote from ARM but really I was shocked it was the last year I started taking stock of how many new kinds of chip architectures were coming out that we would have to support because again we have to build stuff to run on all these different platforms architectures so for me it's a bit of like a sobering thought as well but as of 2017 these were the chips that have been announced the kind of architectures we sort of figured we'd have to support people were talking about deep learning we knew we'd have to throw GPU plus CPU kind of workloads together but by the end of 2018 the number of chips that have been announced was way more and there's more you know this list just keeps growing and so what this means is that you know I put this up here just to sort of stress that many of the people in the IT space who have been doing software and hardware and infrastructure and system architecture maybe even doing this for a couple of decades you know what it's about throw a Linux kernel here put a network switch over there you kind of throw some data things around all of that is actually going to change quite dramatically and the innovation space that is going to be that we said at the intersection of that innovation space is going to grow by leaps and bounds in the coming years and there will be a tendency to say well gosh that's a lot to keep track of I mean look at all these chips I have to buy all of these to have to go qualify my algorithms all of these things no one's got time for all that so there's going to be this temptation to outsource the problem right oh here is this vendor they sell best in class yada yada pipeline we should buy that pipeline and the thing I would stress to all of you as decision makers influencers practitioners the thing I would recommend to all of you is to be very very clear with your management with your budget authorities that you cannot outsource innovation in the space if someone is selling you a predict button they're your competitor they're going to take your data and they're going to out compete you right and there's no big red solve button there's only a big red on me sort of button and generally don't want to pay for that the the thing that I think what we really have to do as an industry again going back to that concept of engineering on principle I think we need to move past the 1970s and 80s view of software infrastructure networking hardware we 
actually move to viewing the entirety of a computational system as an inference engine and looking at characterizing the qualities of that system managing it auditing it engineering it or debug ability and manageability that's a really really huge cognitive leap for a lot of businesses because they're so set in their ways now I know many people here work at startups younger companies that have a lot more flexibility but you're also under the gun to deliver much under much tighter timelines those of you who work at bigger companies start thinking about the start thinking about how can we build cross functional sort of inter interdisciplinary but across the different silos of management how can we actually start thinking about this overall thing is one unified holistic thing right now you know people talk a lot about model management and the the ugly truth is models consist of two things data plus software well most people are not really governing their software very well because they throw everything in a docker file and just pray that doesn't get hosed and then they're not governing their data very well because the data governance either it's good because it sits in a giant slow enterprise data warehouse or it's hardly governed hardly really you know managed because it's streaming through or you have a very ad hoc process with a bunch of S3 tables so there's a whole discipline that has to grow up around managing both of these in concert holistically and I think one of the takeaways here is the reason what's the difference between a machine learning model and a data science model how's that different than just regular software development this is a question I get asked quite a bit when I go and talk it you know in house at places and the way I characterize it is that traditional software development really puts data as an afterthought it has input to the system and no matter what values you get the correctness of the function is orthogonal to the values machine learning models are not that machine learning data science kinds of applications the correctness of the system is highly dependent on the values that go into it and that is a complete shift complete paradigm shift from how any of these things get managed in IT today and so two things I think that are going to happen in the next several years we're going to have to redefine the entire stack of governance management and audibility around these new computing systems the software development life cycle the application life cycle management sort of practices are going to have to evolve and lastly I think we're going to have to re democratize computing basically building the kinds of things we're building the natural end point of evolution for things like jupiter notebooks I think is actually serverless computing a new version perhaps of the python interpreter that doesn't require being grounded in a linux kernel running somewhere on somebody's cloud and thinking about how we can go beyond just python users writing a notebook but all of the excel users being able to write simple expressions inside code cells and actually going to a much broader use base for the python language that actually is an incredibly powerful concept and it sort of completes the full vision of what Travis and I were trying to do when we first started this whole thing about giving the people who understand the domain giving people who know how to ask the right questions giving them the power to answer those questions and of course all of this will require 
infrastructure on demand — if you don't have that, then you can't even play this kind of game. But I think these two things — number one, this wholesale redefinition of the entire stack of management, and number two, the democratization of computing around a roughly Python-centric and certainly very data-oriented set of workflows — are two big trends that are going to happen over the next five to ten years, without question. Thank you very much. Thank you, Peter. Okay, thank you Peter — we have a joint Q&A with you and Chris after his talk, so since we're nearly out of time we'll just hold the questions and the conversation, and after Chris's talk we'll get both of them to come and answer all your questions, so stay here. So we heard about the open source tooling, the open source business, the open source challenges — I took a lot away from that conversation. We have the next speaker, Chris from Simpl, and I think the other challenge when we work with data is around bad data, right? I visited a training session over the last two days, and even though I was talking about how to think using data, about data-driven decision making, I couldn't get the team to move because of bad data: we don't have data that is correct, we can't actually make decisions, because people don't believe in the data. And I think every one of us working in data has probably had those conversations and challenges. So Chris is going to take us through it, and I really like the talk title — the final stage of grief about bad data is acceptance — so hopefully we'll all be at acceptance at the end of it. Chris. Hello, hello — that works, are we good here? Okay, so while they change the battery: I'm Chris Stucchio, I'm the head of data at Simpl, we're a pay-later engine, and basically I'm going to give a somewhat more on-the-ground talk, useful for day-to-day people who open up Python and crunch data every day. So here's a question for the audience: how many of you are happy with your data set? Does it answer every question you want answered? Does it work for you? Could it be a lot better? Who's really happy with it? Yeah, I didn't expect to see very many hands on that. So I've gone through many stages of grief — I've tried constantly to get my data better, and eventually I just arrived at the idea that I'm going to accept that my data is bad and I'm going to see what I can do in spite of that. So this talk is not about improving your data: your labels are wrong, a lot of stuff is missing, and it's going to stay that way. If you can fix things at the pipeline stage you absolutely should do that, but in most cases you can't — very frequently it's due to things beyond your control. So this talk is about taking that into account and drawing correct inferences anyway. It is a bit pessimistic, but at the same time I do consider it realistic. The recipe I want to get across in this talk is: instead of treating your data as a reflection of current reality, and instead of treating it as something you want to use to predict the future all by itself, treat your data as an imperfect picture of what's actually out there in the world. So I'll talk about the present and the future. Normally speaking, the way I see most people working — myself included, as often as I can get away with it — is: I'll do some data cleaning, but apart from that I'll assume my data is correct. Let's say I got a credit score from a vendor — I'll assume that credit score is true, and I'll train a predictive model, hope
for the best, and if the ROC curve I see is good enough I'll push it to production. What I want to convey in this talk is that there's another approach, which you often have to fall back to when you get unfixably dirty data — say 10% of your data is just missing for some reason or another, or it's just wrong somehow. What I want to suggest here is: instead of treating your data as your input, treat your data as the product of a random process. Build a model for how your data went from reality to what is in your data frame, then build a model to predict current reality from your data, and then use the outputs of this model — which are a better reflection of the world outside your system — as the inputs to your final model, which is trying to predict the future. Most of this talk will be going through examples of this in a little bit of detail; treat it as a recipe you can follow yourself — none of these examples will apply to your problems exactly, but the ideas will hopefully suggest a way to work. So, missing data is a common problem, and here's a situation I found myself in. I was building a product that was doing what is essentially a funnel analysis: we have a whole bunch of visitors that land on a website, we ultimately want to take them to checkout, and we want to know the intermediate steps — how many will add a product to a cart, how many will click the purchase button, how many will actually enter their credit card, and how many will finally complete the purchase. It's called a funnel analysis because these diagrams, if you draw them, look like a funnel, getting narrower toward the bottom. So I started looking at my data, just diving into what was available. It was actually quite difficult to get this data — there was more than a month of delay, because I was first told, SSH into this server and you can access it. The server was actually running at 100% CPU and 100% disk; it turned out to be the production server. So then I asked them: could I just get access to the backups, why are you giving me access to production? And they're like: backups? Can't you just SSH in and do it there? So I'm like: do you mind if I take down production? — which obviously led to about a month of back and forth, and a month later I got actual backups. So this is, ultimately, the data that came out, oversimplified, and it looks pretty straightforward: I have visitor ABC, who entered his credit card at 12 and purchased at 1; I have visitor DEF, who entered his credit card but decided not to purchase; and then I have visitor JKL, who never entered his credit card but somehow managed to purchase — I mean, I would love to be able to do that on Amazon. So let me remind you why it took me a month to get this data: there was a production server running at 100% everything, there were no backups — so just going in, I felt like maybe engineering best practices weren't being followed with this system. So I inquired a little about the architecture: there's a JavaScript tracking pixel which tracks all these events and sends them to a single production server, which is collecting quite a lot of data per second, and for some reason it's being put into hundreds of SQLite databases. So there are easily hundreds of databases running per disk, which — for those of you who know how disks work: say it's a spinning disk, a spinning piece of metal, and each time you want to write to a different place it needs to spin, write, then spin somewhere else — is really not an optimal way to go. A traditional database will have one per disk at
most, and it will spin, write a big chunk, and essentially try to minimize disk seeks; but if you have hundreds of databases running on the same disk, none of that works. So I was just thinking: is it possible that this data collection system itself is the reason for this? Is it possible that this system — which I didn't write and have no control over — is maybe losing some data? So at this point what I want to do is construct a probabilistic model of data loss. Let's revisit our recipe. A naive way would be to just take the data I have and calculate a conversion rate, which would be the fraction of people who make a purchase: I take the purchase time, and if it's null I'll call that a False and if it's not null I'll call that a True — that's when a purchase happens — and now I have a boolean value which is either True or False; I take its mean and I get the fraction of Trues. This is pandas code — I'm sure it's slightly different in R, but that's the basic idea — and it will give me kind of an estimate of the conversion rate. The idea I want to describe instead is that I want to identify latent, hidden variables, which in this case are basically what actually happened in reality: how many people actually made a purchase, how many people actually entered their credit card, and, additionally, what fraction of data points is this server losing just due to all the odd things about its design. And then the final model is a very simple model — it's just measuring a conversion rate — but the point is that what's going into it is a statistical estimate of the number of purchases made, as opposed to just the count of purchases recorded by this tracking pixel. So what we want to do is model the generating process. The probability of a visitor entering the credit card is some number we don't necessarily know, but that's one of the things we want to find out. The probability of a purchase, conditional on a user entering a credit card, is another thing we want to know — we also don't know it, but these are the goals of the system. We also want to know the probability that an event will be observed given that it occurred: if a person enters their credit card, what are the odds this will be reported to us? And as we inferred before, that number is significantly less than one. The first two — the blue ones — are interesting to customers, and the red one is interesting to us, because it tells us the gap between what we built and what the system needs to be. So here's the data reported to us, summarized: we had one lakh unique visits, 40,000 cases with a credit card entered but no purchase, another 10,000 with a credit card entered and a purchase, and 5,000 with no credit card entered but still a purchase. What we want to do is back out what actually happened here, and this is a probabilistic model that comes more or less from basic arithmetic. The fraction that entered the credit card — we don't know it, but it's binomially distributed: there are one lakh visitors, so that's the n of the binomial distribution, and the probability is this unknown parameter, the fraction who are going to enter their credit card. The number we observed is again a binomially distributed variable, coming from the number who entered their credit card (which is unknown) and the observation probability (which is also unknown). So in the second line, the 40k is what we know, literally, in our data, and everything on the right-hand side is unknown, and then, as you work your way down, this is
basically ordinary probability theory, just with a whole bunch of unknown variables that we want to find out. So what we want to do from this is actually find out the conversion rate. This is a probabilistic model; it's based on some assumptions we're making about how the world works and, more importantly, about how the data comes to us. Now, PyMC is a great library for when you have some complex probabilistic model and you want to come up with estimates from it. If you notice, the code is almost just the model: you write down your probabilistic model; since it will typically be done in the Bayesian way, you need to come up with some priors — for instance, for the conversion rates I'm taking a uniform prior, all conversion rates are equally likely — and then the rest of it is just describing our mathematical model in terms of both the things we know and the things we don't. So PyMC is great if you want to fit models this way: press solve, it will do Markov chain Monte Carlo, and it will give you probabilistic estimates. And the end result — if you run PyMC it will actually give you a range, I'm just reporting the midpoints — is that with this statistical model, even though the conversion rate we can observe directly, just by counting, is 15%, the actual conversion rate is 16.7%, which is about 11% higher than 15%. This comes from inferring that our rate of data loss is 10%, which means we're actually having roughly 10% more conversions than are reflected in the data. So this is the key implicit modeling assumption: you enter a credit card before you purchase, and you can't skip this step. If this is false, the whole thing breaks down; but given this relationship in where our data comes from, the rest falls out from simple probability. So this is not going to be as simple as just feeding all your data into a deep neural network or into gradient boosting and expecting something good to come out — you really have to think about all the processes happening in the middle. The key point is to model how the data which is present came to be, and once you can come up with a better estimate of what's missing, you can draw useful conclusions.
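As a rough illustration of the kind of PyMC model being described here — this is a minimal sketch of my own, not the speaker's code; PyMC3-era syntax is assumed, and the observed counts below are invented so that the inferred numbers land near the 16.7% conversion and 10% loss quoted above, rather than being the talk's actual table — the idea is to make the observation probability a latent variable and lean on the "every purchaser entered a card first" assumption to identify it:

```python
import pymc3 as pm

n_visitors = 100_000           # one lakh visits, as in the talk
purchases_obs = 15_000         # purchases the tracking pixel actually logged
purchases_without_cc = 1_500   # logged purchases with no logged card entry (illustrative)

# naive counting estimate: purchases_obs / n_visitors = 0.15

with pm.Model():
    conv = pm.Uniform("conv", 0, 1)    # true conversion rate (unknown)
    p_obs = pm.Uniform("p_obs", 0, 1)  # chance an event survives the pipeline (unknown)

    # a purchase shows up in our data only if it happened AND it was observed
    pm.Binomial("n_purchases_obs", n=n_visitors, p=conv * p_obs,
                observed=purchases_obs)

    # key assumption: every real purchaser entered a card first, so a logged
    # purchase with no logged card entry means the card event was dropped
    pm.Binomial("n_missing_cc", n=purchases_obs, p=1 - p_obs,
                observed=purchases_without_cc)

    trace = pm.sample(2000, tune=1000)

print(trace["conv"].mean())   # ~0.167, versus the naive 0.15
print(trace["p_obs"].mean())  # ~0.90, i.e. roughly 10% of events are lost
```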
Here's another problem that happens quite frequently: data is just wrongly labeled, formats are different across different sources, and you have to bring a whole bunch of sources into one. So here's a problem I face at Simpl. This is my phone — it's a fancy Google Pixel. I'm mostly in Bangalore; sometimes I see my in-laws in Hyderabad. So typically speaking, Simpl should expect my purchases to be in these locations, and they should expect my purchases to be happening on this kind of phone. Suddenly Simpl gets a valid request to make a purchase from this device — whatever the heck it is, some weird-looking Nokia thing — in Jaipur. I went to Jaipur once, it was pretty, I probably won't go back. Does this seem right? Is it possible that maybe it's not actually me, that it's someone else who would like to make a purchase on my account? Most likely — so probably this transaction should be stopped: someone should call the actual registered phone number, confirm the purchase, ask 'have you given your OTP to anyone on WhatsApp', some probing questions of that nature. So, brilliant plan: if devices don't match the historically used devices, let's flag it as potential fraud, escalate to the fraud calling team, don't let the purchase happen, call the customer up, check if they've shared their password, check if they've given out one-time passwords, that kind of thing. So then I looked at my own data. This is my device history across several different Simpl merchants, and as you can see, that capital-G Google capital-P Pixel 2 XL is not equal to the other thing; there's also a version with a plus, which is some kind of URL encoding; all sorts of funny variations like this happen. Basically what's happened here is that these are all coming from different merchants who've implemented our system, and these merchants haven't all implemented it the same way — there might also be other steps of the pipeline somewhere between the merchant and me that change this stuff. So there are a lot of people involved in this process: there are the merchant partners, there's the business development team (usually the one communicating with the merchants), there's a product manager somewhere along the way who helps integrate merchants with Simpl, and there's the engineering team — the engineering team actually does this almost perfectly. So getting this fixed would be a team effort, and I don't really believe that teamwork makes the dream work. Our merchants — you can go outside to our booth and see who they are — mostly don't care that much about stopping a small number of fraudulent purchases on our platform: a fraudulent purchase on their platform is still a purchase, they get paid, we're out the money. Our business team really wants to close the next big sale; it's very reasonable for this not to be their top priority. So getting something like this fixed is not going to happen quickly and quite likely may never happen, and I'm just going to accept that. Instead, I'm going to try to solve this problem with linear algebra, which I actually enjoy a lot more than trying to wrangle a whole bunch of people into finally feeding me good data. So — sorry, this is my phone — I'm going to start with the data. The strings I have in my data are a reflection of reality, but they are not actually reality, so I'm going to model things this way: the data is something that comes from reality but is an imperfect reflection of it. My data set is, of course, going to be exactly what you see here: a user ID, the specific partner merchant where the device was used, and the reported device string — this is a function of reality, but it's not true reality. Since I like to solve problems with linear algebra — I've actually been doing this since the Numeric days, which Peter and a couple of others might remember — I'm going to set up a matrix. My columns are going to be reported device strings, and each row will be a user. So that first row means that particular user showed up with that Google Pixel 2 XL, capitalized as you see, on merchant A; they did not show up with an iPhone X on A; on merchant B they showed up with a Google Pixel 2, the XL missing; on merchant C they never showed up; and on merchant B they also did not show up with an iPhone 10. And actually the second person didn't show up on merchant B at all, but did show up on C. Those are all reports. And that last row — there's something a little funny there: this person showed up with a Pixel 2 XL on A and an iPhone on A. So this matrix is highly incomplete: most users don't use most merchants, so it's a matrix completion problem, and we have to infer what the NaNs in the matrix are. The other observation we can make is that this matrix is low rank: each device generates a lot of different device strings, but the rank of this matrix should essentially be
the number of devices that are actually out there in the world. So essentially, if we go back to this matrix, the Google Pixel shows up as a 1 in three different columns of this oversimplified matrix: 1 0 1 1 0 is what the Pixel generates, 0 1 0 0 1 is what the iPhone generates, and these are the patterns that should essentially be repeated. So this matrix is essentially two different linearly independent vectors repeated a bunch of times, and its rank should actually be the number of devices, not the number of device strings, which is much larger. So we have a matrix completion problem, and it's very low rank. If this seems familiar, those of you who work on recommendation engines may have seen it, because that's how a lot of recommendation engines work. So can we take techniques from recommendations and use them to deal with our data problems? The answer is yes. We know our matrix should look like a low-rank approximation: we have a user vector tensor product with a device vector — that essentially represents all the people who are using the same device — and then there's an error term, which comes from those weird fraud cases, or from a person who uses two phones, and those should be uncommon. So as I said, this is kind of like a topic model: the tensor products are our topics, and the error term — which in our case is the attacks — is directly analogous to, say, a person whose major topic is violent action movies but who for some reason also likes the movie Notting Hill; just one weird one-off case, which doesn't say he's in general interested in romantic comedies — he's generally a violent-action-movie guy, but he likes that one movie for some idiosyncratic reason. The user is a document, and an attacker is a document that fits into multiple topics. Now remember, this matrix is very much non-square — there are a lot more users than there are devices — so if we want a nice symmetric matrix, we can just compute M-adjoint-M (M transpose M, since everything here is real). If we do some simple arithmetic on the ij-th component, we discover that it is the number of users who were seen having device string i and also device string j simultaneously. And as per our model before — this is not strictly true, it's stylistically true, there's a bit more arithmetic that I'm suppressing because I don't want this to take too long — essentially this matrix can be thought of as a bunch of topics which have size n, where n is roughly the number of people with a given device (there are a lot of people out there with a Pixel, a lot of people with an iPhone), plus error terms on the order of size one, which are basically: how many people use a Pixel and also a strange little Nokia thing, how many people use an iPhone X and also a Huawei 2600. The number of people who simultaneously use two such devices will be much smaller than the number of people who use strictly one. By the way, what I've done here is take all the NaNs and fill them in with zeros. So what this means is that I should have a sum of symmetric matrices, and each dj-tensor-dj term has dj as an eigenvector by definition, with size n; I also have a bunch of weird one-off terms that I'm not interested in, which are basically going to have a one or a two in them. So I can take this matrix and threshold it: I can find some cutoff between the typical number of people who have
If I do that, I can just find the eigenvectors of M-adjoint-M, and after the thresholding these vectors are going to be the right singular vectors of M. Essentially there will be non-zero entries in certain places, and those non-zero entries mean that these device strings correspond to the same device, while everything else is unrelated. Each d_j is going to be zero, non-zero, non-zero, a bunch of zeros, some more non-zeros, and all the non-zeros are the same device. Then, when we put this into production, for any user we map their device string to whichever vector has it as a non-zero entry, and if we see another device string for that user that is not among the non-zeros of that vector, that is potential fraud and worth flagging. So how do we know this works? First, we have some string patterns; we should always start with the simple stuff, and we did. For instance, we know that pluses are URL encodings of spaces, so if we do that replacement and then check our data, we know that Google+Pixel should be the same as Google Pixel, and this matrix technique, without doing that as pre-processing first, reproduces those results. It also tells us things like HMD Global is Nokia, which you can't really get from any string, and Huawei is Honor, and a few other things of that nature which, once I Googled them, turned out to be just different names for the same manufacturer. And, as you would expect, users with multiple devices are actually pretty rare, which was an implicit assumption that this model reproduces. So this is a reasonable security feature. It's hard for scammers to exploit, because the only place they could exploit it is with devices that are rare, and you have to find a user with a rare device in order to exploit them, which dramatically reduces the pool of users you might want to attack. And keep in mind, a lot of our attacks are some guy on WhatsApp who took our logo and made it his profile picture, and he's like, hey sir, I'm from Simpl support, in order to avoid adversely affecting your credit I'm going to send you an OTP, send it back to me. The user doesn't notice the "here's your OTP, do not share it with anyone", ignores that second bit, and then later they're like, hey, who made this purchase? So now the pool of users who could be attacked this way has dramatically shrunk, and this guy on WhatsApp suddenly has to know you have a rare phone to even get away with it. Ultimately this is a reasonable way to do it. Another use case is the following: you could imagine doing some string matching, maybe some dictionaries, to match up these devices. But what if the data were hashed to preserve customer privacy? In that case we would have no way at all to do string matching, because we only have a hash of "Google+Pixel" and a hash of "Google Pixel"; we couldn't do string matching on the hashes anymore, but a technique like this will still work. These are examples where data is messed up because of our own pipeline; in principle, if we devoted enough engineering effort to fixing them, they could be resolved. But now let's move on to places where your data is messed up and there's nothing even the best engineer in the world, the best team, could do about it. This is a fairly pervasive
problem. I send an email today and I'm interested in its conversion rate: the number of people who open the email, click a link in it, and then ideally later make a purchase on some website, use a coupon code, something of that nature. I sent this email out just now, I checked my stats, and in the past 12 seconds nobody read it. Should I conclude that my subject line sucks? That would be a little unreasonable, because how many of you checked your email in the past 12 seconds? Probably not a lot. On the other hand, if 30 or 40 days later nobody has read it, pretty much nobody ever will. This is a delayed-reaction problem: if I measure too early I get a biased measurement of whether an event will occur, and if I measure late I'm usually a day late and a dollar short, because sometimes you have to take action sooner. This is really pervasive. We issue credit, and the metric we're interested in is 30-day delinquency; that's a great metric that more or less tracks our business, and it takes us 30 days to discover it. So in May I ran an A/B test and gathered some data, and July 10th is when I actually had the 30-day delinquency data to measure it. Similarly, when you're trading stocks, you buy and then you have to wait to see whether the price went up or down, and you might have to make decisions in the meantime. Here's a really concrete example. You want to measure a conversion rate, and I keep coming back to conversion rates because the actual predictive problem at the end of the day is really simple; the hard part is dealing with the dirty data, the model itself is easy. Let's look at the basics without delay. If you have a conversion rate gamma, then by Bayes' rule, when a conversion occurs you take your prior and multiply it by gamma, and when a conversion does not occur you multiply your prior by one minus gamma instead. So if you had n visits and k conversions, you can do this recursively: take whatever your prior is, say a flat function equal to one on the interval zero to one, and multiply it by gamma to the k times one minus gamma to the n minus k. If you do this, what you get is an estimate of your conversion rate, in this case about 0.15%, with a margin of error that comes out very cleanly from the posterior distribution. This works great if there are no delays, but again, if you do this 20 seconds after sending an email, it's not telling you much. So what we want to do now is add time into our modelling process and model how many people would have read this email by now, if they are ever going to. Our data set, instead of just being a count of emails sent and a count of conversions, is now a set of times: the time between when each email was sent and the present, and also the time until you actually opened or clicked or did whatever the conversion event is. This is what I was describing before: if you wait a very long time you have great information, if you wait a short time you have bad information, and what to do in between is the question. The solution to this problem is something called a survival model, and there was a talk about this earlier today, which I hope at least some of you attended. I'm not going to go into how to fit it, but the output of the survival model describes, assuming the event will eventually be observed, the probability that it is observed before a certain time t.
In that earlier talk I think they called it S of t, the probability of observing the event after time t; what I want here is just one minus that. So the survival curve looks something like this: according to this one, by around eight days there's something like a 95% chance the event has been observed, while at two days it's only about 0.4. If we go through the same Bayesian process as before, we discover that the probability of a click before time t, given a conversion rate, is not just the true conversion rate; it's the true conversion rate times the probability of the click being observed before t. And for no click by time t, the likelihood is one minus that same product of the conversion rate and the curve. If you do the same recursion you get a slightly more complicated function that is still not that hard to implement in Python, and if you use it you can get an unbiased, albeit more uncertain, estimate of the conversion rate. In this example with simulated data, the true conversion rate was 25%. The green curve, which is fairly narrow, is what you get if you don't correct for delay bias and just use the simple Bayesian formula, and it's way off. If you use the survival curve to correct for the bias, you get a correct estimate, albeit with more uncertainty, and as we wait longer the estimates of course converge to each other; this is the same simulation run with more time between, say, sending the email and measuring the conversion rate. The key point is that the data we have early on is only a partial reflection of what the final conversion rate will be. There are two important factors, how good the email is and how much time has passed, and if we ignore the way our data is generated, which involves time, we're not going to have the most accurate estimates; but by incorporating time we can actually correct for it.
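Since the speaker notes the corrected recursion is not hard to implement in Python, here is a rough sketch of it on a grid of candidate conversion rates. The function F(t), the probability that a conversion which is ever going to happen has been observed within time t, stands in for the fitted survival model; the exponential curve, the sample size and the 25% true rate are invented purely to make the example runnable.

```python
import numpy as np

def F(t, mean_delay_days=2.0):
    # Stand-in for the survival model: P(conversion observed within t | it will convert).
    # A real system would plug in the fitted survival curve here.
    return 1.0 - np.exp(-t / mean_delay_days)

def conversion_posterior(elapsed_days, converted, grid_size=2000):
    """Posterior over the true conversion rate gamma, corrected for delay.

    elapsed_days[i] -- time since email i was sent
    converted[i]    -- True if a conversion has been observed for email i so far
    """
    gamma = np.linspace(1e-6, 1 - 1e-6, grid_size)
    log_post = np.zeros_like(gamma)               # flat prior on [0, 1]
    for t, c in zip(elapsed_days, converted):
        p_seen = gamma * F(t)                     # P(conversion observed by t | gamma)
        log_post += np.log(p_seen) if c else np.log1p(-p_seen)
    post = np.exp(log_post - log_post.max())
    return gamma, post / np.trapz(post, gamma)

# Tiny simulation: 1000 emails sent at random times in the last 5 days, true rate 25%.
rng = np.random.default_rng(0)
t = rng.uniform(0, 5, size=1000)
will_convert = rng.random(1000) < 0.25
delay = rng.exponential(2.0, size=1000)
observed = will_convert & (delay < t)             # only conversions that already happened
gamma, post = conversion_posterior(t, observed)
print("posterior mean conversion rate:", np.trapz(gamma * post, gamma))
```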
I'm going to close this talk with an open problem that I'm constantly thinking about and never have a good solution for; every solution I have involves spending a lot of money. You want to build a selection algorithm, and the training data comes from the people or ads or whatever you've already selected. You want to estimate the conversion rate of an advertisement, but the click-through data you have comes from the ads you display, not from the ads you don't display. When we issue loans, we get performance data from the people we lend money to and no performance data from the people we reject. The same thing happens in education: if you reject a student, you don't know how they would have performed. So here's a really simple model: we have some regressor and we have performance. Think of the x-axis as a credit score and the y-axis as the fraction of people who repay. We've trained this model, which is the black diagonal line, and it seems pretty accurate, so we only want to approve the good people, the red folks up here. Since we're interested in loan performance, we only issue loans to the people in the red, we get high performance, and this goes great. Then we retrain our model, we get a new curve, and suddenly it says that a higher credit score is worse. That's not a thing I want to put into production, but it is a thing that will happen if you're not careful: just by chopping off our data, we've turned a positive correlation into a negative one. Every so often on Twitter or in the news you see some study that finds something weird and counterintuitive. Did you know that height is irrelevant to the performance of basketball players? It turns out that in the American NBA, people who are 6 feet tall and people who are 7 feet tall are equally good, no correlation. Now, you'll notice that there's basically no one in the NBA who is below 6 feet, but when this gets reported, nobody mentions that; this is exactly the selection bias. What's happened is that there are really tall people who are maybe not as athletic, and they get into the NBA because height helps; on the flip side, there are really athletic people who are not quite as tall but jump higher. Both of these groups get into the NBA, and the correlation reverses itself. A recent one went around about a year ago, and everybody loved it: it said the GRE is useless in predicting the success of people who get into grad school, and a whole bunch of editorials were written suggesting we stop using the GRE for grad school because it doesn't work. It turns out that basically nobody with a really bad GRE actually gets in, and therefore they're excluded from the data when they are the generators of the correlation. So this thing happens almost everywhere. It was first studied in econometrics; Heckman received a Nobel Prize for this, George Bourgeois was one of the first people to actually do anything with it, and there's something called the Heckman correction which lets you mitigate this issue for linear regression, which is what econometrics almost exclusively uses. Fantastic methodology. Here's the open problem: I would love it if anyone has ideas on how to do this in gradient boosting machines, which is what I use, so I'd be most interested there, but even how you would do it in a neural network. What Heckman did is almost exclusively based on linear algebra and normal distributions, so it doesn't directly translate, but how can we solve this problem? The way I solve it is that when I issue loans, I issue loans to a bunch of people I know are not going to pay, just so that I get their training data, and as you might expect this costs my company a bunch of money. We have to do it because we have no better way forward, but if we could solve this problem it would be valuable to everyone in the world. If you have ideas on this, come find me at the Simpl booth; I would love to make progress on it. Anyway, that's where I'll leave it. I'll just say that all of our data sucks and we just need to learn to deal with it. Okay, thank you Chris, and can we have Peter back as well; we're going to have a joint Q&A with them now, so those who have questions, we have a few mics. People with questions for Chris or Peter? I think they were both lovely talks. So we have one question there. Yeah, so my question is for Peter. I have immense respect for Anaconda, thank you for building it; it has helped a lot of people and reduced the friction for a lot of people getting into this field. One of my questions is: what events happened along your journey, while you were getting Python into people's hands with your product, that accelerated your growth? Did something specific happen that distributed your platform to a lot more people, maybe a talk you gave somewhere after which a lot of people started talking about your platform? That's the first question. Next, at the end of your presentation you talked about serverless deployments, so let's say I have a TensorFlow Lite model which I want to deploy on IoT devices, where does Anaconda come into that whole picture? Wow,
that's really loud okay the first I think the there's been a pretty steady organic growth for the usage of the Anaconda open source stuff and really there hasn't been any one thing that dramatically accelerated its uptake I think if one looks at the data there's always been a seasonality in synchronicity with the school years and that always said to us that data science has been brought in universities and especially when we look across India, China in particular almost more so than in the west there's that seasonality shows up so there hasn't been any single one thing but flipping that around if there is this latent sort of exponential growth of adoption latent in the space how do you not mess it up and that can also be part of the hidden thing that you don't see but there have been many things we've had to do to keep ourselves from breaking ourselves into jail and I'll give you two very interesting examples one of them is when Wes McKinney Justice Condo was really starting across the threshold for adoption by the core sort of PI data site by developers and it's not one homogeneous mass they all have different viewpoints Wes is the author of Pandas and he's a friend of mine but he wrote a blog post about how we really need to there's a fedora moment for conda as a community tool and you need to create essentially we need to somehow create a thing independent of anaconda that we can sort of rally around just like fedora is essentially red hat stuff without the red hat corporation control and it was an interesting blog post and I could see his points there but the community also recognized how much we were doing to support the community so just as a FYI our CDN and our servers we push out almost a petabyte of traffic every month that we pay for for everyone to use and that's not counting like university mirrors in china and other places and so when Wes advocated for that we worked with him and we also worked with other people who had an interest in buildings in the infrastructure that now eventually has become conda forge so we've added special things for conda forge into anaconda.org and that helped us we embraced the community and served their needs even though it was creating something that could potentially be a fork of what we do and embracing that we created a better healthier relationship and build more trust if we had said no no we got to shut this down and did all this kind of like closing things down that would have been even worse right so it was one thing and the second thing that really was important thing for us was when the they're now more and more commercial platforms that are trying to do the same thing that we do with our commercial platform and they use anaconda as an open source tool inside it and a few executives that we had hired in were very much of the strong opinion that we needed to basically put strong legal language in place to shut them down completely because we were sort of beating us them in sales and then the CEO myself we came out on the other side that we said no we don't want to do that we're going to basically out innovate our competition as opposed to try to dig a moat and if we had dug a moat and we put that stuff in there there's a chance we could have turned the community against us right so in both cases it was a community embracing innovation embracing adoption as opposed to trying to fight it and do it on its own terms oh yeah so the second question really quick we're not doing anything in particular support serverless just yet but it's something 
very, very near and dear to our hearts. The single greatest value of the Anaconda tool chain to serverless is that most serverless platforms, if you look at Fargate on AWS for instance, or at their, what do they call them, Firecracker VMs, only allow a small amount of storage to go along with each of these VMs, because it takes so long to move these things around and spin them up. So the problem is that just getting the basic data science toolset you need can almost use up the entire disk quota. What we're hoping to do, and we haven't done anything yet, but what I'm thinking about architecturally, is improving these things so that we can have an extremely tight, crunched-down set of binaries that are variants of the existing conda packages, which would then facilitate very fast serverless VM spin-up. That's something we could do. Yeah, there's a question there. So, a question for Chris: within your field, credit scoring and related areas, have you had a chance to apply causal analysis, or are we still restricted to looking at correlations? So, causal analysis is of course tricky. In principle, every time we do an A/B test, and I probably have one or two running at any given time, an A/B test is of course a form of causal analysis. But otherwise, to do causal analysis you'll typically need some kind of pseudo-random event that can be used as an instrument. Sometimes these come along, and when they do I try to use them, but I haven't gone too deep into them, because when I get them I use them, but it's hard to rely on that, and it's actually pretty easy in most cases where I work to just run an A/B test and get a very clean experiment where you don't have to do any causal analysis. Yeah, so I have two quick questions. One is, with Python 2 going end of life in 2020, how do you handle this, as good news or bad news? Because there will be a lot of throwaway code, and maybe the future will be simpler. The other thing is, there are so many libraries, like TensorFlow, maybe 2.0 or the older 1.x versions, across different OSes like Windows, Linux or Mac; how do you try to maintain the availability of versions? Because if I pull a friend's code and it is not working on my machine, even with an environment.yml, it is typically a bit frustrating. Is the second part of that question about how we maintain the old builds, is that the question? Okay, so, sorry, I should first of all thank you guys; I use Anaconda for three of my personal projects and found it to be very good. I initially started with pip and then I found Anaconda, and it was really awesome, so thank you so much for that. First things first, my first question was about Python 2 going away in 2020, and I got the first part, but the second one, the TensorFlow one: there are so many different libraries, maybe someone builds a BERT library or something new, or something old goes away, and I just want to pull the code and try it, but maybe the initial testing was done on Mac and I want to do it on Windows or Linux, and maybe that specific version is not available, or the library itself is not available. So what is the company doing for that, or how do we handle those things? So, two questions, yes. The first one, Python 2: unfortunately, one of the odd things that happens is that people think Anaconda is responsible for everything in Anaconda, so when pip breaks inside
Anaconda, it's our fault; when conda breaks somebody's virtualenv, it's our fault; when someone pip-installs TensorFlow and Google has violated the manylinux standard and then they try to install something from Anaconda, it's our fault. And of course we also get the benefit: people think we invented pandas, people think we invented Jupyter. So there's a lot of confusion out there, and this Python 2 thing is definitely one of those things where a lot of people look to us because we are a well-known vendor in this space, and I hate to tell you, we're not doing anything about it. Formally, we're not doing anything about it. I think I saw ActiveState announce that you could buy formal Python 2 support from them; I don't know how they're going to staff it or how they're going to make a profit on it, but they're planning to maintain Python 2 packages for people, and that job is going to get harder and harder as time goes on, but they're claiming they're going to do it. What I do want to do, personally, as a member of the Python community, is push for creating new structures that can more sustainably fund paid development, that would allow companies to buy in and amortize the cost of maintaining these old things. It's kind of what ActiveState is proposing to do, but in a way that's more of an open-source community corporation kind of thing. But Anaconda as a company, we're not doing anything about it; we will continue to build Python 2 packages as they're released from upstream, but we ourselves are not really selling Python 2 support or anything like that. Now, it could also be that come next year the whole world's on fire and we need to do something about it, and then we may step in and do something. The second thing, about TensorFlow or something like that where packages are available for Mac but not on Linux or Windows: in the conda-forge world, if it's one of the packages and recipes from conda-forge, you can get on there and see if you can get other people interested in adapting the recipes to build on other platforms. And for our customers who buy our enterprise platform, they can buy support hours from us, or small services contracts, where we will do package customization and builds for them; that's absolutely something we do on top of our platform. In fact, we'll be rolling out enterprise repository features where particularly difficult packages, ones that are a little bit weird and maybe not everyone wants, if customers are paying us for them, we will build those and put them in a special customer-only enterprise repository that they have access to. So those are the kinds of things we do for that problem. Question there? Yeah, so I have two questions for Peter. The big data industry has seen technologies like Spark and Hadoop being used for a very long time, and now newer, arguably better implementations like Dask are coming out. What have you seen in the market, how much have developers actually been adopting Dask and moving off Spark? That was the first question, and the second question is basically how Anaconda is tying into the Docker and Kubernetes orchestration platforms in the enterprise; I think the enterprise version of Anaconda brings that, so if you could give us an insight into how it is being used on those platforms. Yeah, so regarding Dask,
so I actually I know some of the creators of spark and we met them back in 2013 timeframe as spark was first getting released, Databricks was being created as a company and it hasn't been around that long right it's actually Databricks kind of came in right on top of the big data wave as Hadoop was kind of waning and it was it's initial selling point this is it's just a lot nicer than MapReduce right like writing a bunch of Java MapReduce or only being able to use Hive was really tough compared to all the things you could do at spark especially with Databricks notebook and so what the narrative I see and this is not universal but what I've seen is that the creators of spark really had a vision you know when they were at Berkeley at the amp lab right they had a vision to create an entire distributed computational system that was very fluid I could do anything you wanted to do that would tie graph analytics machine learning data science visualization data data engineering all these into one coherent platform and of course it all ran the JVM and it was all Scala and it really didn't talk to our Python right and that's really challenging for them because as they became successful the niche they sort of evolved into was really around data engineering and it's really nice data exploration tool on top of spark top of Hadoop and yarn environment it's a very nice you know their pipeline tool in the commercial offering is very nice but it always had this like boiling the ocean reinventing everything problem and it treated Python and R as second-class citizens right rather than trying to play better with pandas they created their own data frames right so that so spark created its own sort of friction with the broader ecosystem adoption and so what I saw and I can't speak universally what I saw is a lot of like expert users of spark they were using PySpark because any time they ran into a problem that was bigger than pandas and bigger than RAM they said you know the multiprocessing was too hard it was just googling Python big data right and you would get PySpark and they would just have to use PySpark once they had a different option once we had DASC out there a lot of those early adopters started using DASC because it was so much nicer and it was so easy to scale up and scale down that scaling down thing is really important because if you've tried to use spark on a single machine it's kind of a clunky thing right DASC no problem single machine multiple machines big machine small machine GPU no GPU it all just kind of works but it is younger than spark right it's not as battle tested in some in some senses as the spark data pipeline stuff but it's definitely getting a lot of mind-share attraction among the core like sci-pi data folks the second thing regarding containers and Kubernetes our enterprise platform is built on the stuff and so we basically sit on OpenShift right now our current enterprise platform we have our own Kubernetes that we bring our next generation one sits on OpenShift it installs the cloud installs on prem we use Docker in a lot of places inside it but the whole point is to hide that complexity from data scientists so we see data scientists spending a lot of time trying to like play and learn all this technology and it's really hard Kubernetes is not really a ready to ready to made solution for end users it's a tool for infrastructure people to go and build cluster computing environments right so that's what enterprise platform does is try to kind of paper on top of all that or 
paper over all that stuff. I'll just add one other comment, sorry. Dask, in addition to what Peter said, is also in my experience significantly more performant than Spark, and part of the reason is that to go from Spark to PySpark you have a Python process running and a Java process running, and everything needs to be serialized, written to Python, deserialized, something done with it, re-serialized, sent back to Spark and deserialized again, and that serialization and deserialization usually costs a lot more than the work you actually want to do. At Simpl we've switched from Spark to Dask, and as soon as we can clear out the last of that legacy code, Spark is gone, never to return, and we've seen significant performance improvements even though we're doing a lot more work, just because that serialization overhead is pretty high compared to most of what we do. And also, in terms of forward-looking things, thank you very much Chris, I have a high five right here, gotta get the Dask love: moving forward, all of our technology investments at Anaconda and all the things we push in the ecosystem are around Dask, right? So there's a lot of effort on Dask and Arrow, a lot of collaboration around Dask and pandas, Dask built into scikit-learn and joblib as default parallelization in there; all these things are starting to converge together. And as we work with industry partners, and with people like DARPA, to push on next-gen graph computing, guess what, all of that sits on Dask; NVIDIA's GPU data analytics stuff all sits on Dask. So it's really a thing, for sure. Okay, one last question. My question was for Chris. So basically, when you're designing a system with payments and you're not asking for an OTP and directly proceed with the payment, what happens if there is any network phishing and some users have their details taken from the network; is it a secure transaction, or are there security concerns there? And there was one more question regarding the fraud detection model you are building: there would be some cases of false positives, where users who are not actually making any fraudulent transactions are flagged by the system as fraudulent, so how do you handle those cases, and how robust is the model at detecting any phishing that is happening? Okay, so I didn't fully understand the first question, but I'll answer the second one since it is much more straightforward. Every system has a false positive rate and a true positive rate, as well as a false negative rate. It's not like we block your account permanently and you can never transact again if we detect anything shady; it's much more that you get a transaction failure and a quick call, hey sir, we noticed some activity that seemed unusual, was that you, and if you confirm, the person who calls you will say okay, go ahead, just retry the transaction, it should be fine, and that handles both things. If you think about it, this is how credit cards work as well: whenever I use my American credit card while I'm in India, about once a month, when I try to take out money or something, it'll say your credit card is flagged for fraud, I call them up, it's unflagged, I get my money. It's kind of annoying, but it's not that frequent, and I know why they're doing it, and I'm glad they are, because I don't want it stolen and all the money taken out. The key thing to do is
to balance the rate of false positives, which are bad customer interactions, against the risk to the customer, which is that someone made transactions on your account, someone stole your identity, and now you have to handle it. So basically what we have to do is carefully track the false positives, carefully track the false negatives, and balance these against each other, and that's how we decide to set our thresholds. Essentially, the key part is that you want to fix a conversion rate: how many customers annoyed due to a transaction failure and a fraud call is equal to one customer having fraudulent transactions on their account, getting upset, and having to call up. Once you choose that rate, everything else falls out of the arithmetic. Do you want to repeat the question you asked about... I think we can hold on to that, we are running into the break. A huge round of applause for Peter and Chris. I think there were a lot of technical questions, but the two things I will take away are humility, from that one talk, and obviously that the final stage of grief is acceptance; for me those are the two takeaways. Thank you so much, and we will be back at 16:45 for the last two talks of the day, please come back, thank you. For the talks today, we kick-started with a few algorithm talks, the travelling salesman problem, debt collection survival models; we had a few sessions on data pipelines, which were largely about mutable data, how do you collect the data, how do you manage the system that collects it, how do you do feature engineering; and in the last two sessions we talked about how you manage the software around it, which Anaconda does, and how you manage bad data. The last two sessions we have now are going to talk about, at least in my head, immutable pipelines: how do you think about data that changes as your company evolves, how do you manage that, and how do you manage it on a real-time basis. To kick things off we have Agam from Zapr, who is going to talk about contracts and schema evolution in data pipelines. Over to you, Agam. Hello, hi everyone, I hope you had a good long day with a lot of takeaways. What I am going to talk about is how you build complex data pipelines which are evolutionary in nature, with the help of contracts. That is too many words, so let me break it down: what does evolutionary in nature mean? What I mean is the engineering design principle of evolvability. Evolvability states that it should be easy to make changes to a system. When building data pipelines, data is always in a state of flux, it is always changing, the systems are getting redesigned, the contracts are being redesigned; but still, how can you build something today that will last tomorrow as your system naturally evolves? The whole problem that, over time, it gets harder and harder to make changes to an already complicated system is what we want to address. I will also talk about the challenges we faced while building these pipelines at Zapr from scratch. About me: I am Agam Jain, I work at Zapr Media Labs as a tech architect, I work closely with the data engineering teams and drive initiatives around how we can improve the quality of data at Zapr, and in my previous roles in the same organization I worked on the core infrastructure pieces on which these data pipelines are now built. The outline of the talk is that I will talk about what Zapr does and how the
problem statement we are trying to solve at Zapr is essentially to build a data processing pipeline, and, once these pipelines are built, what the immediate challenges are that you will face in the organization. Similarly, I will talk about what a contract is and how it can address many of the problems you would otherwise face at a later stage around schema evolution, and about the benefits you will get if you follow this in your organization. So, about Zapr and the media consumption analytics domain: a broadcaster who is generating content every day does not really know who his real audience is, who is actually getting exposed to his content, because all the mediums he works in are offline. Zapr comes in and solves this problem by implementing audio content recognition at scale, and as a result we have become one of the world's largest media consumption repositories. To give you an idea of the scale at which we operate, we receive roughly 4 billion plus requests every day and process and generate 5 petabytes plus of data every day. These numbers are very big, and from an engineering perspective you really have to take care of handling this data at scale as well as ensuring the reliability of the data your systems generate. Now that I have given you an idea of the scale at which we operate, my next question is: how do you build such a system, what does it take? The initial idea, when we were building these systems a few years back, was why can't we just join the microservices bandwagon, why don't we just build a lot of lightweight microservices which talk to each other and process this data? But there is a problem with microservices: you end up tightly coupling your services, and there is an inversion of responsibility when things are not in order. To give you a concrete example, think about building a back end for a system like Aadhaar, and you have a user who wants to raise a request to change his address. Let's say this also mandates raising and logging a request for address verification. In a microservices architecture, when you raise a request to change your address, the responsibility now lies with the service that received this address-change request to also raise a request to the other service, and to ensure that both of them log everything together and it is done as one transaction. Doing this is very, very hard with microservices. So we saw this problem, and we know microservices are good to have, but they should be coupled with what we really wanted: we wanted to process data at volume, we wanted to trade off latency for higher throughput, and what we really wanted was an event processing pipeline built around a message queueing system like Kafka. Now, raise your hands, how many of you use Kafka in production? A lot of people here. So of course, Kafka. Why does everyone choose Kafka? Kafka scales, and that is one of the most important characteristics you want from your system: as your user base increases, or the types of workloads you are trying to run increase, the system has to keep up, and Kafka helps you with that. Similarly, the message semantics that Kafka gives you were also very close to what we wanted:
we wanted to never be in a position where we would lose data, and Kafka's at-least-once delivery guarantee is what we were looking for. Similarly, there was an emerging pattern, which you can see in the diagram: there are producers into the cluster and consumers out of the cluster, and if you chain producers and consumers together you basically build a pipeline of event processing over which you can build complex data pipelines. A repeated pattern of these producers and consumers is what we wanted, because it really decouples one producer from the next consumer; they can both operate independently. Now, the types of use cases we have been using Kafka for are around real-time processing. For example, if you have a requirement to deliver events in a very short amount of time, or to enrich a message flowing through your system in a very short time, you can use Kafka for that; you can tie it up with frameworks like Samza or even Spark Streaming to enrich your messages and push them through the system at low latency. Or you can micro-batch your workloads, take a set of messages together and process them at once, using something like Spark Streaming with Kafka. Similarly, since you are using a message buffer, your data set is effectively infinite: the events you are receiving today you are going to keep receiving tomorrow, so there is no notion of completeness of this set; what you would want to do is take this data and back it up for long-term storage, and you can use something like Apache Flume or Kafka Connect to dump these events to long-term storage. If you are building more complex systems, you can join this with batch processing and start another batch processing pipeline over that data using tools like Apache Spark or MapReduce. With that, I have given you enough of an idea of how you can build simple data pipelines in your organization.
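As a rough illustration of the repeated producer-and-consumer pattern described above, here is a minimal kafka-python sketch of one stage that reads raw events, enriches them and emits them onto the next topic. The broker address, topic names, group id and the enrichment step are placeholders, not Zapr's actual configuration.

```python
import json
from kafka import KafkaConsumer, KafkaProducer   # pip install kafka-python

BROKERS = ["localhost:9092"]                      # placeholder broker list

consumer = KafkaConsumer(
    "raw-events",                                 # upstream topic (illustrative name)
    bootstrap_servers=BROKERS,
    group_id="enricher",
    enable_auto_commit=False,                     # commit only after the produce succeeds
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    acks="all",                                   # lean towards at-least-once, not data loss
)

for msg in consumer:
    event = msg.value
    event["enriched"] = True                      # stand-in for the real enrichment step
    producer.send("enriched-events", event).get(timeout=30)  # block until acknowledged
    consumer.commit()                             # checkpoint progress back to Kafka
```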
Now, what I want to talk about is the challenges you will face once these pipelines are in production, when you have already built them and are maintaining them. Before we jump into the challenges, though, there are some fundamental questions you should ask before building any data system. What does it mean to say your data is complete? When you are working with buffers and endless streams of input, you have to define that. If data is generated from these pipelines and you have a data lake, or some OLAP system, running over it, is your data consistent with those other data sources? If your data has to be reprocessed or regenerated for some reason, does that happen within bounded SLAs? And lastly, if there are component failures in your data pipeline, how do they impact the final data? Ideally, they should not, and that is the concern with respect to pipelines: you should try to keep each component of your pipeline stable, and when it is not stable, you still have to ensure the reliability of the data you generate from it. Think of an example where you have a phone with a messaging app on it and you want to deliver a message to a friend; it is entirely possible that their phone is simply not reachable, so you are working with an unstable system, but you can still have your message delivered over the network when the phone comes back online. That is what reliability over unstable systems means. So the first set of challenges we faced in building these data pipelines was handling downtime of services. Services here are other systems, managed by other teams and not by the data pipeline, and they do go down, usually for a short amount of time, and for that time you still do not want that inversion of responsibility to creep into your system. What you should do is take a definite stance that it is not your responsibility to cover for a different service going down, because that breaks the rule against inversion of responsibility. So what we did when we saw this type of problem was fail fast: fail fast on the messages destined for the service that went down, and when the service comes back up it re-reads those messages and puts them back into the system, and hence your data reliability is ensured even when components are unstable. Handling slow consumers was the second problem. Imagine a system where you have Kafka running in production with a pair of producers and consumers; the production rate can at times overshoot the consumption rate, and that leads to a deficit between production and consumption. To put some numbers in context, say the producer in your pipeline is emitting messages at 100 per second and the consumer can only read at 60; you then have a deficit which keeps growing every second and has to be caught up, and you do not want to be in a situation where the deficit is so big that messages fall out of the buffer's retention window. So solving the slow-consumer problem is really important and has to be solved at each component of the data pipeline. We solved it by making sure we always measure the delay between the producer and the consumer, and if it overshoots a certain limit we are informed about it, because otherwise it leads to silent data losses which you will never keep track of; it is really hard to track and it leads to cascading failures, because when you first built these systems the producers and consumers were always in sync, but over time, over many releases and a growing user base, a deficit can start building, and that is when you have to step in and scale up your consumers. So the way we went about solving it was measuring, and scaling up consumers as and when the system was lagging behind. Now, data loss is by design not handled for you when you work with message buffers. Even in the last case you could end up with data loss, although there were ways to solve it, but sometimes there is no way to solve it: think of a bad release that starts randomly dropping events, or starts emitting corrupt messages. What do you do in that case? That is also data loss, and reaching a state where we did not have any data loss was a milestone for us as an organization. The way we went about it was to checkpoint at every hop which messages we are receiving in the system, generate a unique id for every message when it first enters our system, and on a daily basis reconcile whether we are losing events anywhere; and if some set of messages needs to be replayed, you can pick them up, replay them and put them back into the system, so you can still ensure the reliability of your data even when your systems are having problems.
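A toy sketch of the checkpoint-and-reconcile idea just described: tag every message with a unique id at ingress, record what each hop has seen, and diff the sets periodically to find silent losses and candidates for replay. The class and function names are made up for illustration; a real pipeline would persist these ledgers rather than keep them in memory.

```python
import uuid

def tag_at_ingress(event):
    # Every message gets a unique id the moment it enters the pipeline.
    event["_msg_id"] = str(uuid.uuid4())
    return event

class HopLedger:
    """Per-hop checkpoint of which message ids have been seen."""
    def __init__(self, name):
        self.name = name
        self.seen = set()

    def record(self, event):
        self.seen.add(event["_msg_id"])

def reconcile(upstream, downstream):
    # Run daily: anything the upstream hop saw but the downstream hop did not
    # is a silent loss and a candidate for replay.
    missing = upstream.seen - downstream.seen
    print(f"{len(missing)} messages lost between {upstream.name} and {downstream.name}")
    return missing

# Usage sketch: one message silently dropped by a "bad release".
ingress, enriched = HopLedger("ingress"), HopLedger("enriched")
for raw in [{"payload": i} for i in range(5)]:
    ev = tag_at_ingress(raw)
    ingress.record(ev)
    if ev["payload"] != 3:
        enriched.record(ev)
to_replay = reconcile(ingress, enriched)
```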
So far I have been talking mostly in terms of reliability and stability of components. The next point I want to talk about is more of a long-term concern: having a global view of the schema. As the organization grows from 5 engineers to 15 engineers, with these data pipelines already running in production, new people will start contributing to your systems, and the problem they run into is that they have no idea what the message, or the schema of the message, flowing through the system really is. It genuinely becomes a problem when you scale from 5 to 15 engineers and everyone is trying to contribute. One of the cases we saw was that, just by chance, somebody changes the semantic meaning of a field: for example, if you model timestamps as strings, somebody might just change them to epoch values, and the systems consuming that field are simply going to blow up. This is not what you want, so having a global view of the schema is really a requirement as your organization scales. The way we went about solving this problem was to establish contracts. The contract here is not between the system and its users; the contract really is between the developers, about how you model your system, what the fields are, and how you introduce them into your system. Think of contracts as the model classes in a well-designed code base, where responsibility drives the interaction between model classes, and you can build complex has-a relationships between models to model your pipelines as well. Take for example a sample message I created: a data pipeline message has a user, the user has a device, the device has a device class, and the user also has a location. You can thereby maintain a hierarchy of the messages in your pipeline, and this models your systems very clearly. Now I have given you an idea of why these contracts are important, but what is the format of choice when you actually implement them? Think about the model classes again. If you love one language, say Java or Python, in your organization, you can use its native serializers to send these entities over the network, but this is a very naive way to do it: the entity that reads the message on the other side also has to be Python or Java, and it still does not solve the problem of evolving the schema; if you add a field at one point and send the message to the next step in your data pipeline, that step will not be able to understand what you have changed. So a language-specific serialization would work, but it would fail very quickly. Then there is JSON as a format; pretty much everyone here loves JSON, and we could talk in terms of JSON, but the problem with JSON is that it is verbose. The good part is that it is verbose, and the bad part is that it is verbose: when you are building systems at this scale, it becomes slow to parse.
JSON also has its own limitations: it cannot handle byte arrays, and it has constraints around how you send things like images over the network inside a JSON message. Of course you can work around that by encoding them into other representations, but it is not naturally supported. Then you have specialized serialization and deserialization libraries which take care of reading the schema and serializing the message accordingly. Avro is one example: Avro reads messages top to bottom and serializes them into byte arrays, and on the other side the schema is again needed to deserialize a message. The problem we saw when evaluating Avro was exactly that: to deserialize a message on the other side, you need to have present the exact same schema that was used to serialize it. From a data pipeline perspective that becomes a problem, because the naive way to handle it would be to emit both the message and the schema in the same blob and send them over the network, and that adds overhead to every message and becomes clumsy. What we finally went with was Thrift. Thrift is a project donated by Facebook to the Apache Foundation; it includes a lot of other things, but it has a very good serialization and deserialization library. Thrift serializes messages on the basis of tags, and these tags store the type of the field you want to serialize along with the payload, so it has a very straightforward serialization technique, and on the other side you can deserialize; it also gives you the ability to have these schemas available globally. The other advantages of Thrift are that it strongly defines the types of fields you can have in your system, it clearly defines that you can have a byte array, an int, a short, or even a double or floating-point number, and, since everything works on tags, it allows very easy renaming of fields and types: if you want to evolve your system and call these fields something else tomorrow, you are still able to do that while the tag still carries the semantic meaning you started with. Similarly, Thrift has bindings for all the major languages you are likely to use, so the model classes I mentioned earlier can be generated by the Thrift compiler in any language you are going to use; it really eases the development effort and solves the cross-language problem within one organization. It also has a very fast serializer that allows you to quickly serialize and deserialize messages. For example, take a structure you want to serialize: a person, who has a username, can optionally have a favourite number, and has a list of interests. The way Thrift serializes this is that you decide a protocol, and it writes the tag, the type of the field, and the payload, one after the other, and on the other end of the pipeline it is able to deserialize these messages and reconstruct the object. This is what we used. And as I briefly mentioned, with the Thrift interface definition language, the same entity-relationship diagram I showed you earlier can be written down as a global definition for your organization, and you can check it into Bitbucket or your GitHub repositories so that everyone has a common view of it. Now, I have talked about contracts and why you need them.
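To illustrate the tag, type and payload idea with the person example above, here is a deliberately simplified Python sketch. It is not the real Thrift binary protocol, and the type codes and field tags are invented; in practice the Thrift compiler generates the model classes and the serialization for you.

```python
import struct

# Field tags for the example struct: 1 = username (string, required),
# 2 = favourite_number (int, optional), 3 = interests (repeated string, optional).
STR, I32, LIST_ITEM = 1, 2, 3   # toy type codes, not Thrift's real ones

def write_field(buf, tag, type_code, payload: bytes):
    # Header is tag (1 byte), type (1 byte), payload length (4 bytes), then the payload.
    return buf + struct.pack(">BBI", tag, type_code, len(payload)) + payload

def serialize_person(username, favourite_number=None, interests=()):
    buf = b""
    buf = write_field(buf, 1, STR, username.encode("utf-8"))
    if favourite_number is not None:              # optional fields are simply skipped
        buf = write_field(buf, 2, I32, struct.pack(">i", favourite_number))
    for item in interests:
        buf = write_field(buf, 3, LIST_ITEM, item.encode("utf-8"))
    return buf

def deserialize_person(buf):
    person = {"interests": []}
    while buf:
        tag, type_code, size = struct.unpack(">BBI", buf[:6])
        payload, buf = buf[6:6 + size], buf[6 + size:]
        if tag == 1:
            person["username"] = payload.decode("utf-8")
        elif tag == 2:
            person["favourite_number"] = struct.unpack(">i", payload)[0]
        elif tag == 3:
            person["interests"].append(payload.decode("utf-8"))
        # Unknown tags from a newer schema version are ignored, which is what
        # keeps an older reader compatible with a newer writer.
    return person

print(deserialize_person(serialize_person("alyssa", interests=["movies", "cricket"])))
```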
The next problem is that contracts will always change; these contracts are bound to evolve, and that is what we planned for from the start. Think about a scenario in your data pipeline where somebody wants to add a new feature to the system: he updates his feature, he updates the contract, and he has to introduce a new field. Let's say he introduces contract version 1.5 while the whole pipeline so far is running on 1.4, and he starts emitting messages in 1.5 while the consumer is still running on 1.4. What you want to ensure is that the consumer trying to read this message is still compatible and can still read the information it wanted; that is backward compatibility, and you have to ensure that any change you make to the system is backward compatible. Another example: let's say the consumer gets updated before the producer, so the producer is emitting on, say, 1.3 and the consumer is running on 1.4; the consumer must still be able to understand the message being sent, because you do not want to break your pipelines at any point in time, and that property is forward compatibility. So any change you make to your system, you want to make in both a forward and a backward compatible manner, and this is the concept of full compatibility: you make a change and ensure you have not broken any forward or backward compatibility rules, so your systems will not break. Now, my question to you: will all the data your system has ever generated, say it is backed up on deep storage for years, be fully compatible as well? Probably not. The problem is that you only ensure compatibility between version X and the version just before it, X-1; what you actually need is to transitively ensure compatibility between X and X-2, X and X-3, all the way back to the first version. You cannot break even one contract version, otherwise the data you are trying to read becomes unparsable. To give you an example: in version X-2 you decide that you do not need the string timestamp field I mentioned earlier, so you introduce version X-1 where it is deleted; that is a perfectly compatible change, both forward and backward, if the field is not necessary for you. Then, from X-1 to X, you add that field back again with a different type, say a long. You are still compatible between X-1 and X, but you have broken compatibility between X-2 and X. Cases like this can happen, so what we did at Zapr was try to capture full transitive compatibility via Thrift, by putting some guidelines around it. We ensured that nobody introduces a new required field or removes a required field; new fields that are added to the system are always optional; index tags that have been used once are never reused, because the index tags are what go into the binary; and you never actually remove optional fields, you just mark them as deprecated, so that you protect yourself against reusing that tag or changing its type. If you follow these practices, your systems will be fully, transitively compatible.
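These guidelines can be checked mechanically. Here is a small sketch of such a check over two schema versions, using a made-up representation of a schema as a map from tag to field name, type and required flag; in practice you would derive this from the Thrift definitions and run the check in CI.

```python
# Each schema version: {tag: (field_name, field_type, required)}.
# retired_tags: tags that were ever used before and must never be reused or retyped.
def check_evolution(old, new, retired_tags):
    errors = []
    for tag, (name, ftype, required) in new.items():
        if required and tag not in old:
            errors.append(f"new field '{name}' (tag {tag}) must be optional, not required")
        if tag in retired_tags and tag not in old:
            errors.append(f"tag {tag} was retired earlier and cannot be reused for '{name}'")
        if tag in old and old[tag][1] != ftype:
            errors.append(f"tag {tag} changed type {old[tag][1]} -> {ftype}")
    for tag, (name, ftype, required) in old.items():
        if required and tag not in new:
            errors.append(f"required field '{name}' (tag {tag}) cannot be removed")
    return errors

# The example from the talk: a string timestamp (tag 5) is dropped in X-1,
# then added back in X with a different type: fine against X-1, broken against X-2.
x_minus_2 = {1: ("user_id", "string", True), 5: ("timestamp", "string", False)}
x_minus_1 = {1: ("user_id", "string", True)}
x         = {1: ("user_id", "string", True), 5: ("timestamp", "i64", False)}

retired = {5}                                   # tag 5 was used once, so it is retired
print(check_evolution(x_minus_1, x, retired))   # flags the reuse of tag 5
print(check_evolution(x_minus_2, x, set()))     # flags the type change string -> i64
```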
Now, the benefits your organization will see if you follow these practices. First, you pretty much remove the task of data cleaning: you will not hit errors when you evaluate or re-process data that is three years old against your current contracts, and that leads to a saving in cost as well as time. Second, you will start thinking of these data systems as pipelines; by that I mean you will think of them in terms of contracts, and data pipelines will look to you like a set of contracts you need to establish even before you try to build them, which puts things in place so that you can develop faster. Similarly, if you have more than one data pipeline in your organization, you can reuse the structures you created for the first pipeline and tie them together at a later point in time when both are ready, so it really eases the management of data across your organization. Lastly, and most importantly, it improves your level of productivity: it becomes predictable who owns which part of the pipeline, and a lot of people can now contribute to your systems. Predictability was one of the most important things we were able to achieve through contracts and the evolution of contracts. So yeah, this is a QR code you can scan to get the references I have used for the deck. Any questions? Well, a lot of applause for Agam, and we also have some time, so we have time for a few questions before the next speaker. Questions for Agam? Did you evaluate Google Protocol Buffers or FlatBuffers along with Thrift? Yeah, so when we were evaluating Thrift, Google protobuf, which is also tag based, has very similar semantics to Thrift; it was mainly that it was Google and we wanted something Apache, and also protobuf is not used by Google now, so it is something we skipped over when we were evaluating. More questions? Yeah, so how do you solve the small-files issue when you are storing to the data lake, whether HDFS or S3; either way there is a small-files problem. Okay, so we would use tools like Flume, where you can read these messages in batch and set the size of the roll, either based on time or based on size, and then you can actually create, let's say, 500 MB files, or, if your block size is 128 MB, you can create 128 MB files. Okay, yes, for the data lake requirement, I think I mentioned there as well that you would have that in hours, so the data lake will make your data available within a few minutes to a few hours; that is the trade-off. Sorry, can we have the question on the mic, so let him ask. Have you faced challenges where a change in one input contract affects another contract, and do you have a framework to manage that? I didn't quite get the question, can you give an example? Suppose there are multiple microservices running and a change in one input contract, say between A and B, leads to a change in schema where the interaction between B and C fails; is there a way to track it beforehand? I guess, am I not audible, or is the question not clear? So, from what I understood: what we have tried to do by using contracts is establish has-a relationships, a user has a device, so only those types of relationships are captured, and since you have modelled your pipeline like that, all you ever do is add fields or remove fields, so it should not affect places other than where you wanted to make the intended change; the back-end system can work the way it wants to work, but finally the message that is emitted here is just one.
Okay, one last question there.

How do you scale the Kafka clusters? Generally when you add a new instance it doesn't automatically replicate, so how do you handle that situation?

Right, there are a lot of tools out there; I think Airbnb has a Kafka tool you can use. And yes, agreed, new brokers can be added and then you have to rebalance your partitions; it should be straightforward if you automate it.

Okay, I think there are a few more questions, but we will have to take those, and any clarifications you are asking for, offline. Thank you, Agam. And okay, we are at the last talk for the day. For those interested, there are still flash talks going on in the other auditorium, so if anybody wants to talk about their own work, get feedback, and engage with other people in the community (it should be open source), it can be just five minutes, you don't need any slides, and you can also give a demo if you need to; please make your way to the other auditorium. There has been a whole lot today about data and data cleaning, data pipelines, data contracts. One of the topics I think we have not talked about, or that people have asked about, is real-time data: what happens when the velocity of this data increases? And I think we have a good example here from Hotstar; at least I was watching the World Cup on Hotstar, so I'm sure there is a lot of throughput there in their video sharing, or rather video streaming. So we have here the last talk, by Namit from Hotstar, on analyzing high throughput data in real time.

Can everybody hear me? Thanks, good to go. So, great talk by Agam; we will be talking about a parallel idea as well. We are going to talk about how we process and understand data, hello, how we process and understand data at scale, right? We are also going to discuss certain use cases that Hotstar had and how we solved them to get a lower turnaround time and to react quickly. So I'm Namit, I work in the data engineering team at Hotstar. I'm highly available at Namit.hotstar.com; you can reach out to me if you have any queries about this, or if you want to ask me any questions anytime. Sorry. So, a little about Hotstar: most of you have either used Hotstar or heard about Hotstar from someone or the other. I'm going to define a term called Hotstar Scale, in which I have seen a couple of patterns that I can divide into three main groups: ingestion, storage, and consumption patterns. The Hotstar ingestion patterns look like this: we had a million events per second ingested for 30 to 45 minutes consecutively, and direct ingestion goes into a messaging queue, which is Apache Kafka; the ingestion layer is highly available and durable in nature, and we are projecting around 50 billion events per day in the near future. Similarly, for our storage patterns, we ingest around 14 TB of clickstream data a day, and we have a traditional data platform where we store our data in S3 and HDFS as the data lakes; we warehouse in HBase as well, and we project around 30 TB of data per day, in line with our ingestion patterns. Our consumption patterns, on the other hand, which will be the highlight of this particular presentation, come to 8.1 TB of data directly consumed through the Kafka clusters, be it your stream processing frameworks, whose use cases we will discuss in detail in the next slides, be it stationary data, be it archival jobs, be it connectors that are running, or be it some pre-processing that we need to do; all of them account for around 8.1 TB of data consumed directly.
What we will cover in this presentation is basically what led us to experiment with stream processing at Hotstar and why we pushed in this direction, for which we will discuss two main use cases. The first case study is video delivery metrics: we will discuss how there are certain crucial metrics, or P1 metrics as we call them, that help us alert on and improve the performance of video delivery. The second is social signals, a very interesting concept that our product managers came up with; it is mainly about engaging with users, the targeting and engagement we do, which we will also discuss.

So, being a platform at the scale I mentioned before, there are millions and billions of clickstream events ingested daily. How do we track that something like this is happening to you, and how do we know it is happening to multiple users at scale, maybe 2 million or 5 million people? Or maybe there is an increase in buffer time; suppose, due to a feature release by the video team, there is some increase in buffer time, how do we figure that out and roll it back in time so that we don't lose our users? Social signals, as I said, are basically engagement indicators that we use: say I want to show a personalized count of online video viewers on the platform, how do we do this in real time? That question is what led us to experiment here, and definitely not with batch processing at such a scale, where viewers generate billions of clickstream events. With these billions of events, if you go with batch processing, the data cannot be processed in real time and you cannot get the metric in time. Suppose there is an issue where the video is failing: you need a metric calculation for that particular video or content ID, and if you calculate that metric through batch processing, you have already lost a couple of million users by the time you realize something is wrong with the platform and your viewers are struggling. The same applies to user notifications to increase engagement, whether you are a paid customer or not. How can I understand so much data in close to real time?

Let's move on to the first case study. For every second a person spends on buffering, some number of users drop off; suppose for every second a couple of thousand users drop off and you are not getting alerted on time. If you get alerted after maybe half an hour, millions of users have dropped off the platform, they may already be complaining and ranting on Twitter, and you can't do anything about it. This is about making sure the last-mile delivery of the video and the behaviour of the app on the client side are made better, so that we can serve you better. Now, how did we solve playback failure rate? Playback failure rate is one of the main metrics of video delivery. We ingest multiple clickstream events, as I mentioned: if you are watching a video and your video starts playing successfully, we fire an event called the started video event, and when your playback fails, we fire an event called the failed video event. Both of these events are ingested into the data platform, and playback failure rate is basically the ratio of distinct failed video to distinct started video events, which means the percentage of users that are facing a failure on playback.
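As a rough illustration of that metric (my own sketch, not Hotstar's code; the event structure and field names are assumptions), playback failure rate for a window of events can be computed from the distinct users behind each event type:

```python
# Illustrative sketch of playback failure rate (PFR) over one window of events.
# The event structure (user_id, event, content_id) is an assumption for illustration.

def playback_failure_rate(events, content_id=None):
    """PFR = distinct users who fired 'failed_video' / distinct users who fired 'started_video'."""
    started, failed = set(), set()
    for e in events:
        if content_id is not None and e["content_id"] != content_id:
            continue  # optionally compute the metric for a single piece of content
        if e["event"] == "started_video":
            started.add(e["user_id"])
        elif e["event"] == "failed_video":
            failed.add(e["user_id"])
    if not started:
        return 0.0
    return 100.0 * len(failed) / len(started)  # percentage of users facing playback failure

# Example: 2 of the 3 users who started playback also saw a failure -> ~66.7%
window = [
    {"user_id": "u1", "event": "started_video", "content_id": "match42"},
    {"user_id": "u2", "event": "started_video", "content_id": "match42"},
    {"user_id": "u3", "event": "started_video", "content_id": "match42"},
    {"user_id": "u1", "event": "failed_video", "content_id": "match42"},
    {"user_id": "u3", "event": "failed_video", "content_id": "match42"},
]
print(round(playback_failure_rate(window, "match42"), 1))  # 66.7
```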
How did we solve this? We used a KSQL kind of integration that comes with Kafka, and I will go into the flow of how we went about solving it. To give you a brief example: suppose a person is watching a match and his playback started successfully. We fire a started video event carrying multiple fields, maybe the message ID, the time it started, the content ID, the platform; there can be various granularities of detail that we can send, and it is pushed into the data ingestion platform and into Kafka. Similarly, someone else starts watching the video on another client. These streams are exposed as tables, and the way we do that is by windowing the data on Kafka: we windowed the data into a predefined window of 10 minutes and we query on that particular data set only. Now, that particular data set, or that particular table we are trying to query, is a series of changelogs which together make up the table. We leveraged this and got sub-second latencies; we got to know what the playback failure rate over the last 10 minutes had been. The 10 minutes is configurable, it is totally up to you: you may want a per-minute rate, you may want to calculate it per second, or you may want to store a longer window of data, depending on what your use case is. Now, how does this help us?
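To make the windowing idea concrete, here is a minimal sketch of the same computation as a 10-minute tumbling window over a Kafka topic. The talk used a KSQL-style integration over changelog-backed tables; this sketch uses plain kafka-python with in-memory windows purely for illustration, and the topic name, JSON fields, and window size are assumptions.

```python
# Illustrative sketch: playback failure rate per 10-minute tumbling window,
# computed directly from a Kafka topic. Not the KSQL setup described in the talk;
# topic name, event fields, and window size are assumptions.

import json
from collections import defaultdict
from kafka import KafkaConsumer  # pip install kafka-python

WINDOW_MS = 10 * 60 * 1000  # 10-minute tumbling windows (configurable, as in the talk)

consumer = KafkaConsumer(
    "playback-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# window start (epoch ms) -> sets of distinct users who started / failed playback
windows = defaultdict(lambda: {"started": set(), "failed": set()})

for message in consumer:
    event = message.value
    window_start = (event["timestamp_ms"] // WINDOW_MS) * WINDOW_MS
    bucket = windows[window_start]
    if event["event"] == "started_video":
        bucket["started"].add(event["user_id"])
    elif event["event"] == "failed_video":
        bucket["failed"].add(event["user_id"])

    started, failed = len(bucket["started"]), len(bucket["failed"])
    if started:
        pfr = 100.0 * failed / started
        print(f"window {window_start}: playback failure rate = {pfr:.2f}%")
```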