Hi, I am Puneet and I work as VP of data engineering at Xeratum. Xeratum is a startup in Pune working in big data; we do services as well as products. Today we will be showcasing one of our products with a machine learning use case covering digital propensity. Regarding me, I have overall 11 years of experience. I have worked with Oracle, PubMatic, Amdocs and Teasis, and I have been in the big data space for about 7 to 8 years now.

So let's go over the agenda. We will talk about digital market propensity and where exactly it is applied. Thereafter we will cover the legacy implementation and the challenges in the original machine learning pipeline. After that we will talk in general about the challenges in building a unified ETL and machine learning platform. Then we will introduce Ake Streams, the product with which we will be showing this particular use case. We will go over the simplified architecture of Ake Streams, its features and benefits; I will show you a live demo of digital market propensity; and we will close with the top differentiators and the road ahead for Ake Streams as a product.

Okay, so most of you would know about digital propensity. It predicts a user's purchase trend, not only based on the activities he is doing online but also using his demographic information. Demographic information is very much needed to do propensity modeling that is more relevant than models which use only user activity, the clickstream data. So in this particular digital propensity model we rely not only on the customer's activity but also on his demographic information: his salary, his marital status, his household size, the language he speaks, and so on. In simple terms, digital market propensity asks: if you are a customer, how likely are you to purchase a given product, for a given brand or a given price range, in that particular month?

This machine learning model can be used, and is used, in search and browse recommendations to enhance the browsing experience — showing a customer the relevant items rather than items that are out of his price range or brands he is not even interested in. It can be used for discount optimization, where based on his purchase propensity towards a price or a brand we can specifically target discounts instead of giving a generalized discount which does not apply to many customers. It can be used for ad monetization, for relevant targeting. And it can power futuristic shopping experiences, similar to the in-store mirrors we see today where you go and tag yourself: once the system identifies that customer and his propensity towards a price and a brand, it can guide him to the relevant items, enhancing the in-store customer experience.
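In the logistic-regression terms introduced later in the talk, a propensity score of this kind is simply the model's predicted probability of purchase for a customer's feature vector $x$ — a standard formulation stated here for orientation, not quoted from the talk:

$$P(\text{purchase} \mid x) \;=\; \sigma\!\left(w^{\top}x + b\right) \;=\; \frac{1}{1 + e^{-(w^{\top}x + b)}}$$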
Okay, so talking about the legacy implementation and challenges: this market propensity model used logistic regression on a 30-million-customer base with 200-plus million activities across the clickstream data and the in-store data. We used three different data sets here: the clickstream data, the demographic information for the customer, and the product data. These three data sets fed the logistic regression model.

We used five feature components here, and the models were generated over these five features: product category was one of them, product subcategory was another — when I talk about category it is something like jewelry or clothing, and a subcategory of clothing could be tops or bottoms or shirts — then comes the age factor, so there were age buckets, and then gender. We created all the possible combinations of these feature sets. Taking these features into account, for each given customer there were up to around 3,600-plus different combinations. Now, this model was trained on four months of training data, so you can imagine: 3,600 combinations crossed with the date part takes you to a pivot table with 300K-plus columns. Given such a huge denormalized table, with a pivot of 300K-plus columns, the pivoting itself was the problem — it was very difficult to create a pivot on such a large data set.

Talking about the process: first, the demographic data bucketing was done. For age we divided customers into different buckets; based on salary we divided them into buckets; likewise for household size and for the languages spoken. All these feature sets were bucketed. Second, the clickstream data was used as a feature, and we applied what we call a shiny aggregation: if I do a particular activity in a given month, we create the record in reverse chronological order for all the combinations out there — so if I buy a particular product this month, for the last four months I create that record and then do all possible combinations of aggregation for that customer. Third was generating the product and demographic pivots, which were very large; later the data set was vectorized and then standardized. Standardization was necessary because in most cases the number of product views is much higher than the number of products actually purchased, so we have to scale them to the same level.

Now, the challenges in the legacy implementation: only 5% of the customer data was used because of the huge size of the pivot; only the top 20 features were referenced for feature engineering; ML model training took 18 hours; and we had varied scores because of the sampling issue, which is liable to happen. So these were the feature engineering issues.
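For illustration, here is a minimal Spark sketch of the kind of demographic bucketing described above, assuming a hypothetical customers DataFrame with age and salary columns; the split points are made up, not from the talk:

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("bucketing-sketch").getOrCreate()

// Hypothetical input: customers(customer_id, age, salary, household_size, ...)
val customers = spark.read.parquet("/data/customers/")

// Age buckets (illustrative split points)
val ageBucketizer = new Bucketizer()
  .setInputCol("age")
  .setOutputCol("age_bucket")
  .setSplits(Array(0.0, 18.0, 25.0, 35.0, 50.0, 65.0, Double.PositiveInfinity))

// Salary buckets (illustrative split points)
val salaryBucketizer = new Bucketizer()
  .setInputCol("salary")
  .setOutputCol("salary_bucket")
  .setSplits(Array(0.0, 25000.0, 50000.0, 100000.0, Double.PositiveInfinity))

val bucketed = salaryBucketizer.transform(ageBucketizer.transform(customers))
```

The same pattern extends to household size, language, and the other demographic features mentioned in the talk.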
Okay, so we have talked about digital market propensity, and we have talked about the legacy implementation — creating the pivot and vectorizing it, with the pivot being the issue. Now let's talk about the challenges in building a unified platform in general, and thereafter we'll go over the optimizations we did for this particular pipeline using our application.

When I talk about a unified platform, I mean a platform which not only lets you do ETL for batch processing but also lets you develop stream processing pipelines, and lets you create machine learning pipelines as well — and by machine learning pipelines I mean it not only lets you create the models but also lets you score them. There are products and tools which let you do the training but lack the features to help you do the scoring easily; you have to make some changes, without which it becomes difficult to go from training to scoring. Then, a lot of low-level programming is still required: even though Spark provides interfaces such as Spark SQL, you still have to go and write a lot of low-level code. Third, a lot of time is wasted in debugging — what failed at which stage, and what was the reason for the failure — because you have to dig through the logs for that particular issue. Then there are gray areas around fault tolerance: what happens if my running streaming pipeline fails? There are still issues connecting multiple targets in one single pipeline. Security is another important feature which needs to be present. And there is no collaboration: collaboration exists on notebooks, but when it comes to applications we don't see any collaboration where, if I am a developer, I can share my code with someone else and he can reuse it.

Okay, so Ake Streams, in short, is an enterprise-ready, self-serve, unified batch and streaming platform that also lets you develop machine learning pipelines on the same canvas. The benefits of using Ake Streams: it lets you develop very rapidly, and the reason is the simplified drag-and-drop interface, comparable to — or, I can say, better than — the top tools in the market right now. Second is abstraction: the complexities are abstracted away for all the cases where we found you otherwise have to write custom Spark code, especially around stateful aggregations or machine learning modeling and scoring. It is low cost, because you get to develop your pipeline very fast, and it provides the additional features you would otherwise have to take care of when you go to production — it is not only the pipeline you need to develop, but also things like capturing metric information that helps DevOps know whether a job is going to fail or not.
Second: what happens if a wrong record comes in — how can I skip that record and not let my job fail? All of that, plus the further tunings, is very much necessary beyond the pipeline you develop. And there is no vendor lock-in, because it is all open source plus the custom code we have written. So these are the benefits of using Ake Streams.

Regarding the features: we have 20-plus IO connectors, tuned so that you can pull data from sources and push data to sinks at scale. We have 50-plus operators for your ETL processing activities — be it extracting data, joining data sets, or enriching the data — all those operators are present, be it unions, rollups, cubes, everything. We have 45-plus estimators, which is almost all the estimators actually provided by Spark; they can be used as a simple stage — you just configure the hyperparameters there and you are ready to use them in your model. And all the 15-plus models supported by Spark can likewise easily be modeled on top of Ake Streams.

It lets you schedule your workflows. It has a common marketplace in which we have predefined pipelines that serve particular use cases — like the digital market propensity we are talking about today: that pipeline is already there, ready to use. You can just download it and use it for your own case by changing the data sets; the logic remains almost the same. It has unified batch and streaming pipelines, and you can visually create the ML and ETL pipelines. When I talk about visual creation of ML pipelines, it is not only the drag-and-drop interface but also understanding how the different hyperparameters of your estimators and models change, and what the impact of changing a given parameter is. Then there are 100-plus ETL and ML components and connectors for big data systems, a very intuitive dashboard for metrics and monitoring — which is missing in Spark right now — and, most importantly, an SDK for developers: if you have your own custom code base, you can just put it in the custom plugin stage on the canvas. The input is a DataFrame and the output is a DataFrame, so any custom jar you have written can easily be embedded in Ake Streams; even migration from your old legacy systems to Ake Streams becomes very easy.

Okay, so let me show you some of the features of Ake Streams. This is the simplified drag-and-drop interface. I am creating a pipeline here — which type of pipeline? Batch, which also covers machine learning pipelines — and you can put in different components. I am using Kafka as one of the connectors here to read data from Kafka; I can do the configuration and provide the name of the schema there. There are extractors for JSON, XML and all the other file formats, where you just go and define the schema and your source is ready for reading with that particular model. You can even do joins between different data sets, specifying the type of the join and whether it is a broadcast join or one of the other joins supported. Finally, you write to the target. That is how easy it is to construct a pipeline.
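Hand-written, that same pipeline would look roughly like this in Spark — which is exactly the work the drag-and-drop canvas abstracts away. Topic name, schema, and paths below are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("kafka-pipeline-sketch").getOrCreate()

// Schema for the incoming events (illustrative)
val eventSchema = new StructType()
  .add("customer_id", StringType)
  .add("product_id", StringType)
  .add("event", StringType)

// Read a batch from Kafka and parse the JSON payload
val events = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "clickstream")
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("e"))
  .select("e.*")

// Join with a product data set (broadcast, one of the join types the UI exposes)
val products = spark.read.parquet("/data/products/")
val joined = events.join(broadcast(products), Seq("product_id"))

// Write to the target
joined.write.mode("overwrite").parquet("/data/output/")
```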
For scheduling, you just provide the name of that particular pipeline and specify the time at which it needs to run. Then, if you see here, green means that pipeline is running; if it has failed, it shows in red. This is the monitoring dashboard. The advantage of using this dashboard is that it shows you the metrics across all the different batches, which is very useful especially for streaming pipelines: looking at these metrics you can easily identify whether your job is going to fail — if the input has increased and your memory footprint is also increasing, or even if the input is decreasing but your memory footprint is still increasing, you can tell the job is heading for failure. Not only that, it has a direct link to the YARN cluster as well, where you can go and see the logs for each particular stage or target, and there is a sunburst metric chart which shows the different metrics for each job, stage, and task.

When it comes to security, it has LDAP integration: you can create users, choose them from LDAP, and then at a custom level specify which access each user should have. When I talk about access, it means that even for creating sources — if there are different types of sources — you can restrict that particular user's access to a given source, target, data set, or custom plugin stage. For the IO connectors you have all the different connectors — Kafka, Kinesis, RateStream — and for targets you have Cassandra, Hive, and so on; all these connectors are readily available for you to use. Once you define your sources or sinks, you just drag and drop the one you need, specify the schema, do your ETL, and write to the output. It is that easy.

Okay, now coming to the audit and collaboration part. Each and every activity done on top of Ake Streams is captured and can be shown as an audit: which user created which component, which user deleted which pipeline — everything is audited and shown to the users. Second, the collaboration part: if you need access to a given pipeline, you can request access to it, and once the owner approves access to that component — be it at the source level, pipeline level, or target level — you can go and reuse it in your own pipeline. Here I am granting access to a user who has raised a request; once it is granted, he can go and use that pipeline. Consider the case where a lot of developers need access to a particular source or target: a DevOps person can grant that access to those developers based on need. And this is the audit view, where whatever changes we have made are shown to the user.

Now coming to the error handling part. When you create a pipeline, you can choose the sink where you want all the error records — the records which do not satisfy your pipeline logic — to be written. It could be the console or any Kafka target, wherever you want to write.
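The product's error-sink mechanics aren't shown in code, but the idea maps onto a common Spark pattern: tag records that fail validation with a reason, then route good and bad records to separate sinks. A minimal sketch with assumed column names and paths:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("error-sink-sketch").getOrCreate()
val input: DataFrame = spark.read.parquet("/data/incoming/")   // hypothetical input

// Tag each record with a rejection reason (null means the record is fine).
val validated = input.withColumn(
  "reject_reason",
  when(col("customer_id").isNull, lit("missing customer_id"))
    .when(col("age").cast("int").isNull, lit("non-numeric age"))
)

val good = validated.filter(col("reject_reason").isNull).drop("reject_reason")
val bad  = validated.filter(col("reject_reason").isNotNull)

good.write.mode("append").parquet("/data/clean/")
bad.write.mode("append").json("/data/errors/")   // plays the role of the "error sink"
```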
Once the pipeline runs, if there were a few error records, it writes those error records to those sinks along with the details: in which stage the record was rejected and what the reason for the rejection was. So in one shot you get to know all the features, or the ETL mapping logic, that is missing from your pipeline definition, and in one shot you can make all those changes — instead of iterating change by change for each error you find as you run the program.

Okay, so coming to the Ake Streams architecture. On the left you see the different data sources you connect to; on the right are the data sinks; and on top you see the components related to administration and the pipeline life cycle. Under administration we have installation, which is made very easy; there is security integration with LDAP; you can submit jobs to Kerberized clusters; you can import and export pipelines from the marketplace for given use cases; and there is the audit to know exactly what any user has done. You get to develop your pipelines very easily using the drag-and-drop interface; it has error sinks; it has a very informative metrics and monitoring page; and it has versioning support, scheduling, and notifications as well — for example, if the monitoring system knows that my streaming job is going to fail, that notification can be published.

Okay, so without spending much more time, let's start with the market propensity modeling on top of Ake Streams. Here I go and define the machine learning pipeline, whose name is "propensity". I specify that there are no error sinks as of now, that it is a batch pipeline, and I can set the different Spark parameters there. The first data set is the clickstream data set which is needed for the market propensity model, so I drag and drop the clickstream source. Once I have connected to the source, I define the schema: for clickstream we have used the customer ID, the product ID, the event — what kind of event it was, whether a purchase, a view, or an add-to-cart — and the month in which that event took place. Once you define your clickstream schema, it extracts only the corresponding records. Thereafter you join your clickstream data set with the product data, so you define the product data set: it could be in Hive, or it could be a text file on HDFS, in which case you just provide the path and specify the schema of that file. You can either add the columns one by one or infer the schema of the file by uploading a sample file for that data set. For product we have used the product ID plus the product category, subcategory, and brand — the columns needed for feature engineering later. Thereafter there is one more data set I talked about, the demographic data set. This is a bucketized data set in which each record is at the customer level with the different buckets — age buckets, income buckets, household, marital status — all those features bucketed. Here I just use that file, upload a sample of it, and it creates the schema out of it. That is how easy it is to infer the schema of a file with a lot of columns, especially for a bucketized data set like this one.
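In plain Spark terms, that sample-file trick can be sketched as: infer the schema once from a small sample, then reuse it when reading the full data set (paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("schema-inference-sketch").getOrCreate()

// Infer the schema from a small sample file...
val sample = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/demographics_sample.csv")

// ...then reuse that schema for the full bucketized data set,
// avoiding a second expensive inference pass over all the data.
val demographics = spark.read
  .option("header", "true")
  .schema(sample.schema)
  .csv("/data/demographics/")
```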
Okay, so now we have the clickstream data, the product data, and the demographic data. Next we go and join the product data with the clickstream data. Once it is joined, I can project the columns: there are a lot of columns which might not be needed, so I project only the ones which are required and leave the rest at that stage.

This is the custom jar stage I was talking about: if you have any of your existing code base, you just specify the name of that particular class and the jar path, and it is used here. It just needs the input DataFrame to pass to the jar, and you specify what output you expect once the custom jar has run. In this particular part, I take the product and the clickstream data set and vectorize the records at each ID level — that is, the customer ID level — for each combination of product category, subcategory, brand, the kind of event, event month, age, and gender. This comes to around 300K-plus combinations, which is how many columns were present in the old pivot. One more change was needed: Spark is more inclined towards creating a dense vector, and that was taking a lot of time in this case, so as an optimization we specifically made it use a sparse vector. Using that, it not only reduced the size but was also faster. That custom logic is embedded in this custom jar, and the result is a vector created at the customer level for product and clickstream.

Thereafter we go and standardize that vector, because product views will have much higher counts than product purchases and add-to-carts. Once the vector is standardized, we join the product vector with the demographic data set at the customer level, and the result is the demographic columns plus the product vector at customer level. Here we again select the columns needed for the creation of vectors: when you create the vector for the demographic data set you can specify which input columns to use, and the result is the demographic vector column. The next part is projecting the columns — there is the demographic vector and there is the product vector — so I project and push forward only the customer data, the demographic vector, and the product vector. After that I go and use a vector assembler to bring the demographic and product vectors together into one column: I specify the input columns as the feature vector and the demographic vector, and the output column as the assembled vector. So this is all the feature engineering, and you can see how easy it is to do.
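A hedged plain-Spark sketch of that feature-engineering chain — a sparse vector instead of a 300K-column pivot, standardization, then assembly — with illustrative sizes, indices, and column names; this is the kind of logic that would sit behind the DataFrame-in, DataFrame-out custom jar stage:

```scala
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.Vectors

// Sparse vector per customer: only the (category, subcategory, brand,
// event, month, ...) combinations that actually occurred carry a value,
// so a 300K-dimension space stays small in memory. Values are made up.
val totalCombinations = 300000
val productVector = Vectors.sparse(
  totalCombinations,
  Array(42, 1337, 250000),   // positions of the combinations this customer hit
  Array(3.0, 1.0, 7.0)       // event counts for those combinations
)

// Standardize so frequent events (views) don't swamp rare ones (purchases).
// withMean = false keeps the vectors sparse, since centering would densify them.
val scaler = new StandardScaler()
  .setInputCol("product_vector")
  .setOutputCol("product_vector_std")
  .setWithMean(false)
  .setWithStd(true)

// Assemble the standardized product vector and the demographic vector
// into the single features column the model will consume.
val assembler = new VectorAssembler()
  .setInputCols(Array("product_vector_std", "demographic_vector"))
  .setOutputCol("features")

// scaler.fit(...).transform(...) and assembler.transform(...) would then be
// applied to the customer-level DataFrame built in the preceding stages.
```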
Thereafter I go and use the logistic regression model stage, where I specify the name of that model. For the training part you can specify the split configuration here, and then the model configuration: where exactly the model artifacts will be saved, which is the features column, and which is the label column for this logistic regression model. Since it uses a train/validation split here, you can even specify the regularization and elastic-net params as an array, and it will consider each and every configuration and do the scoring for you in one shot, instead of one run at a time. After that you can specify the evaluator: once your model is trained, it evaluates the various configurations specified in the param grid, takes the best model out of them in the validator stage, and we use that model for scoring at a later stage. And finally, you can use whatever target you want to store your scored data in. So this is one single pipeline which can be used for training as well as scoring.

The pipeline I am showing here was originally around 5K lines of code, written over six months by seven developers, and this is one pipeline created in one single day on one pipeline page — and it scales to 30 million customers with 200 million records, multiplied by 120 days of data. Once you have created the job, you can schedule it: specify what time you want to start your training, and specify the mode — whether you want to run this pipeline in training mode or scoring mode.

Once your pipeline is complete — in the model phase we specified the evaluators and the validators — you can go and check the artifacts saved there for each model and each version. There were seven iterations of this particular model, and for the seventh iteration I can see the best model that was created, the estimator configuration, the evaluator, and the metadata, plus the sub-models — the models which were not actually picked as the best — which are also saved out there under sub-models. For the visualization part, the model analysis — where it really helps the data scientist — shows you the different charts for the different classes as provided by Spark. Here we have used multi-class metrics, so you can go and see, for each version, the scores generated on that training data set, and it becomes easy for the data scientist to understand what change he made to a given parameter and what the impact on the scores was, using charts that are easy for anybody to understand. Not only does it show the metrics, it also tells you, across many iterations, which parameter got changed and what the impact of that change was — for any of the estimators used in the pipeline model. So here, if you see, I can select different versions and see what changed between them, and, for the corresponding versions, the change in accuracy across the different versions.
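This maps onto standard Spark ML tuning primitives. A minimal sketch, with assumed data and column names, of a train/validation split over a regularization and elastic-net grid that also retains the non-winning sub-models, as the platform does:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("propensity-tuning-sketch").getOrCreate()

// Assumed: a DataFrame with "features" (vector) and "label" columns.
val trainingData = spark.read.parquet("/data/propensity_training/")

val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")

// Every combination of these hyperparameters is evaluated in one shot.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 0.5))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEstimatorParamMaps(grid)
  .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy"))
  .setTrainRatio(0.8)
  .setCollectSubModels(true)   // keep the non-winning sub-models too

val model = tvs.fit(trainingData)
val best  = model.bestModel    // used later for scoring
```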
It also lets you know the application ID that was run for that particular scoring or training, which model version was used, what parameter got changed, and what the impact of that change was on your model. Using this, you proceed in the right direction instead of haphazardly changing parameters — which makes it very difficult to converge, and you can take weeks or even months to arrive at the best model for your use case.

Okay, so the feature engineering optimizations for this particular use case were: we used a vector — specifically a sparse vector — instead of a pivot; we removed the skewness in the data by using a standard scaler on product views; there were unknowns in the demographic data for which buckets had not been created, so we created those buckets, which further improved the score during the training phase; and the custom shiny-aggregation code base was reused, in which you project one record in reverse chronological order for all the different months and then do one single aggregation for all the hierarchies needed, instead of multiple aggregations per hierarchy. The final outcome after this machine learning model was migrated to Ake Streams: we got to train that model in 3 hours instead of 18 hours; all 4 months of data were used; no sampling was required — and that sampling logic was what had been creating the differences in the scores; there was 30% less resource consumption; and the 5K lines of Scala code written by 6 or 7 developers plus a data scientist became one pipeline canvas.

Now coming to the Ake Streams top differentiators. It lets you create batch, continuous, micro-batch, and machine learning pipelines in one canvas. It gives you the operators for data wrangling, data cleansing, feature engineering, data transformation, scheduling, metrics monitoring, and collaboration, so you only need to focus on your logic development using the drag-and-drop interface, with no need to worry about the rest — be it monitoring, failure handling, or the easy way of doing your feature engineering and understanding how parameters impact the scores of your different models. Coming to the kick-starts and reusability part: we have a marketplace for real-time business use cases, with pipelines created for most common cases, which you can just download and reuse for your particular domain and use case. It is platform agnostic: it can be used on cloud, on premise, or hybrid — it is just client code we install, and you can submit to any of the clusters there, so there is no restriction. And you can plug in and reuse your existing code, so migrating to this platform does not mean losing the artifacts that were already running in production. Some of the use cases we have solved include Wi-Fi analytics, consistent PDB and offers across different channels, omni-channel recommendation, and activating real-time behavior with micro-segmentation.

Introducing the Ake Streams team: Sandeep, the CEO of the company, is the creator and VP of this product; Chitral and Ankit are the lead developers; Vishal and Nana Anshi are the developers on the back end; and for the front end we have Kiran and Dhanashtri.
You can check out the product webpage at extremes.io and learn more about this product. I had just 40 minutes, which are already over — okay, any questions? If anybody has any questions, you can get in touch with me, or my team is sitting in the second row here.

Thank you, Puneet — can we have a round of applause for him, please?