A very special session. We have the Gold Sponsors of PyCon India with us, EPAM. And we're going to add Anand to the stream now. Hey, Anand, can you hear me?

Hey, hello, everyone.

Nice, you're very audible. Let me check the chat if everything is fine. Everything is fine. Anand, the stage is all yours.

OK. So hello, everyone. I'm Anand, a delivery manager at EPAM Systems. I've been in the IT industry for more than 20 years, working on productionizing data science workloads. I've been building data products and data platforms for different customers, helping them take their data science projects from models to production. So today I want to talk about productionizing data science workloads and some of the tools we use at EPAM to do that.

How does a model go from development to production? If you look at the typical AI product lifecycle, we start with the ideation phase, where we define the problem statement and its scope, define the metrics we will use to evaluate the model, and decide how explainable the model needs to be, so that end users can understand why it behaves the way it does and can make decisions based on it.

Once the problem is well defined, we move to data preparation, where we figure out what datasets are available. Are they clean? Do we need to clean them, wrangle them, or transform them into a format the model can consume? Text data, for example, needs to be vectorized before a model can work with it. We need to do that in a scalable way, so we build ETL and data pipelines, and we treat each curated dataset as a data product.

Once the datasets are available, we do exploratory data analysis: we build the training sets, visualize the data, try to identify patterns, and decide which ML algorithms to run on the data to achieve our goals. Then we train, test, and tune the ML model. Once the model performs satisfactorily in the train and test environments, we build a binary artifact; in Python, that can be a pickle file. That binary is then deployed to the production system. As a good engineering practice, building the binary would typically be automated with continuous integration pipelines, which would also automate the training so that we can repeat it faster.

Once we have built a model, we need to expose it to the end users, which can be done in multiple ways. One way is to expose it as a RESTful web service. Another is to store the predicted values in a database, which other applications then consume for further processing.
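To make that pickle-and-expose step concrete, a minimal sketch, assuming scikit-learn and Flask (the talk doesn't prescribe a specific stack, so the dataset, model, and endpoint here are purely illustrative), could look like this:

```python
import pickle

from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train and serialize the model (in practice this would run in the CI pipeline).
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50).fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Expose the binary as a RESTful service (in practice a separate deployment step).
app = Flask(__name__)
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.json["features"]  # e.g. [5.1, 3.5, 1.4, 0.2]
    prediction = loaded_model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```

The same artifact could equally be consumed by a batch job that writes predictions to a database, which is the second exposure pattern mentioned above.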
Finally, once the deployment is done and we have gone through multiple iterations, we enter the production phase, where we monitor prediction accuracy and the feedback on the model on a continuous basis. We need to keep monitoring the model because the data might change and new patterns might come in. When that happens, we go back through the cycle using our feedback loop: re-prepare the data, re-explore it, re-tune the model, and integrate it with the production system again.

To do all of this, we need a framework. That framework should help us build a strong ingestion strategy, store the data and the models, and provision resources for us, both for storage and for computation. If we are ingesting large volumes of data, we would need to spin up EMR or Hadoop clusters, or maintain Hadoop clusters ourselves, request resources, build them, and carry that through. Similarly, for storage, we need to put the data in HDFS, S3, or something equivalent.

The framework should also help us follow engineering best practices: Git repositories, code versioning, and model versioning. In data science and ML projects, versioning the code alone is not sufficient; we also need to version the data, or at least monitor changes to data patterns, and the framework should help us do that. Finally, the framework should help with deployment and operationalization of the model. In a software engineering project that is typically handled by CI/CD pipelines, but a data science project needs additional capabilities: the usual continuous integration and continuous delivery process is not completely applicable, because we can't rely on a large suite of automated pass-or-fail tests. The way we use the feedback loop is different; instead of a simple pass or fail, we need to look at the details behind the pass or fail of each execution.

So I wanted to talk about two open source projects, frameworks and tools, which we use. One is DLab, which is essentially an exploratory environment for collaborative data science. The other is ODAHU, which started off as a project called Legion for machine learning CI/CD, but has since grown into a universal environment with multiple capabilities built around it.

DLab is a fail-safe, self-service exploratory environment for collaborative data science work. It provides self-service web consoles, and we can build layers of access: in an enterprise environment we want specific access to specific environments and limited access across environments, so we can create isolated sandboxes. DLab helps us build those isolated environments for experimenting and exploratory work. It does this by letting us provision compute engines and notebooks on demand (it is built on top of Jupyter notebooks), along with private data storage using independent data buckets, and so on. We can define different collaboration levels to control how data sources and code repositories are shared, and what level of sharing we allow across the enterprise.
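Coming back to the earlier point about monitoring changes to data patterns rather than truly versioning the data: the talk doesn't name a specific technique, but a minimal sketch of such a check, using pandas with made-up thresholds and file paths, could look like this:

```python
import pandas as pd

def drift_report(baseline: pd.DataFrame, current: pd.DataFrame,
                 threshold: float = 0.2) -> dict:
    """Flag numeric columns whose mean has shifted by more than
    `threshold`, measured relative to the baseline standard deviation."""
    drifted = {}
    for col in baseline.select_dtypes("number").columns:
        base_mean, base_std = baseline[col].mean(), baseline[col].std()
        shift = abs(current[col].mean() - base_mean)
        if base_std:
            score = shift / base_std
        else:
            score = float("inf") if shift else 0.0
        if score > threshold:
            drifted[col] = round(score, 3)
    return drifted

# Hypothetical usage: compare a recent scoring batch against the training snapshot.
# baseline = pd.read_parquet("s3://my-bucket/training_snapshot.parquet")
# current = pd.read_parquet("s3://my-bucket/scoring_latest.parquet")
# print(drift_report(baseline, current))
```

A report like this can be stored alongside each training run, which is essentially the "version the data patterns instead of the data" idea mentioned above.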
ODAHU is the other tool we use very frequently. It helps us modularize entire machine learning projects and products. It helps avoid code rewrites, prevents migration and communication issues, helps scale the whole machine learning project, and helps provision infrastructure. It has a built-in CI/CD pipeline specifically tuned for machine learning, which we can execute on. We use it for both on-prem and cloud solutions, so we can integrate with Spark, SQL, and TensorFlow at the same time. The tool also integrates with cloud services like AWS and GCP; we haven't connected it with Azure yet, but with AWS and GCP we do integrate and use them to spawn the infrastructure required for training models. It also has different layers: you can build on the local machine, build in the training environment, and then have the final execution environment that eventually gets promoted into production.

If you look at the continuous integration available in these tools, or at what we expect from a tool that lets us productionize a machine learning model: we should have exploratory environments we can use with training and test datasets; it should let us use ML toolchains, which are chains of tools run one after another; we should have data storage and integration with data lakes, where we can run data engineering pipelines to take the data and push it into a usable format; and we should have training environments and the ability to version models. I won't say we can truly version the data, but we should be able to identify differences in the shape of the data we receive. There is a Docker image registry inside the tool, which helps us make sure we push the right images to the right environments. Similarly, we can look at the training and test datasets and see how their shapes change over time; that comes out as part of our training reports, and we can version and store those reports.

At the same time, we want compute resources to be spun up and down whenever they are required. The tool's provisioning scripts help us do that, so we use only the compute that is absolutely necessary, which helps us save costs across the board. Once the training is done, we can push the model to the final production environment where the product actually runs. We can scale the services up and down; we need to log and monitor them and make sure we have the right alerts; we can apply A/B testing if we want an A/B traffic split; and finally we get good reports out of it.
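The A/B traffic split is only touched on in passing; as an illustrative sketch (not ODAHU's actual API), a simple weighted split between a production model and a candidate model could look like this:

```python
import random

def route_prediction(features, model_a, model_b, b_share=0.1):
    """Route a single prediction request: roughly `b_share` of the traffic
    goes to candidate model B, the rest to production model A, and the
    response is tagged with the variant so monitoring can compare the two."""
    variant = "B" if random.random() < b_share else "A"
    model = model_b if variant == "B" else model_a
    prediction = model.predict([features])[0]
    return {"variant": variant, "prediction": prediction}

# Hypothetical usage, with any two fitted models that expose .predict():
# result = route_prediction([5.1, 3.5, 1.4, 0.2], current_model, candidate_model)
```

In a real deployment the split would usually live in the serving layer or gateway rather than in application code, but the idea of tagging each prediction with its variant for later comparison is the same.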
In conclusion, for creating and productionizing data science in the real world, we need a comprehensive, collaborative, end-to-end environment that allows different stakeholders and participants to work together on the same data, work closely, and incorporate feedback easily and quickly across the entire data science lifecycle. This helps us build a reliable and repeatable environment that follows good engineering practices. Thank you for your attention, and I'm now open for questions.

Thank you, Anand. Let me just quickly go through the chat and get some questions out of there. OK, so let me see. That's a very big question. I'm going to break it into chunks and put it on the screen so that everybody can see. The first part, I guess, goes like this; if I miss something, please wait for the second half.

Yeah. So the question is about whether composition is better, or metaclasses plus interfaces are better. That's for the previous talk, isn't it? Oh, OK, this looks like it belongs to the previous session.

Yeah, yeah. So I guess the attendees are still having a conversation with the previous speaker. I apologize for that.

No problem.

If you have any questions right now, feel free to share them with Anand on Zulip; you have the stream named 2020-stage-Delhi. Or if you have a quick question, feel free to post it right here on the Hopin chat, and I'm here to pick it up, because we'll have to move quickly to the next session. Anand only had 15 minutes, and we have another 15 minutes for EPAM with Stefan.

So OK, no problem, no problem.

So let's invite Anand to Zulip. Feel free to go on Zulip, ping Anand, put your questions there, and ask for any other resources you need.

Yeah. I will also be available at EPAM's Exceed booth, so you can reach out to me there as well.