Hello all, welcome back to the viewing session. We have Ankit with us, who will be talking about building scalable and robust data science solutions from a developer's perspective. Looking forward to a really interesting talk. Ankit, you can take it on. All right. Hello, everyone. I am Ankit. This is the first time for me attending a PyCon conference. And I'm going to take you through the topic, building scalable and robust data science solutions. And we are going to look at that from a developer's perspective. Actually, that title was a little misleading. This is the actual title. What I'm going to do is actually a lot of ranting. And the reason I'm doing this is, you know, misery loves company. So I would like to hear your opinions and feedback on the same as well. Before we start, a little background about me. Sorry, Indra, but I think that's really relatable. Developers rant. OK. Interesting. So we already have a volunteer who is supporting us. So we are all good. Cool. Yeah. So I work for AB InBev, which is the world's largest beer producer. I have a background in both data science and computer science and engineering. And I have also worked as a web developer for a number of years. OK. Before we move on, to set a little bit of context: in today's scenario, we have a lot of data science solutions being built. And the challenge is to take them to production. This whole path involves a lot of different people with a lot of different skill sets who get involved in the entire process. And because these people are wired differently, a lot of conflict arises. And what we are going to do today is look at what kind of conflicts people are facing in this ecosystem. So there is a very interesting Reddit thread. You can find the link in the presentation, and you can go through it. I'm going to take out three or four of my favorite comments from this particular thread and share them with you.
And then what I'll do is share a couple of my personal experiences in this domain as well. And then we'll try to actually solve the problem and go beyond just ranting. All right, that's the first one. So: the data scientist's poor excuse for a pipeline, the DS pipeline. Run Jupyter Notebook 1, copy the result and paste it into Jupyter Notebook 2 as a variable, and then run Jupyter Notebook 2. So when you go to people and ask them what they are doing, they say that we have completely automated our data science solution, we have fully automated refreshes. And when you actually do a little deep dive, this is what you see. This is neither automated, nor can you call it anything that deserves the word fully. And still people say that we have fully automated refreshes. We have another comment along similar lines. So let's have a look at that. OK: I swear that whenever I ask them to push the code to Git so I can take a look, they literally push a notebook. And this has happened so many times, and it keeps happening to me on a day-to-day basis. You see people write code and push to a Git repo, and what you find is one huge notebook with thousands and thousands of lines of code. I personally have nothing against notebooks. I love them. I have used them as well. But they are good for a very specific purpose. Going beyond that and taking notebooks to production is a nightmare. So we need to understand that when we talk about production, notebooks are not the right way. And one critical thing about notebooks: let's say you have a notebook from a data scientist and you say that this is not giving the results that you mentioned. The response you will usually get is, oh, you cannot run the cells in the order 1, 2, 3, 4. What you have to do is run them in a very specific order, which is like 1, 7, 3, 4, 5, 2. Only then will you get the desired results.
So how do you take something which is built that way to production? How do you solve that particular problem? One more comment from Reddit: my org is very similar, but the ML and the engineering teams are also separate. Also, the ML people are called data scientists, and the data science people are called decision scientists. So it is not at all confusing, again. Why would it be confusing? We have so many different titles for people who more or less do the same things: business analysts, data scientists, insights generators, ML engineers, n number of different titles. So why is it confusing? Let's add like four or five new titles, and then it would be really challenging. And then we can say: decide your title. That's a challenge on its own. All right, final comment from Reddit: fortunately, our scientists' bands are higher than the engineering bands, but we hire mostly PhDs who can all code to some degree, and sometimes code as well as software engineers. Man, I don't know what to comment on this. Because if you find people like this, I would say that you should never let them go. These are not human beings; they are like unicorns. If you find somebody who is as strong in statistics and data science as they are in programming, they can do wonders for your company. So if you do have people like them, please, please keep them in your company forever. Maybe tie them up in your basement or something. All right, sharing some of my personal experiences. Because Reddit is the internet, people do brag. And I know you guys trust me, so I'm going to share some of my own experiences. OK. This particular comment is a very generic one that I keep getting every two weeks or so: we are working on something great, we are working on the next-gen evolution. And you would say, OK, that's amazing. You are doing something that's going to transform the world. That's very good. Show me your code.
And this is what you will see again when you go and look at the code. And it doesn't stop there, right? So you can come back and say, Ankit, maybe. Sorry. OK, I thought that was a noise. So you can come back and tell me that, Ankit, maybe they are doing amazing things inside the Jupyter Notebook. So I decided to take like two minutes and dig a little bit deeper into this particular repo. The next thing I saw, which I cannot provide you a screenshot for, is that they have taken the entire data which they are using for production and committed it as CSV files to the Git repo. And it doesn't stop there. You would think that's rock bottom. No, it is not. If you go and look at the commit history, you will see commits like added product XYZ, removed product XYZ. Initially, I thought maybe they are updating the data, because they have already decided to commit the data to the Git repo, right? But that's not the case. They have 1,000 or 1,500 lines in the Jupyter Notebook, and they have hardcoded all the product names and column names and everything. So every time the assortment changes and somebody adds or removes a product, they have to go back in and update the entire Jupyter Notebook. All right, cool. One more experience that I want to share and rant about, OK? This happened a while back. A few people approached me and said that we want to use this particular open-source library, which is amazing. It does what we want. And we want to customize a few lines of the code, because we want to change a few behaviors of the project. And then we want to use that. And I was like, amazing. That's amazing, right? We are all fans of open source here. Go ahead, do that. And if possible, go contribute back to the upstream library. Please, please do that.
And two weeks later, when I looked at the code, what they had done is they had taken the upstream code, which was a bunch of very well-written classes. They had copied all the classes into a single giant notebook with 500 lines, and they had changed the five lines that they wanted to change. And the obvious question I asked was, why would you even do this, right? I can't even imagine. I never expected this. What is the reason? What is the thought process behind it? And the two reasons that I got: the first is, it works; and the second is, it's easy to do. So how do you come back at a statement like that? And to be honest, it's not a rhetorical question. If you guys have an answer and you know how to deal with situations like this, I will be hanging around in the hallway. Please reach out to me on Zulip or in the hallway and let me know how to handle situations like this. Cool? OK, all right. Enough of ranting. We have been ranting for a while now, and we do want to solve the problem as well, not just show what the problem is. So the problem that we want to solve is: how do we take something which is built for analytics and take that to production? And this is what I am recommending. Obviously, there are many other ways to do it; this is what I think is going to work out well. Let's have a look at it, and maybe I can get some reactions later on Zulip or, again, in the hallway. For this particular conversation, let's bucket whatever you are doing, whatever software you are building, whatever analytical solutions you are building, into three different buckets. The first one, let's call it bucket one, is the setup, preprocessing, and post-processing that you do. So basically everything besides the actual algorithm. You might be reading the data, treating outliers and null values, doing feature engineering.
And after running the algorithm, you might be applying certain business constraints, transforming the data so it can be consumed by REST APIs or whatever your end consumption point is. So all of that comes under this particular bucket. The next bucket is the algorithms themselves. Let's say you are doing more than just importing a random forest from scikit-learn or something; you are designing your own algorithms, which can be a unique approach, which can be something else. That comes under this second bucket. And the third bucket is obviously orchestration: how do you orchestrate the entire process? Cool. Let's talk about each of these buckets. So the first bucket is, again, setup, preprocessing, and post-processing. And this is one of the things which is very close to typical software development and software engineering. So a lot of the principles that apply there also apply here, obviously with minor changes and things like that. The first thing you should do is move away from notebooks. Notebooks, as I mentioned, are good for POCs and good for EDA, but they are not good for production. They are not production friendly. So move away from notebooks, modularize your code, use smaller classes which are easy to understand, easy to debug, easy to trace back. Obviously, extensive unit testing is an important factor. I'm going to talk a little bit about unit testing later, and we had a session yesterday on Hypothesis testing as well. That's also an interesting piece. But no matter what your testing framework is, you need to have extensive tests. Just because it's a data science solution doesn't mean that you get a free pass on not writing unit tests. So you must implement extensive sets of unit tests. You need to abstract your data layer, how you are accessing your data.
So let's say you are using something like ADLS today and then you decide to move to something else, say a data warehouse, or the other way around. Do you have to go in and change the hundreds of lines of SQL queries that you have written in your code? What about the data models? If you rename a table or a column, do you have to go back and touch your code at all? Or have you abstracted that data-model logic somewhere, which you should be doing, by the way? Extensive logging. We also had a session yesterday from S Anand on how to do logging in a nice way and how it can make your application seem faster. So that's also a critical piece, and it also helps you debug your code better. And the final piece is inbound and outbound data checks, which you should be doing anyway, because if your data is not reliable, your solutions and your results will never be reliable. Again, this list wasn't supposed to be an exhaustive one. There are a lot more things to do, and each of these topics could be a session of its own. So I'm not going to go deep into this; we will defer that for a later period of time. What we are actually going to look at is the second bucket, which is how you build the novel algorithms that you are writing as Python packages, and how you take them to production. And the third piece is orchestration. That's more or less very similar to what you have been doing in a software engineering environment. It doesn't really change a lot, with obviously minor conceptual differences. So whatever you were doing before, you can follow a similar process in and around orchestration. So let's talk a little bit more about the second bucket, which is how you take your algorithms to production. And this is assuming you have built an algorithm which is novel, which is not something already present in scikit-learn or a similar API. So this is what you want to achieve.
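As a quick aside, the data-layer abstraction described in bucket one could be sketched roughly like this; all class and table names here are hypothetical, chosen only to illustrate the idea that pipeline code should depend on a small interface rather than on any one storage backend:

```python
# Sketch of a data-layer abstraction: pipeline code talks to a small
# interface, so swapping ADLS for a warehouse means writing a new backend,
# not hunting down queries all over the codebase. Names are hypothetical.
from abc import ABC, abstractmethod


class DataStore(ABC):
    @abstractmethod
    def load_table(self, name):
        """Return the named table as a list of row dicts."""


class InMemoryStore(DataStore):
    # Stand-in for a real backend (ADLS, a warehouse, ...)
    def __init__(self, tables):
        self._tables = tables

    def load_table(self, name):
        return self._tables[name]


def pipeline(store: DataStore):
    # The pipeline never knows which backend it is talking to.
    rows = store.load_table("sales")
    return sum(row["qty"] for row in rows)


store = InMemoryStore({"sales": [{"qty": 2}, {"qty": 3}]})
print(pipeline(store))  # -> 5
```

Renaming a table or moving to a new storage system then only touches the one backend class, not every place that consumes data.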
And this is a simple example taken from the scikit-learn website. A few things that you must notice here: how intuitive, how smooth, and how easy it is to get started with scikit-learn. With this example, it takes literally 30 seconds or less for me to have a working classifier that can do prediction on a data set. Of course, the classifier might not be amazing, but it takes me less than 30 seconds to get started and implement something using scikit-learn. And this is what you want to achieve when you are building a novel package out of your algorithm: similar cleanliness, similar intuitiveness, and similar smoothness for your algorithms. All right, but how do we actually do it? How do we actually achieve what we have been talking about? The way I see it, the actual implementation piece is a minor part of the effort required. The major effort required is to take a lot of design decisions. I have categorized those design decisions into four different pockets, which I'm going to talk about in a little bit of detail in the subsequent slides and also this slide. For each one of them, I think you should spend a little bit of time with your team, understand your use case, and take a decision before you move ahead and actually write the code to build the package. OK, so the way I have categorized those design decisions is based on what or who is interacting with your package. The first view, as I like to call it, is the user view. Basically, how do you design your package or your API so that it's very intuitive and very easy to use for your users? This is what we were talking about in the previous slide as well. It should come naturally to your users. If you look at most data science and machine learning libraries, they have something as simple as clf.fit(X, y). So whenever you are writing your packages, it should be in line with that. You should not be doing something very different.
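To illustrate the kind of API symmetry being described, here is a minimal, hypothetical estimator that follows the scikit-learn fit/predict calling convention; the class itself is a toy, and only the convention mirrors what scikit-learn actually does:

```python
# A toy estimator exposing the familiar scikit-learn-style API:
# clf.fit(X, y) returns self, clf.predict(X) returns predictions.
# The class name and the "algorithm" are made up for illustration.

class MajorityClassifier:
    """Predicts the most frequent label seen during fit()."""

    def fit(self, X, y):
        # Learn the majority label; a real algorithm would do real work here.
        self.majority_ = max(set(y), key=list(y).count)
        return self  # returning self allows clf.fit(X, y).predict(X) chaining

    def predict(self, X):
        return [self.majority_ for _ in X]


clf = MajorityClassifier()
clf.fit([[0], [1], [2]], [0, 1, 1])
print(clf.predict([[5], [6]]))  # -> [1, 1]
```

Because the shape of the API matches what users already know from scikit-learn, there is essentially nothing new to learn before they can use your algorithm.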
Whatever you do, it should come intuitively to your users. It should be very easy for them to understand. You should have sufficient documentation and sufficient examples. Again, I'll go back to scikit-learn: it's so easy to find an example for any API that I want to use from scikit-learn. So you should do that. Your API should be reliable and readable, and you should not break backwards compatibility often. The next set of categories is around the actual implementation of the package. You have to take a lot of design decisions around software engineering principles. What exactly is going to be the architecture of your code? What is the folder structure that you are going to follow? What are the different coding standards you will follow? PEP 8 and PEP 484 are the most common ones, but it doesn't have to stop there. You have to define exactly which coding standards you need to follow for your package's code, OK? Again, unit tests. Unit testing is a very critical piece, and I cannot stress this enough. You need to have extensive unit test coverage, and maybe tests which go beyond unit tests, like property-based testing, which we were discussing yesterday in one of the sessions, or integration testing, or data testing, which we talked about in the DVC session today. It really depends on your situation, but you need to have extensive unit tests. And writing unit tests especially is a little bit challenging in a data science environment, because there is a lot of randomness inherent to most of the algorithms. So you have to decide what you can actually test, how you want to test it, and what you can let go of. The next category of decisions is around the infrastructure: exactly how you want to share your packages. Do you want to make them public or private? If you are going public, PyPI is kind of the industry standard.
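One common way to tame that inherent randomness in tests, sketched below, is to make the random generator injectable and pin its seed in the test; the function and parameter names here are made up for illustration:

```python
# Sketch: making randomness testable by injecting a seeded generator.
# `bootstrap_mean` is a hypothetical stochastic routine; the pattern, not
# the function, is the point.
import random


def bootstrap_mean(data, n_iter=1000, rng=None):
    """Estimate the mean of `data` via bootstrap resampling.

    Accepting an injectable `rng` makes the randomness controllable:
    production code passes nothing, tests pass a seeded generator.
    """
    rng = rng or random.Random()
    means = []
    for _ in range(n_iter):
        sample = [rng.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)


# In a unit test: a pinned seed makes the result fully deterministic...
est_a = bootstrap_mean([1, 2, 3, 4, 5], n_iter=200, rng=random.Random(42))
est_b = bootstrap_mean([1, 2, 3, 4, 5], n_iter=200, rng=random.Random(42))
assert est_a == est_b

# ...and a property-style bound checks behavior without pinning anything:
# the estimate should land near the true mean of 3.
assert 2.0 < est_a < 4.0
```

The two assertions show the two choices the speaker mentions: test exact behavior under a fixed seed, or test a statistical property and let the exact numbers go.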
If you are going private, two solutions that I have explored are Azure Artifacts and JFrog Artifactory. Obviously, there are many other solutions in the market; more or less, they all do the same thing. So pick one and go with it. But you have to make those decisions before you move forward. And the final bucket is the processes, which is actually one of the most important ones as well, because this is how you communicate with your users. How often do you release? Do you do something like semantic versioning, which is major.minor.patch? Do you release on demand, so that whenever a feature is ready you do a release? Or do you follow something like Ubuntu's release cycle, where they release in April and October every year? You have to take those decisions. What kind of license do you want to use for your package? Are you going to make it open source under GPL-3.0 or MIT, or do you want a closed-source license? Do you want people to report bugs back to you? Do you want people to contribute back to you? How exactly do they do that? You have to simplify those processes. And let's say, in a certain scenario, you have to break backwards compatibility: how do you communicate that to your users and explain why you have broken backwards compatibility and what exactly they can do to transition to the newer version without going through a lot of trouble? So these are a lot of the decision points that you need to make. And trust me, this is the most difficult part of building packages. Actually writing the code is much easier, and it can be done in less than an hour. But it's very important that you spend time taking those decisions, so that it's a really good experience for the users who are going to use your package. Now, let's look at how you can actually build a package. So we have taken all the decisions.
You have your algorithm's code ready, you want to build a package out of it, and you want to deploy that. The way to do it is basically to create a setup.py file using setuptools. And I'm going to share all of this code towards the end of the presentation. So you create a setup.py file and provide all this information in it. If you go to, let's say, PyPI and look at scikit-learn, or any other package for that matter, you see a lot of meta information. That meta information is powered by the setup.py file of the respective package. So here you basically tell your users who the author is, what the package does, what the different keywords are, and a whole bunch of things like that. Okay, so once your setup.py is ready, the next step is to authenticate with the server where you are going to host your packages, basically the Artifactory. For this example, what I'm using is a private JFrog Artifactory, but it doesn't really matter. The way you authenticate with the private repository and tell it that you have write access to push packages is by creating a .pypirc file, where you provide the following information. The first keyword, local, is the name that you want to give to the repo; I'm calling it local because conventions are nice. Then you provide a repository URL, and then a username and a password, or you can also give a token. So now your authentication is ready, your code is ready, and you are ready to build the package and deploy. And it's as simple as running one of these two commands. Each builds your package slightly differently, and we can get into the details of this offline. But if you want to distribute the source along with your package, you can use sdist, and if you want to build a wheel package, you can use the first command.
So this command builds the package and also deploys it to a remote called local. And unless you have a lot of files in your package, it's going to take you less than 30 seconds to build and deploy a package with this command. Cool. Now we have a version up on JFrog or PyPI or whatever server we are using, right? How do your users install it? Because, again, this might be a private installation. Typically, people install a pip package by running the command pip install XYZ. Here, they have to provide an extra parameter called --extra-index-url and supply their access credentials, which would be read access with a username or token, and the server URL. So that's the only difference for them. Again, there are other ways so that they don't have to go through that difference, and you can implement those as well, but it's as simple as that: just adding one more parameter, and it gets your package from a private server. Cool. And finally, from your users' perspective, there is no difference between using scikit-learn and this particular package. If you see the example on top, for them it's very simple and very similar, right? They just import the package that they have installed and use it like scikit-learn or any other package. And if you see the deployment instances: we started with the objective of implementing something like scikit-learn, both in terms of API and in terms of achieving versioning and production-ready packaging for our algorithms, and that's what we have achieved. On the left you can see, again, a private repo, and you have different versions of your code, and you can add and remove APIs like you see happening in scikit-learn. And just to conclude: if you remember and go back to the slide that I was sharing, this is only a part of the entire data science solution.
You have to write production-ready code for your pre-processing, post-processing, and orchestration pieces as well, to achieve something that can truly stand the test of time, okay? I think you guys can download the presentation, the materials used, and the code from this particular GitHub link. I am open for questions now. Thank you. Hey, Ankit, that was a really nice talk. And I think that's a really innovative way to share the presentation, the QR code. Thank you. Yeah, so a great talk on building scalable and robust data science solutions. I think we do have a couple of questions. Let me start with the first one. Could you shed some light on sdist versus bdist? Okay, so it's basically the format that you are distributing in. sdist means source distribution. That means you are including your source along with your code, versus a wheel package, where you might not be including your source. Again, there are minor other differences as well, but that's the major one. I see. And there are many more questions; I think people will reach out to you on Zulip. But I'd really like to have a small conversation about how the whole process has been. Generally, when you're building a data science project, you start with the EDA and then build an entire project from there. And you said you generally start with a notebook and then move into Python code, or how is it? Yeah. So whether we like it or not, everything starts with a notebook in the data science world. I don't think we can challenge or change that, so that's going to stay the same. And that's fine, because I have built data science solutions, and the EDA capabilities of Jupyter Notebooks are really good, because you can interactively get responses. So that's what we usually do.
So the POCs are built in Jupyter Notebooks, and once the POCs are validated, once they show some value and produce the business results we intend, then we think about how we can take this into production and move it out of a Jupyter Notebook. Let's see. Okay. And so when you say scaling systems, let's say you're starting with scikit-learn as a module and then building an entire system around it. When you say scaling, are you talking about moving to something like PySpark, or what exactly is scaling? Yeah, so again, it's very specific to the environment that you are working in. For us, scaling comes in because we deal with a lot of data. Usually in the POC phase, we might be working on, let's say, one city, and when you go to production, it might be like 150 or 200 cities. So what we do initially is, again, the same pandas-to-PySpark move, right? The initial implementation is usually pandas, which data scientists also prefer to work with. But when you are dealing with a large amount of data, it depends on the scale, right? So we have to move to something like PySpark, which is more distributed. So would you like to shed some light on common issues that you face while switching from pandas to Spark? I think one of the things that a lot of people I know have come across is how Py4J creates a lot of issues on the PySpark side. Yeah, so I think there are two parts to it. One is the infra part, which is actually much simplified right now, because we are using something called Databricks, if you have explored that. That's basically the whole Hadoop infra as a PaaS, so you don't have to manage the infra. So that part is solved. But the actual move from pandas to PySpark is definitely a difficult thing to do.
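As a toy illustration of the API gap being discussed here, consider a simple per-city aggregation; the pandas version runs below, while one possible PySpark equivalent is shown only as comments (it assumes an existing SparkSession named spark, and the DataFrame and column names are made up for the example):

```python
# Toy example of the pandas-to-PySpark gap: the same per-city aggregation
# in both APIs. Only the pandas part runs here; the PySpark part is shown
# as comments and assumes a SparkSession named `spark` already exists.
import pandas as pd

sales = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi"],
    "qty":  [10, 5, 7],
})

# pandas: one expressive line.
per_city = sales.groupby("city", as_index=False)["qty"].sum()
print(per_city)

# Roughly the same thing in PySpark (not run here):
#   from pyspark.sql import functions as F
#   sdf = spark.createDataFrame(sales)
#   per_city = sdf.groupBy("city").agg(F.sum("qty").alias("qty"))
#   per_city.show()
```

Even in this tiny case the shape of the two APIs differs, and for less trivial operations (window functions, reshaping, UDFs) the distance grows, which is the learning curve being described.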
Especially because there is a steep learning curve in terms of the APIs, which are very different in pandas versus PySpark. That's one of the major challenges we face: the learning curve, Py4J, and some of the technical challenges. I think we have faced those, but not a lot, especially because we are working with Databricks and we can reach out to them for support on anything related to infra. But the biggest challenge is that something might take you two lines to do in pandas, but it takes you like 15 lines and two hours of going through Stack Overflow to write the same thing in PySpark. Right, right. So thanks a lot for bringing in the actual industry view of how things happen, because a lot of people have seen code written in things like scikit-learn in snippets and examples, but when it comes to production and when you need to scale, that is when the real challenges come in. And thanks for shedding light on some of those things, and also thanks for sharing a whole lot of tools that you actually use. I'm sure a whole lot of people from the audience will go back and check out the presentation and learn more about all of these tools. So thanks a lot, Ankit. Thank you, thank you. Happy to be here. Thank you, bye. It was a really insightful session, thank you. So soon we'll have our next speaker coming in. So kindly sit tight. Thank you.