Okay. Thank you for introducing me. I'm Shin Seok-seo, and I'm currently working for Saffron Tech in France. Today I'm going to talk about how we can exploit the benefits of Jupyter Notebook and an integrated development environment like PyCharm or VS Code at the same time, by using a Python package. This is the outline of the talk; I'll go through it briefly. Before we begin, all the materials for today are available here, and you can also find this link in my talk description on the EuroPython site. And before we begin, I just wanted to know: how many of you have some experience with Jupyter Notebook? Could you raise your hand? Oh, very good. And how about an IDE like PyCharm or VS Code? Yes, very nice. So I guess all of you have some experience, which means I can briefly skip this introduction part, since it just covers the advantages of Jupyter Notebook and IDEs, and some disadvantages. I guess you have already felt the lacking features of Jupyter Notebook: it does not properly support debugging your code, code sharing, refactoring, version control, advanced editing, and so on, which we do get from an IDE like VS Code or PyCharm. So here I'm going to show you a simple data science workflow using Jupyter Notebook, and then the way we transform it into a proper Python package so that we can maintain it in an IDE and get the IDE's support as a proper software development tool. The basic idea is this. On the left side, we are using Jupyter Notebook, so we use the REPL loop: read, evaluate, print, loop. When you use Jupyter Notebook, you write code in a cell, evaluate the cell, and see the result right after. Then you might want to repeat this process; that's what we call REPL. Thanks to this feature, we can accelerate our coding for prototyping, visualization, experiments, documentation, and so on.
On the right side, at the same time, we're going to extract the common code of our data science workflow as functions, classes, or modules, and then maintain it in an IDE to get better support for refactoring, unit testing, version control, and debugging. And here in the middle you see an iteration loop, because the two sides, Jupyter Notebook and the IDE, are complementary. On the left side we do prototyping and visualization; once the code matures, we transfer the common code into a Python package and maintain it in the IDE, and the common code can then be used in the Jupyter Notebook again. So this is an iterative process. Let's move on to a hands-on demonstration of a typical data science workflow. Today the main purpose is not to show you how to do data science; I'm just showing you the basic idea of how we use Jupyter Notebook for a data science workflow, and then how we transform it into a Python package. Here you can see that we have four different notebooks that deal with data loading, preprocessing, EDA, and prediction. EDA stands for exploratory data analysis. This is a hands-on talk, so I'm going to move over to my Jupyter Notebook. All the notebooks here you can find in my repository, which is this one, under the notebooks directory. After this talk, if you want to give it a try, you can clone it and run the notebooks. So let's start with the first one. Let me increase the font size. Today I'm going to use the Adult dataset, which is about predicting whether a person's income exceeds 50K a year based on census data; it is also known as the Census Income dataset. The dataset itself is not very important for today, but it contains attributes like age, work class, final weight (I don't know what that means exactly, but that's what it says), education, et cetera.
When I look at the original dataset, which is given as a CSV file, it looks like this. We have age, work class, and so on, and one line of this CSV file contains the information about one person. As you can see, the values are separated by commas. For loading this kind of data, we have our best friend, pandas. First we need to import pandas; then we can use the read_csv function to load the data, and then we can check it. When I look at the data frame loaded with pandas' read_csv, I realize that the column names are not correct. This is because the given CSV file does not contain proper column names at the top, so we need to provide this information explicitly by defining a list of column names and passing it to read_csv. If I do that, now we have proper column names in the data frame. This is the typical workflow when working with Jupyter Notebook: first we try a very basic approach, then we check the result, we find something is wrong, and then we fix our code like this. Okay. The later part is a similar process; I'm just demonstrating it again. I do some essential checks with pandas DataFrame methods like info, to see the data type of each column, or to see some basic statistics of the numerical columns and the categorical or string columns. For instance, in the work class column we have nine unique values, and among them, Private is the most frequent. I wanted to see the nine unique values as well, using the unique method. And here I realize that there is a leading space that prevents the values from being parsed properly. To remove this leading space, our very nice read_csv function provides the argument skipinitialspace; if I set it to True, we get a properly loaded data frame.
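The loading step just described can be sketched as follows. Here `io.StringIO` with a two-row, four-column excerpt stands in for the real census file and its full column list, which are not reproduced in this talk:

```python
import io

import pandas as pd

# Stand-in for the raw census file: no header row, values separated by
# ", " (comma plus space), exactly the situation described above.
raw = io.StringIO(
    "39, State-gov, 77516, <=50K\n"
    "50, Self-emp-not-inc, 83311, <=50K\n"
)

# The file has no header, so we pass the column names explicitly;
# skipinitialspace=True strips the space that follows each comma.
column_names = ["age", "workclass", "fnlwgt", "income"]
df = pd.read_csv(raw, names=column_names, skipinitialspace=True)

print(df["workclass"].unique())  # values come back without leading spaces
```

Without `skipinitialspace=True`, the same call would yield values like `" State-gov"`, which is the subtle bug spotted via `unique()` in the notebook.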
So in this first-look notebook, I demonstrated how to load the data properly by using Jupyter Notebook's REPL workflow. Once that's done, I have some refactoring recommendations. As a beginner, when you work with Jupyter Notebook, your code very easily gets messed up. What I mean is, for instance, this cell here has to be executed before the next one, something like that, and later on, if you want to rerun the notebook, you might lose the proper workflow. So these are my very simple Jupyter Notebook refactoring recommendations. First of all, put import statements first. In this example, for demonstration purposes, I didn't do that, but I would put the import statements at the top of the notebook. Then, as I mentioned, while you're working with Jupyter Notebook the order of your cells can get messed up, so once everything is working and mature enough, I would put all cells in order, try to follow the PEP 8 style guide, and try to split one big notebook into several by some logic. In this example, I didn't put the whole data science workflow into one notebook; I separated it into four different ones by functionality or workflow logic: data loading, preprocessing, EDA, and prediction. And finally, I would extract repeated code into functions. For instance, in this example we're going to use this data loading code many times in the later data science workflow, so I would extract it as a function like this. In doing so, I didn't just copy and paste the original code; I made some improvements. I won't go into detail, but for instance, I extracted the data file location as an argument of the function, so that if an end user wants to load a different dataset with the same format, they can change this argument to the proper one.
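The extracted loading function described above might look like this. The column list is abridged and the default path is an assumption for illustration; the real notebook defines all fifteen Adult-dataset columns:

```python
import pandas as pd

# Abridged column list; the real notebook lists all fifteen columns.
COLUMN_NAMES = ["age", "workclass", "fnlwgt", "education", "income"]


def load_data(file_path="data/adult.data"):
    """Load the census CSV with explicit column names.

    The file location is an argument (with a default), so a user can
    point the same function at a different file in the same format.
    """
    return pd.read_csv(file_path, names=COLUMN_NAMES, skipinitialspace=True)
```

Callers can now write `load_data()` for the default dataset or `load_data("other.csv")` for another file with the same layout.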
If we don't do that, we use the default one, something like this. You get the idea, right? Then let's move on to the second notebook, preprocessing. At the top I import pandas, of course. We have defined a very nice function that works well for loading our target dataset, and to use that function in another notebook, normally you need to copy and paste it. That's what I did here, and here we can see a drawback of Jupyter Notebook: we're going to use this function in the other following notebooks as well, but every time you need to load the data, you need to copy and paste the function. That's the idea of this talk today: we're going to extract this kind of common functionality into a package, and then we don't need to do this anymore. The other drawback is that, for instance, if we want to change a column name here, like fnlwgt, then we need to modify the same function in the other notebooks as well to keep the functionality consistent. If we extract these common things into a package, you don't have to do that; you just need to modify the code in one place. All right? So here I loaded the data, and after loading it, we check that it has been loaded properly, and then I do some preprocessing. The main purpose of this talk is not to show you how to do proper preprocessing, so I'll skip this part. The idea is that the preprocessing code is also something we're going to use later in the real data science workflow, like prediction, so we're going to extract the preprocessing code into a function as well, and transform it into our package later in the second part of this talk. So here I dealt with data loading and preprocessing, and then comes EDA. EDA stands for exploratory data analysis.
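An extracted preprocessing function in the spirit of the notebook might look like this. The actual steps in the talk's notebook are skipped, so the derived `age_group` column and its bin edges are assumptions chosen only to illustrate the "extract repeated preprocessing into one function" idea (an `age_group` column does appear later in the demo):

```python
import pandas as pd


def preprocess(df):
    """Illustrative preprocessing: derive an age_group column from age.

    Works on a copy so the caller's dataframe is left untouched.
    """
    df = df.copy()
    df["age_group"] = pd.cut(
        df["age"],
        bins=[0, 30, 50, 120],          # assumed bin edges
        labels=["young", "middle", "senior"],
    )
    return df
```

Once this lives in a package, every notebook calls the same `preprocess` instead of carrying its own pasted copy, so a change to the logic happens in exactly one place.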
Basically, EDA is about getting a better understanding of the dataset with the help of statistical visualizations. By the way, I gave a tutorial at PyData 2020 about this topic; if you want detailed information, you can go to this repository and have a look. As I told you, EDA is about statistical visualizations, and to do that, we need to import another library, like Matplotlib or Seaborn. Again, to visualize our dataset, we need to load the data and apply the preprocessing that we defined in the first and second notebooks. For that, I'm demonstrating the same thing again: we need to copy and paste the functions into this notebook, then apply the preprocessing as well, and it's ready. Now we can have a proper visualization like this. I won't go into detail, but here, for instance, I'm checking the differences between two groups separated by income, whether it's over 50K or not, and here I'm comparing age by income. These are a histogram, a kernel density estimate, and a box plot. I wrote this code for age, but what if I want to see the same thing for another variable? Then I need to change these three different lines, and I don't want to do that. So I would extract this common code into a function like this: basically, I just substituted age with a variable and extracted it as an argument of the function. From now on, I can just call this function with a variable name. Okay. So this kind of EDA function is also a good candidate for transferring into the package. Finally, we have a better understanding of our dataset, and we have proper data loading and preprocessing functions. Our final target is to make a proper prediction on the dataset. At the top I import pandas, copy and paste the data loading function, load the data, and then apply the preprocessing. Are you familiar with this kind of data science workflow?
How many of you know this kind of thing? Okay, very good. For data science, as you might know, we need to do some feature engineering; here I'm just using a one-hot encoder for the categorical variables. So the X features look like this, and our target is income, whether it is over 50K or not. Then we import our nice friend, scikit-learn's RandomForestClassifier, split the dataset into train and test, instantiate the random forest classifier, train the model, get predictions on the test dataset, and compute the accuracy, which is 80%. That's not bad; it looks okay. But it's not our final model. We might want to improve it: apply other feature engineering, try other models like XGBoost, Support Vector Machines, or CatBoost, or try different cross-validation strategies, hyperparameter tuning, et cetera. And we might not want to follow this process of copying and pasting our data loading and preprocessing functions every time we want to try these different combinations. Fortunately, we can turn our common functions into a proper Python package. Let's go back to my slides. Before we transfer our nice common functions into a package: what is a Python package? The definition is quite simple: it's any directory with an __init__.py file. That's the minimal requirement for a proper Python package. A package can contain modules (Python files) or subpackages (subdirectories with an __init__.py file), so we can say a Python package is a collection of modules. And why do we want to use a Python package? Because it makes it easy to reuse and share code; I think this is the most important concept, at least for this talk. It also makes it easy to install with pip, and easy to specify as a dependency for another package.
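The prediction steps just described (one-hot encode, split, fit a random forest, score) can be sketched end to end. The tiny inline dataframe is a stand-in for the real census data, so the accuracy it produces is meaningless; only the shape of the workflow matters here:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the census dataframe.
df = pd.DataFrame({
    "age": [25, 38, 28, 44, 50, 36, 29, 60] * 5,
    "workclass": ["Private", "State-gov"] * 20,
    "income": ["<=50K", ">50K", "<=50K", ">50K",
               ">50K", "<=50K", "<=50K", ">50K"] * 5,
})

# One-hot encode the categorical feature; the target is income.
X = pd.get_dummies(df[["age", "workclass"]])
y = df["income"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {accuracy:.2f}")
```

Swapping `RandomForestClassifier` for another estimator, or wrapping the split in a cross-validation loop, changes only the last few lines, which is exactly the iteration the talk wants to make cheap.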
It also makes it easy for other users to download it and run its tests, or to read its documentation. To create a minimal package, I would say we just need to follow three steps. First, pick a proper package name, like pandas or numpy. The package name should be all lowercase, and if it contains several words, they should be separated by underscores. In our example I'm going to use adult_data_analysis as the name, which is not so great, but it's an example. That's the first step. Second, create a package root directory, which might contain a README.md file and a pyproject.toml file. By the way, the first talk in this morning's session was about Python package management tools; it was a very nice talk, so if you're interested, have a look — I think the slides are available. And finally, create a package source directory, like src/<package_name>, and under there create an __init__.py file, add Python module files, and put your code in. In terms of source directory layout, there are the src layout and the flat layout; if you're interested, you can have a look at this link. Now let me show you how I created the package following these steps. This is my package directory structure. This is my root directory; normally your root directory has the same name as your package, but in my example, for the purposes of this talk, it has a different name. Here I have a README.md file and a pyproject.toml file, and here I have the src and package-name directories. In there, I created the __init__.py file and the three modules that were transferred from the Jupyter notebooks. For instance, if I go into the data loading module, you can see I copied the load_data function that we defined in the first Jupyter notebook and put it here as a function of this module.
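Putting the three steps together, the resulting src-layout tree looks roughly like this (the repository root name is a placeholder, since the talk's root directory deliberately differs from the package name):

```
my-repo-root/                    # placeholder root; differs from the package name
├── README.md
├── pyproject.toml
└── src/
    └── adult_data_analysis/     # the package: a directory with __init__.py
        ├── __init__.py
        ├── data_loading.py      # load_data from the first notebook
        ├── preprocessing.py     # preprocessing functions from the second
        └── visualization.py     # the parameterised EDA function
```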
In the preprocessing module, I have the two preprocessing functions, and in the visualization module, the visualization function. Then, to make access from the package namespace easier, I imported the functions of those submodules in this __init__.py file. Okay? Now we have defined a Python package, and we want to use it in Jupyter Notebook. For that we have two different options, and I prefer the second one. The first option is adding your local path, like this: import sys, then sys.path.append with your package source directory. Then we're ready to import our package this way. This is quite simple, but as you can see, the path is relative to the notebook's location, and we have to add these two lines of code in each notebook where we want to use the package. Also, this is not the usual way of importing a package. So I prefer the second option, which is installing the package using pip. For that, we need to add a package metadata file. Previously, we used a setup.py file, or the combination of setup.py and setup.cfg, but these days the pyproject.toml file is the standard: starting with PEP 621, the Python community selected this approach as the standard way, so in this talk I'm following it. Once we have defined the package metadata file, we can install our package using pip, just like installing normal packages, and then simply import it. This is an example of a pyproject.toml file. At the top, we define the target build system, which is setuptools in this example; setuptools is kind of the de facto standard for building Python packages, so I didn't want to change this part, but if you want to use another build system, you can do so as you like. And here we define some package metadata, like name, version, and so on. So I added this pyproject.toml file.
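A minimal pyproject.toml along the lines described above might look like this; the version number and dependency list are placeholders, not the talk's exact file:

```toml
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]                       # PEP 621 metadata
name = "adult_data_analysis"
version = "0.1.0"               # placeholder version
dependencies = [                # assumed dependency list
    "pandas",
    "matplotlib",
    "seaborn",
    "scikit-learn",
]
```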
Now we're ready to install our package. The next issue is dynamic updates of the package: once we've installed it, we might want to change the package and use it in the Jupyter notebook on the fly. To make that smooth and dynamic, we need two things. First, we use Jupyter Notebook's autoreload magic command, like this. Second, pip supports two different ways to install a package: a static installation and an editable installation. In our case, for dynamic updates of the package, we're going to use an editable install. Let's check this using the final notebook. First, let me show you why a static installation is not suitable for dynamic updates of our code. Let me install the package in the static way: I run pip install ., installing our adult_data_analysis package with pip into this environment. Now I think I need to restart my kernel. Now, without adding the path explicitly, we can import our package this way, and then we can use the package functions, like loading the data; this load_data function here is provided by the package. Right? Now, suppose for readability I want to rename the column fnlwgt to fnl_wgt, with an underscore. Let's try modifying it on the fly. I just modified this function in our package; let's check whether the modification has been applied. Here you can see that fnlwgt did not change, even though our code has changed. To make it apply dynamically, as I told you in the slides, we need two tricks: use the autoreload magic command, and install the package in editable mode. Okay? So let's now remove our package with pip and reinstall it in editable mode. To do that, we can use --editable, or just the short form -e. Now I'm installing our package in editable mode.
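The two-part recipe just described boils down to a couple of commands; the package name is the one assumed throughout this talk:

```
# In a terminal, from the package root directory:
pip uninstall -y adult_data_analysis    # remove the static install
pip install --editable .                # reinstall in editable mode (or: pip install -e .)

# Then, at the top of the notebook, after restarting the kernel:
%load_ext autoreload
%autoreload 2
```

With an editable install, the installed package points at your working directory instead of a copied snapshot, and `%autoreload 2` re-imports changed modules before each cell execution, which together give the on-the-fly behavior shown next.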
Okay, it's done. After the reinstallation, I need to restart the kernel, and then I execute Jupyter Notebook's autoreload magic command. Now I can import the package. Because we just restarted the kernel, our package modification has been applied. Let's revert it back to the original and see whether that applies on the fly. I've saved it; now if I run this part, our column name has changed dynamically. If you don't believe me, let me do it once again. So now our code modification in the package is applied to our Jupyter notebook directly, dynamically. Okay? Very cool. So the key point of this talk: once your Jupyter notebook is mature enough, extract your code into functions and transfer them into a proper Python package, then use autoreload and install the package in editable mode to exploit the best of both worlds by iterating this loop. I wanted to give you a live example of improving the package, but I don't think we have enough time. Five minutes? Okay. I have two examples, but I'm only going to show you the first one, and if you have any questions, you can come to me after the presentation. So, our data loading function just does the data loading, but in most cases you might want to apply preprocessing while loading your data, so here we're trying to load the data and do the preprocessing at the same time. Okay? I haven't modified the package yet, so I'll go back to my IDE, and here I'll introduce another parameter, preprocessing, with the default value True, and then add an if preprocessing branch. Here you can see that by using the IDE we get much better code completion, and if I make an error like this, it gives me a hint that something is wrong. So with this approach we have full support from the IDE. We can also have automatic code formatting, relying on Black.
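The change made live in the IDE amounts to something like the following. The column list is abridged and the `preprocess` body is a stand-in (the talk's real preprocessing, which among other things produces an age group column, lives in the package's preprocessing module):

```python
import pandas as pd


def preprocess(df):
    # Stand-in for the package's real preprocessing: derive age_group.
    df = df.copy()
    df["age_group"] = pd.cut(
        df["age"], bins=[0, 30, 120], labels=["young", "old"]
    )
    return df


def load_data(file_path, preprocessing=True):
    """Load the census data and, by default, preprocess it in one step.

    Passing preprocessing=False returns the raw loaded dataframe,
    matching the demo's two code paths.
    """
    df = pd.read_csv(file_path, names=["age", "income"], skipinitialspace=True)
    if preprocessing:
        df = preprocess(df)
    return df
```

After saving this in the package, the editable install plus autoreload means the very next notebook cell picks up the new `preprocessing` parameter without a kernel restart.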
If I do something like this and save my file, you see my code formatting is automatically aligned to PEP 8, because I set up my VS Code to run Black every time I save the file. Okay? So now we've introduced the preprocessing argument, and if it's True, we print "preprocessed". If I load the data, "preprocessed" is printed out, and if I set preprocessing to False, there's no more preprocessing. So far we just print a message; now we want to do the real job, so here we import the preprocessing functions. Again, you see we get better support from the IDE for code completion, type hinting, et cetera. So I modified our data loading function: when preprocessing is True, I apply the preprocessing; if not, I just return the loaded data. Let's check it. When I set preprocessing to False, I don't see the age group column, and if I set it to True, we do see the age group. So we can run this process iteratively: I code the package using the IDE, and I check the result using Jupyter Notebook. Nice. Okay, let me finish my talk. As you've seen, we can take full advantage of Jupyter Notebook and the IDE by using a Python package. I think this will boost your productivity, so give it a try. As a next step, you might want to publish your package; please check this link as a starting point. PyScaffold is a good tool for bootstrapping your Python package, and it's very nice, so I recommend you give it a try. Thank you for listening. Thank you for your amazing talk. So now we have time for questions. If someone has a question, please come to the microphone and ask it. Or, if someone comes up with a question later, you can always ask it on Discord, and I believe the speaker will answer. Yes, on Discord or in the open space. So if nobody has any questions, I think we can thank our speaker again. And that's it. Thank you.