Hello, I'm Jimmy. I'm a staff software engineer at Carta, and today I will talk about automatic refactoring in a large codebase. We have a large codebase, and we want to solve two problems in it: code formatting and type checking. These are common problems in Python codebases, but they are especially challenging in a large one.

First, who are we? Carta is a financial services provider. We provide online tools for startup founders to issue stock options to their employees, and when investors put money into a startup, they receive stock from it. All of these different people can use Carta to manage their equity. Beyond that, they can also trade and get other services on top of the platform, including compensation, valuation, tax, and other financial services.

We have a large codebase that has been developed for 10 years. It's Python code managed in Git, with more than 2 million lines and 30,000 files. Every day, about 200 active developers commit changes to it. With such frequent updates, it is hard to land a large change. Say we want to modify 20,000 files at once: it is hard to merge that change, because it will conflict with essentially everyone else's changes.

Let's talk about code formatting. Python allows a pretty flexible code style, so you can write the same code in very different formats. In this example, the red block and the green block are actually the same code in different formats: single quotes versus double quotes for strings, different indentation, wrapping or unwrapping lines, or combining a lot of things on one line. In a large codebase, if every developer uses a different format, the codebase becomes harder to read. We want to apply one consistent format so everyone can read and write quickly. Luckily, there is a popular tool, Black, for exactly that, and we wanted to apply Black to our codebase.

But that runs into the problem we just talked about: we couldn't run Black on the whole codebase at once, because such a big change is hard to merge. What if we take a different approach and split the change into smaller pieces, each one modifying only about 20 files? Then we can incrementally convert the codebase to 100% Black coverage. This approach can work, but it comes with other challenges. It introduces extra work: you need to spend effort creating and managing those small changes, and you need to make sure they don't overlap with each other. And since it's an incremental process, we need an effective way to make sure developers don't introduce regressions along the way; if a developer reformats code into a different style, that's a regression.

We also want to apply type checking to our codebase. In Python it's common to hit type-related errors like AttributeError, TypeError, or ValueError, because Python is not a strongly typed language. One effective way to catch them is to get help from mypy. With the mypy type checker running, it can make suggestions like the one in the blue box and help our developers catch these errors early. But to get mypy working, we need to add type annotations: for each function definition, we need an annotation for each parameter and for the return value. That work looks simple, but in a large codebase it's challenging, because ours has more than 100,000 functions that need type annotations.
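To make that concrete, here is a hypothetical before-and-after (an illustration, not the example from the slides) of what annotating one function involves:

```python
# Before: mypy has essentially nothing to check here.
def get_bar(name):
    return "bar-" + name

# After: the parameter and the return value are annotated, so mypy
# can verify the function body and every call site.
def get_bar(name: str) -> str:
    return "bar-" + name
```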
If we had to manually annotate each one, it would take forever to finish. So that's also challenging.

Our solution is to use automatic refactoring to solve all of these problems and challenges. The idea is: what if we can automate those incremental code changes? If we can do that, we can solve large-scale tech-debt problems in a large codebase.

Let's take a look at what a code change lifecycle looks like; this is actually how our developers do their job every day. They pick some code path where they want to apply a change, make the change, and create a pull request, which kicks off continuous integration tests. If the tests pass, they add reviewers to review the change. If the tests fail, it could be that the change itself is wrong and needs to be fixed, or it could just be flaky tests, and retrying after some time will make them pass. After adding reviewers, if we get enough approvals, we can merge the change; otherwise we may need to add more reviewers, or sometimes notify the reviewers in another channel like Slack. Sometimes we also need to rebase the pull request to pick up the latest changes from the main branch, and when there is a merge conflict, we may need to close the pull request and recreate a new one later.

In this lifecycle, we want to automate everything. The pink boxes are the steps we identified as automatable, and we want to build an automated refactoring framework to automate them. So let's talk about the design and implementation of this framework. There are several design considerations.

The first is the size of each code change. We want each change to be not too big, otherwise it's hard to review, and not too small, otherwise we need to create too many pull requests, which introduces too much overhead. One way to implement this: while walking the file tree of the large codebase, count the number of Python files in each subtree, and only take a subtree as the target path for a code change when its file count is below our defined threshold.

Another consideration is the number of pending pull requests. If we create too many, we introduce too much work for our developers: they still need to do their daily job, and they would end up spending all their time reviewing automated pull requests. If we don't create enough, we won't make progress. One way to implement this is to add a consistent "automated refactoring" label to each created pull request. That way we can use the GitHub command-line tool gh to query all the pending pull requests that carry this label, check the pending count, and only start creating more when it is below the desired threshold.
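A minimal sketch of these two helpers; the label slug, the thresholds, and the function names are my assumptions, while the gh flags are real ones:

```python
import json
import os
import subprocess

LABEL = "automated-refactoring"  # assumed label slug
MAX_FILES_PER_PR = 20            # size threshold from the talk
MAX_PENDING_PRS = 10             # hypothetical pending-PR threshold

def count_pending_prs() -> int:
    """Count open pull requests carrying our label, via the gh CLI."""
    out = subprocess.run(
        ["gh", "pr", "list", "--state", "open",
         "--label", LABEL, "--json", "number"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(json.loads(out))

def next_target_path(root: str) -> str | None:
    """Pick one subtree containing at most MAX_FILES_PER_PR Python files.

    Simplified: a real implementation would also skip paths that are
    already converted or already covered by a pending pull request.
    """
    counts: dict[str, int] = {}
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        n = sum(f.endswith(".py") for f in filenames)
        n += sum(counts.get(os.path.join(dirpath, d), 0) for d in dirnames)
        counts[dirpath] = n
    # Prefer the largest subtree that still fits under the threshold.
    fits = [(n, p) for p, n in counts.items() if 0 < n <= MAX_FILES_PER_PR]
    return max(fits)[1] if fits else None

if count_pending_prs() < MAX_PENDING_PRS:
    print(next_target_path("."))
```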
The next consideration is avoiding duplication across code changes, to prevent conflicts. We can encode the path being modified into the pull request itself; one way to implement this is to encode that information in the git branch name. For example, for Black-formatting changes, we could use a black-formatting prefix and the target directory name as the suffix.

Another consideration is incremental adoption: we want to avoid regressions while we adopt the tooling, so we introduce an enrollment process. The idea is that when a file is Black-formatted, we add its path to an enrollment list, a text file containing a list of paths. Then we set up a continuous integration job that always runs the Black formatting check on all the files in this enrollment list, so no future code change can break the Black format of any enrolled file. With this approach, we eventually add every file path to the list, and at that point we can remove the list, because we have reached 100% coverage. The same approach can be applied to the type annotation adoption process, or to any similar problem.

To simplify the implementation, we can implement each pink box as an independent job. Think of this as a production line: each job moves a pull request from one state to the next, like a state machine transition. Because each job is independent, we can simply run the jobs periodically, and together they move pull requests forward to the end of the lifecycle. What jobs do we need? One job to apply a code change and create a pull request; another to check the test status and add a reviewer if the tests pass; another to check the review status and add more reviewers or send Slack notifications; another to merge the pull request once it's approved; and another to close the pull request if a conflict happens.

We also need to provide an API for the refactoring applications. For example, we can use a Python abstract base class to define the API, and then an application like Black formatting can just focus on implementing that API to provide the needed metadata: the pull request title, the labels we want to add, the body, and the commit message. With those, we can include enough detail when creating the pull request that the reviewer can review the change based on what we provided in it. Then we implement two functions. The first is a skip-path function: since this is an incremental process, we provide a helper for each application to decide whether it wants to work on a path or skip it. For Black formatting, we talked about using an enrollment list, so the helper just checks whether the path is already enrolled in that text file; if it is enrolled, we skip it. The most important logic is in the refactor API, which does two things: given the path we want to change, it runs Black on that path, which formats everything under it, and then it adds the path to our enrollment list. With an API like this, an application only needs to focus on its metadata and its refactoring logic; the remaining work is taken care of by the automatic refactoring framework.
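Here is a minimal sketch of what such an API could look like; the class, attribute, and file names are my assumptions, not Carta's actual interface:

```python
import subprocess
from abc import ABC, abstractmethod
from pathlib import Path

class RefactoringApplication(ABC):
    """Hypothetical framework API; names are illustrative."""

    # Metadata the framework uses when creating the pull request.
    title: str
    labels: list[str]
    body: str
    commit_message: str

    @abstractmethod
    def skip_path(self, path: str) -> bool:
        """Return True if this path should not be refactored."""

    @abstractmethod
    def refactor(self, path: str) -> None:
        """Apply the code change to everything under `path`."""

class BlackFormatting(RefactoringApplication):
    title = "Apply Black formatting"
    labels = ["automated-refactoring"]
    body = "Automated change: run Black on the target directory."
    commit_message = "Apply Black formatting"

    ENROLLMENT_LIST = Path("black_enrolled.txt")  # assumed file name

    def skip_path(self, path: str) -> bool:
        # Skip paths that are already enrolled (already formatted).
        return path in self.ENROLLMENT_LIST.read_text().splitlines()

    def refactor(self, path: str) -> None:
        # Step 1: Black formats the entire path.
        subprocess.run(["black", path], check=True)
        # Step 2: enroll the path so CI keeps it formatted from now on.
        with self.ENROLLMENT_LIST.open("a") as f:
            f.write(path + "\n")
```

The framework would read the metadata attributes when creating commits and pull requests, and call skip_path and refactor while walking candidate paths.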
To implement each periodic job, we can use GitHub Actions workflows: you just need to add a YAML file in the .github/workflows directory of your codebase. With a config like the example on the slide, you can set up a job that runs every 30 minutes during the workday and executes some Python code.

Let's take a look at some example jobs. The first job applies a code change and creates a pull request. We can use PyGithub to create the pull request easily: when you run inside a GitHub workflow, you can get the provided GitHub token and the current repository name to set up the PyGithub library. The main logic walks through each refactoring application, for example by enumerating the subclasses of the base class. For each application, we check the current number of pending pull requests, and only if it is below the desired threshold do we fetch the next path and run the refactor logic. After the code is refactored, we commit the change and create the pull request, using all the metadata provided by the application for both the commit and the pull request. Another example job merges approved pull requests: again we use PyGithub to query for all the approved pull requests, and for each one found we simply call the merge helper.

In the earlier example we used the black command to refactor the code, but you may want to build some custom refactorings yourself. In that case we can use other tools. Let's take type annotation as an example: say we want to add some missing types. For simple functions, we can add them based on simple heuristic inference. For example, functions like an __init__ that contains no return statement should get None as the return annotation. And a function with a single return statement at the end that returns a simple string, like this get_bar function, should get str as the return type.

We can use LibCST to implement a transformer that performs this refactoring. LibCST provides a helper that converts source code into a syntax tree: in our refactor logic we call parse_module to turn the text into a syntax tree whose root node is a Module. Then we run our AddMissingNoneReturn transformer over the tree to add the missing None annotations, and finally we replace the original file with the updated code. The logic lives in AddMissingNoneReturn. A transformer lets us register callback functions to be called during tree traversal, so we define a visit function for function definitions that is called whenever we enter a FunctionDef node. We keep a return counter, initialized to zero and reset to zero each time we enter a function definition, and we increment it whenever we find a return statement. At the end of the function definition's subtree traversal, in the leave function, we check whether the current function definition is missing its return type. If the annotation is missing and the counter is zero, the function returns nothing, so we can safely add None as the return annotation. This is a simple example, and you can build more complex refactorings the same way.
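Here is a runnable sketch of that transformer with LibCST; it follows the logic just described, though a production version would need extra bookkeeping for nested functions, which this sketch ignores:

```python
import libcst as cst

class AddMissingNoneReturn(cst.CSTTransformer):
    """Add `-> None` to functions that contain no return statement."""

    def __init__(self) -> None:
        super().__init__()
        self.return_count = 0

    def visit_FunctionDef(self, node: cst.FunctionDef) -> bool:
        self.return_count = 0  # reset the counter on entering a function
        return True

    def visit_Return(self, node: cst.Return) -> bool:
        self.return_count += 1
        return True

    def leave_FunctionDef(
        self, original_node: cst.FunctionDef, updated_node: cst.FunctionDef
    ) -> cst.FunctionDef:
        # A missing annotation plus zero return statements means the
        # function implicitly returns None, so the annotation is safe.
        if updated_node.returns is None and self.return_count == 0:
            return updated_node.with_changes(
                returns=cst.Annotation(cst.Name("None"))
            )
        return updated_node

source = open("example.py").read()
tree = cst.parse_module(source)               # root node is a Module
updated = tree.visit(AddMissingNoneReturn())
open("example.py", "w").write(updated.code)   # replace the original file
```

And going back to the two jobs described a moment ago, a minimal PyGithub sketch, reusing the hypothetical helpers and base class from the earlier sketches (the commit-and-push step is omitted):

```python
import os
from github import Github

# GITHUB_REPOSITORY is set by GitHub Actions; the token comes from the
# workflow's secrets. The names RefactoringApplication, count_pending_prs,
# next_target_path, and MAX_PENDING_PRS are the assumed ones from the
# earlier sketches, not a real published API.
gh = Github(os.environ["GITHUB_TOKEN"])
repo = gh.get_repo(os.environ["GITHUB_REPOSITORY"])

def create_pull_requests() -> None:
    for app_cls in RefactoringApplication.__subclasses__():
        app = app_cls()
        if count_pending_prs() >= MAX_PENDING_PRS:
            continue
        path = next_target_path(".")
        if path is None or app.skip_path(path):
            continue
        app.refactor(path)
        branch = f"black-formatting/{path}"  # target path encoded in branch
        # ...commit the change and push `branch` here, then:
        repo.create_pull(title=app.title, body=app.body,
                         head=branch, base="main")

def merge_approved() -> None:
    for pr in repo.get_pulls(state="open"):
        if any(r.state == "APPROVED" for r in pr.get_reviews()):
            pr.merge()
```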
For more complex, dynamic types, you may want to get help by collecting runtime types, and we can use the MonkeyType tool for this. MonkeyType is an open-source tool: you run your program under MonkeyType to collect traces of each function call, and then you run the monkeytype command to apply the collected types to your code. So you can just run MonkeyType in the refactor step to apply the change.

We have covered the implementation, so let's talk about how it worked at Carta. For Black formatting, we made the Black tool available in our codebase in 2020, but for a long time not many people used it, because it required manually running Black. Once we started the automatic refactoring work, we were able to quickly ramp the coverage up to 100% in a few months, so now our entire codebase is in Black format. A similar pattern happened with type annotation: our type annotation coverage had been growing slowly since 2019, until we used automatic refactoring to add the missing types and quickly increased the coverage in a few months, and we are still continuing that work. As we work on increasing type coverage to make it more useful, we have seen the number of type errors in our production environment decrease over time. The number fluctuates, since it depends on the users and the traffic to our services, but we do see a decreasing trend.

Here is the summary of the talk. Automatic refactoring is useful if you have a large codebase and a lot of tech debt to pay down, because an automatic refactoring framework saves a lot of manual effort, and it lets you fix tech-debt problems incrementally and continuously. And it's not only applicable to Python: when we implemented our framework, we made the API extensible to other programming languages, and at Carta we are actually also using it for TypeScript.

That's it; thank you for your attention. If you're interested in more of the technical work at Carta, we have an engineering blog on Medium, and we also have a lot of open jobs. For all the tools mentioned in the talk, you can find the links here. That's it for today. Are there any questions?

We do have a couple of minutes for questions. Would you like to come up?

Hi, thanks for the talk. A question from experience: in large monorepositories, you often end up with multiple copies of the same files, modules, et cetera, and each of those copies can then develop a life of its own. During a refactoring phase like this one, what's your approach for identifying duplicates and dealing with duplicates that may have diverged by then?

Yeah, I think duplication is a common problem in large codebases, and based on my experience, I haven't solved this problem yet. I know some people use tools like SonarQube to detect duplication, then analyze the code and try to fix the duplicated code manually, but that is not scalable. Another approach we tried is to develop tooling that lets us mark a piece of code as duplicated and forbid other people from using it, enforced by linters. We developed a tool called Decomposition Toolkit that our developers use to mark the code; then they can track the progress, count the number of references on a dashboard, and eventually the linter will prevent new references. That is my experience.

Hi, Jimmy. Great talk, thanks. It seems like this framework would be pretty useful for anyone running on GitHub. Any plans to open source it? Or are you aware of anything already open source that's in a similar state?

Yeah, that's a good question. As far as I know, no service like this has been open sourced.
In our implementation at Carta, we actually started with CircleCI, not GitHub workflows. I think it would probably be easier to use if it were implemented with GitHub workflows, so that is something I will consider discussing with my team, to open source it. And beyond the workflow implementation itself, we implemented a lot of Python helpers that make building such a framework easier, so that library could potentially be open sourced.

Well, that's all our time, so thank you so much, Jimmy, for your presentation. Thank you.