Hi, I'm Katrina. I work at SunBasket. Today I'm going to tell you how I used R to drastically increase the reliability of our forecast process.

Slide 2. I'll start by describing our forecasting challenges. Then I'll share how we solved them by re-architecting our forecast pipeline in R. Finally, I'll talk about the impact of these changes.

Slide 4. SunBasket is an e-commerce food company. This is what our customer-facing website looks like. There are different meals on the menu each week, which we create in-house, and our challenge is to predict how many of each item will be ordered each week. We publish a file that gets fed into procurement systems, and buyers purchase thousands of dollars' worth of perishable ingredients based on these numbers.

Slide 5. There's a substantial financial impact to both over- and under-forecasting, and if the forecast process fails entirely, the buyers have no guidance on what to buy or how much. So how do we generate this forecast?

Slide 6. This file used to be produced primarily with spreadsheets. Information from different teams across the company was combined with a lot of formulas and spreadsheet linking. This diagram represents 100 to 200 Google Sheets that are all inputs to the process. That's literally thousands of cells, and a mistaken edit in any one of them could cause a breakage. Also, formulas constantly needed to be manually maintained and extended to cover more future dates.

Slide 8. Our first step to get this situation under control was to start moving the logic into R and SQL. In this version of the process, we had a handful of scripts that ran in sequence. Each script would read from Google Sheets or a database table, do some basic manipulation, and then write back to another Google Sheet or database table. The nice part was that everything we put into code was now under version control.
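To make the sequential-script pattern concrete, here is a minimal sketch of what one such stage might have looked like. All names, columns, and the CSV-file stand-ins are hypothetical; in production the reads and writes would target Google Sheets or database tables (for example via the googlesheets4 or DBI packages).

```r
# One stage of the sequential pipeline: read an input, apply some
# basic manipulation, and write the result for the next stage to pick up.
# In production the read/write would hit a Google Sheet or database table;
# here we use local CSV files so the sketch is self-contained.

run_stage <- function(in_path, out_path) {
  orders <- read.csv(in_path)  # e.g. googlesheets4::read_sheet(sheet_id)
  # Basic manipulation: total ordered quantity per menu item
  demand <- aggregate(qty ~ item, data = orders, FUN = sum)
  # Writing back overwrites the intermediate in place
  write.csv(demand, out_path, row.names = FALSE)
  invisible(demand)
}

# Tiny demonstration with made-up data
in_path  <- tempfile(fileext = ".csv")
out_path <- tempfile(fileext = ".csv")
write.csv(data.frame(item = c("paella", "paella", "ramen"),
                     qty  = c(10, 5, 7)),
          in_path, row.names = FALSE)
demand <- run_stage(in_path, out_path)
print(demand)
```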
However, although this method cut down on some Google Sheets usage, it still relied on quite a few sheet formulas to execute business logic. Also, when we ran these scripts, the intermediates were overwritten, and there was no way to test changes locally without potentially impacting the production job. We also had no way to definitively troubleshoot: if a job failed and then later succeeded, there was no way to know the state of the system during the failure.

Slide 9. To solve these issues, I created a system where each forecast goes through a standardized set of steps. Each step has predefined inputs and outputs, and every output gets automatically checked at the subsequent stage.

Slide 10. What does this look like in practice? We have one large YAML file with most of the forecast configuration. For each product line, there's a separate forecast and a separate entry in this file. Here I'm showing you one example entry. On the left, we've defined which dates we need to forecast and where we need to send the output. Under inputs, we've defined what we need for the first two stages, input snapshot and input check: exactly where to find the data, who is responsible for it, and checks defined as rules. Here we're using the syntax defined by the validate package. These checks are especially important for human-provided inputs. We use checks like these on the output of each stage, not just the inputs.

Slide 11. In addition to the YAML, there are a few stages in each pipeline where we need to execute arbitrary business logic. For these, we encapsulate the R code in a drake plan. Each of these plans starts by pulling input data from snapshots. This is the one and only place where we can get input data; we do not execute arbitrary SQL queries or API calls during the plan. Therefore, given the input data, the result is deterministic.
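The staged structure above can be sketched in base R. Everything here is hypothetical (function names, columns, the flat multiplier); the real pipeline expresses the steps as a drake plan, which additionally caches targets and tracks dependencies, and uses the validate package for the rule checks rather than this hand-rolled loop.

```r
# A sketch of one forecast plan's three parts:
#   1. pull checked input data from a snapshot (the only data entry point),
#   2. run the deterministic business logic,
#   3. store the output consistently so it can be checked downstream.

# 1. Snapshot reader: in production this loads data that was already
#    snapshotted and rule-checked; here it just returns made-up data.
read_snapshot <- function(name) {
  data.frame(item = c("paella", "ramen"), trailing_avg = c(12, 8))
}

# Rule check in the spirit of the validate package: every rule must
# hold for every row, otherwise we stop before running the logic.
check_rules <- function(df, rules) {
  for (rule in rules) {
    ok <- eval(rule, df)
    if (!all(ok)) stop("Rule failed: ", deparse(rule))
  }
  df
}

# 2. Business logic: deterministic given its inputs (no SQL, no API calls).
predict_demand <- function(inputs) {
  transform(inputs, forecast = round(trailing_avg * 1.1))
}

# 3. Consistent-storage helper (a real one would also snapshot the output
#    so the subsequent stage can check it).
store_output <- function(df, path) {
  write.csv(df, path, row.names = FALSE)
  df
}

inputs <- check_rules(read_snapshot("orders"),
                      rules = list(quote(trailing_avg >= 0),
                                   quote(!is.na(item))))
out <- store_output(predict_demand(inputs), tempfile(fileext = ".csv"))
print(out)
```

In a real drake plan, each of these steps would be a target, so unchanged upstream targets are not recomputed on a rerun.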
And the input data we get from this function has already been checked and stored, in case we need to troubleshoot later. In the middle, we put the actual business logic. At the end of each plan, we store the output, either features or predictions. We use helper functions to ensure outputs are stored consistently so they can also be checked.

Slide 13. The number one benefit of this system is reliability. Our daily forecast flow is no longer interrupted by Google Sheets formula issues, and if there's any issue with a manual input, we know exactly which input it was. We've also sped up development: since the majority of the business logic is now in R, rerunning is relatively quick and doesn't risk interrupting the production process. Finally, with snapshots, we no longer have mysteries about the past state of the system; we know exactly what code and what input data went into each run of the pipeline. Ultimately, with all this freed-up time, data scientists can focus more on building models instead of troubleshooting process issues. Thanks for joining.