Hi everyone! Today we're going to talk about how, here at 4intelligence, we put complex R forecast applications into production using Argo Workflows.

To give a little bit of background, we started as an economics consultancy that focused mainly on time series data, and we are now developing, and already have a first version of, what we call forecast as a service. The idea is that people with little or no background in statistics or economics can run very sophisticated analyses, ranging from traditional statistical algorithms up to AutoML functions, with just a few clicks. We also focus on having very good front-end applications where the person can easily access all of the results and get a good understanding of what's going on.

As we started to shift from a consultancy focus to a technology company, we faced a lot of challenges. The first one was that we were used to doing all of our development in R: if you look at the background of our data scientists, they mostly come from statistics, economics, and engineering, and they use R on a daily basis. So we needed to bring R to production and make use of the team know-how we already had, and we wanted to do this while optimizing our costs. Along the same lines, we wanted simpler and faster deploys, because in the beginning a deploy could take hours, maybe on a Friday evening, and we knew that was something we needed to fix. One thing we were not willing to let go of, though, was the reliability of our results: we really focus on having accurate forecasts while bringing interpretability of the results to our clients.

But how did we get there? Our data scientists, working on different projects, would see the need for a new approach or methodology and implement it in R, writing functions and putting them inside our pipeline. The idea was that once it was in the pipeline, other teammates could help you debug it and also use it in their own projects. The problem we sometimes faced was that there was no version control: maybe I was working on a version I had already debugged while a teammate was working on a previous one. We started creating super repos where we put everything, with maybe a version one, two, three, but there was still no actual version control. Once we started addressing that, things got a lot better for us.

Another big issue was that we were running the code locally, and with all of the time series we had, maybe a thousand of them with very exhaustive cross validation, it would take a lot of time, and once you found a bug you had to start all over. We saw the need to go to the cloud; it was something we couldn't postpone anymore. Once we lifted and shifted our code to the cloud there were a lot of gains: the productivity of our data scientists grew a lot, and we were much more confident about the reliability of the results, because we knew everything went through our entire pipeline, from preprocessing, feature selection, and modeling through post-modeling, and that all of our analyses were being done properly. And we did all of that using R. If you look at the way we used to do things, we used R for the API, orchestration, output, and communication with Google Cloud; even our front end was developed in R.
We also had a monolithic structure. Say I was running 97 time series, which is a very classic case for us: I would send the request via the API and we would run things in batch, so all 97 jobs would go to one virtual machine, and we had four machines at the time. We had a big problem with queues. Still, going from running locally to running in the cloud was a great improvement; we just started having new problems, because we were getting too big for this structure running in R, and that's when we started implementing the new infrastructure using Argo and Kubernetes.

As the queues started to grow and the jobs took longer and longer to run, we started looking for new solutions. As our focus shifted, scalability became a bottleneck: we needed to be able to scale what we were doing so we could grow. We wanted our code to be generic and accurate, so we want to run as many different kinds of time series as possible and have great forecasts, but we want to do it fast. How do we make that work in an efficient cloud environment? Should we keep using R? Should we change both the infrastructure and the algorithms? That's what we had to decide at that point.

So we looked at the pros and cons of keeping R. The advantages were, first, that we had a legacy: all of our code was written in R and we were very comfortable with it, and in terms of team know-how, everyone from developers to users knew R very fluently. There's also the fact that R is widely used in academia; we focus on adopting new methods quickly, and most new statistical implementations become available in R very fast, so we could have new approaches and new methodologies within a few clicks. The drawbacks are that there's very limited native support for cloud environments, and once we started needing help from developers outside of this academic world, they were not familiar with R; they used Python or other languages, so we had to make sure we were all on the same page.

We got to the point where we mapped our solution needs: scalability, so we can run lots of time series at the same time, including long ones, and have results ready within a few minutes; resilience, with some retry strategy; decoupling; dockerization; cost efficiency; monitoring; and a microservice-friendly design.

Next we're going to talk a little bit about the solutions shown on the previous slide, starting with the decoder step. Once a job is sent via the API it goes to the decoder, which divides the job across different VMs; it parallelizes the work and scales the machines up on demand. At this point you can see that we kept R for all of the modeling, but we are introducing Python here: the decoder is written in Python and dispatches the work to the different virtual machines.

What we also did at this point was fit our solution into small images: we took that one big image and divided it so we could be faster and lighter. It's good to mention that we got a lot of help from the r-minimal community, who helped us build a specific version that we needed. Once we started working with the images we faced a very well-known challenge: conflicting package dependencies. Maybe one package needed a specific version of a dependency and another package needed a different one. With those conflicts we realized it was best to build our own image, and with these images we decided to use as few external packages as possible, to avoid conflicts, keep our image lighter, and give us more flexibility inside the implementations. This lighter image helped us decrease our build time from three hours to five minutes, which was a great plus for us.
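To make the decoder step described above a bit more concrete, here is a minimal sketch of the fan-out idea, assuming (as the architecture described later suggests) that per-series jobs travel over Google Cloud Pub/Sub; the project, topic, and payload field names are hypothetical and not 4intelligence's actual implementation.

```python
# Minimal fan-out sketch: split one batch request (e.g. 97 series) into one
# Pub/Sub message per series, so downstream workers can scale independently.
# Project, topic, and payload field names are hypothetical.
from concurrent import futures
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "forecast-jobs")


def decode(batch_request: dict) -> int:
    """Publish one job per time series and return how many jobs were sent."""
    publish_futures = []
    for series in batch_request["series"]:
        job = {
            "request_id": batch_request["request_id"],
            "series_id": series["id"],
            "observations": series["observations"],
            "config": batch_request.get("config", {}),
        }
        publish_futures.append(
            publisher.publish(topic_path, json.dumps(job).encode("utf-8"))
        )
    # Block until Pub/Sub has accepted every per-series job.
    futures.wait(publish_futures, return_when=futures.ALL_COMPLETED)
    return len(publish_futures)
```

The point of splitting the batch this way is that each per-series job can then be picked up by whichever machine has capacity, which is what allows machines to be scaled up on demand instead of sending the whole batch to a single VM.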
Now Pedro is going to continue the presentation, talking about our architecture solution. Thanks, Natalia. So, after we optimized our image, we had to decide what the new architecture would look like. Coming from the previously mentioned architecture, we decided the easier approach was to turn it into a Cloud Run application, so we wrapped the whole R pipeline in a Python API and plugged it in after the decoder. What that did for us was greatly increase our performance: all the jobs and the whole pipeline ran much faster. But we had problems with cost control, so we decided to change focus and look for another solution.

That's where we landed on Kubernetes, which is a much more modern approach to this kind of pipeline, and for that we decided to use Argo Workflows. What Argo did for us was enable a microservice-like structure with a scalable and more cost-controlled architecture, all running in an isolated ecosystem. We can run Python processes and R processes within Argo, and Argo deals with all the communication with the Google Cloud platform.

Regarding costs, Argo allows us to customize our clusters, pods, and nodes so we know how much we will spend in certain situations; we can pre-define all memory usage and how many machines we're running. Regarding resilience, it allowed us to implement retry and failure strategies as well as disaster recovery tools, and the application became stateless and much more fault tolerant; all of this combined formed a much more bulletproof architecture. Regarding tracking, I think one of our biggest improvements was that we now have completely decoupled log files, so when an error occurs we know exactly where it is, whether it's an infrastructure error or a pipeline error, and that lets us call the right people to solve the right problems. We spend a lot less time debugging and figuring out what exactly is going wrong. We also implemented real-time monitoring with the Argo Workflows dashboards plus Prometheus and Grafana.

One of the biggest accomplishments Argo brings us is that it lets us move much further towards a microservice approach and plug in more and more applications, which makes the whole pipeline much less language dependent. For example, we started everything in the pipeline with R; now we have machine learning models in Python and deep learning models in Python, and all of that works well with Argo to build a pipeline that brings the best of both worlds: the best R can provide and the best Python can provide, with the latest implementations and the newest solutions. It also allows us to run multiple processes with continuous deployment and continuous integration, with much more reliability.
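The Cloud Run step mentioned above wrapped the R pipeline in a Python API, and the same pattern, an R step sitting behind a Python entrypoint inside a container, is one way R and Python processes can coexist in a pipeline like this. Here is a minimal sketch, assuming a hypothetical Flask app and an illustrative Rscript entrypoint named run_pipeline.R; neither is the actual implementation.

```python
# Minimal sketch of a Python API wrapping an R pipeline (names are illustrative).
import json
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/forecast", methods=["POST"])
def forecast():
    job = request.get_json()  # e.g. {"series_id": "...", "horizon": 12, ...}

    # Hand the job to the R pipeline; run_pipeline.R is a hypothetical entrypoint
    # that reads a JSON payload from stdin and writes its result as JSON to stdout.
    proc = subprocess.run(
        ["Rscript", "run_pipeline.R"],
        input=json.dumps(job).encode("utf-8"),
        capture_output=True,
    )
    if proc.returncode != 0:
        # Surface R/pipeline errors separately from infrastructure errors.
        return jsonify({"status": "error", "stderr": proc.stderr.decode()}), 500

    return jsonify({"status": "ok", "result": json.loads(proc.stdout)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

On Cloud Run this kind of wrapper runs as a single container; under Argo, each step can instead run in its own container, which is part of what makes the pipeline less language dependent.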
Now, moving on to how our architecture actually looks, this is how it works. Our user can be either an external user or an internal client, such as our data science teams. They can send their jobs directly to our forecast as a service, or they can use the curated time series we provide to both types of clients to further improve their models. The job goes through an API that triggers a Pub/Sub event, which then starts the whole process. The client can send hundreds and hundreds of series at once; they enter the decoder, which breaks them apart and sends them into the modeling pipeline, and all of those processes use Argo Events to trigger the Argo Workflows. All the modeling runs, the final result is output, and the events are triggered that send an email to our client and let our front end, for example, know that the job is finished, so the person can consume the results either in their own IDE or using our front end for post-processing.

With the architecture defined, we saw the need to set parameters that let it scale in an efficient way. First of all, we use the Argo config map to limit the rate at which pods are created, to avoid overloading the Kubernetes API. We also limited the maximum number of incomplete workflows and the maximum total parallel workflows that can execute at the same time, to avoid using excessive resources, and to mitigate IP exhaustion we limited the maximum number of pods per node.

Now, with a dedicated MLOps team, we have automated tests, builds, and deploys using a plethora of applications, which allowed us to finally reach our GitOps CI/CD, which in turn also led to a much more stable level of cost optimization. We also designed vulnerability tests for both containers and package dependencies. We are now running our entire application in private clusters with no external access at all, and all our processes run asynchronously, so we run as many different series as memory allows on a given virtual machine and are no longer locked to one job per virtual machine. To wrap it up, we are also now running on a multi-region private GKE cluster, which brings even more resiliency.

So, to sum up, we went from an application built entirely in R in the cloud, with third-party packages handling the communication, to a completely cloud-native solution that is independent of which code we use in our processes, and now we can plug in different applications and work without being bound to whichever technology is trending at the moment. Thank you very much for watching. Thank you so much, and I hope you join us at the Q&A.
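As a closing illustration of the final step in the flow described above, the events that tell the client and the front end that a job has finished, here is a minimal sketch of a Pub/Sub subscriber reacting to a hypothetical "forecast finished" message. The subscription and payload names are illustrative, and the actual email and front-end notification logic is only stubbed out.

```python
# Minimal sketch of reacting to "job finished" events (names are illustrative).
import json

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    "my-gcp-project", "forecast-finished-sub"
)


def callback(message) -> None:
    event = json.loads(message.data.decode("utf-8"))
    # Placeholder for the real notification logic: e-mail the client and
    # let the front end know the results are ready for post-processing.
    print(f"Job {event['request_id']} finished, notifying client and front end")
    message.ack()


streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    streaming_pull_future.result()  # Block and process events until interrupted.
```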