Hi, thanks everybody. I'm here to talk to you about adapting Kubernetes for machine learning workflows. A quick note about the team here: Keith and I are both senior software engineers at Bloomberg, with a combined 15 years of tenure there. On a daily basis we work on the same internal machine learning platform, but we sit in two different parts of the engineering department, which encourages a healthy balance between the mindset of the compute engineers and that of the AI researchers, or at least it's supposed to. And a shameless plug: if you're interested in learning more about what we do in data science, you can take a look at our blog for things like our publications and our contributions to open source. Cool, so our outline for today. First, we'll take a look at our ecosystem at Bloomberg and the kinds of data and applications we have available to us. Then I'll turn it over to Keith, who will tell you more about how we've adapted Kubernetes to solve the challenges that machine learning and these kinds of applications present. Right, so let's talk a little about machine learning at Bloomberg and the kinds of data we have to work with. In our clients' daily workflows, messaging and communication tools are a key part of the spread of information within the industry. You can really see this in the volume of messaging and communication data, which has grown over the past 10 years or so. In fact, this volume of information has become so untenable that every part of the trading workflow, from portfolio selection to risk analysis to strategy development and execution, has been automated somewhere, by someone, using machine learning within the past two or three years. 
So the markets are getting much more efficient, and with this automation the time it takes to price in new information is getting shorter. One kind of automation you might think about is reacting to events. If you assume that events move financial instruments, then your questions become: how do I figure out which events move which financial instruments, and how do I detect the occurrence of such events of interest? Here we have a timeline of an SEC announcement from around 2010. The SEC published it on their website, which was really unusual at the time, and people weren't really watching it; I think they watch it a little more closely today. Suffice it to say it was unusual, and the price reaction you see here was really precipitated by the Bloomberg headline, which did nothing more than re-announce that the SEC had filed a lawsuit. The New York Times is in this picture only to show that commodity news, the news that most people see, rarely impacts stock prices, because it's typically minutes, hours, or days late. The next slide shows a slightly more modern take on this. Social media was a wonderful cesspool of rumors, and of potential for gaining a big advantage in the market. Someone tweeted from the AP's account that Barack Obama had been injured, and the markets reacted very quickly. And if you're having a little deja vu over these past couple of slides, maybe it's something to do with Elon Musk over the past couple of weeks. So for Bloomberg, a financial news company at the center of the industry, our entire business is contingent on the ability to create applications like these that are scalable, smart, focused, and correct for our customers. 
I've hinted at a little of this so far, but for our users this really means taking advantage of automation wherever possible, and here are a few different examples: we have the news applications, the ability to extract events happening in the industry, increasing the relevancy of results, and so on. Cool, so clearly there's an enormous scope for machine learning and a wealth of data at Bloomberg, but let's take stock and dive into the requirements for supporting the life cycle of a machine learning model, or for developing machine-learning-based software in general. Developing machine-learning-based systems is, as I see it, quite challenging, at Bloomberg or anywhere else in a large enterprise, and these challenges are really driven by the different priorities of different stakeholders. First and foremost, for our customers we value privacy, uptime, and stability for the service we provide via the Bloomberg Professional terminal. There's very little room for error, and very little room for A/B testing either. From an engineering standpoint, we'd like to support these customers as well as possible, which means having the ability to scale seamlessly; the less operational burden and the less room for manual mistakes, the better; and the ability to work within a team and share ideas is really valuable as well. For the ML practitioners who sit within the engineering department, all of this is only exacerbated, and it's really crucial, because the less time we spend building infrastructure, the more time we can spend looking at the latest research or evaluating ideas we've seen at conferences. 
So let's say a product manager comes up to you with a new feature for one of these applications, or with a new idea for an application altogether. If you're tasked with developing one of these solutions, these are the kinds of problems you'd really like to be focusing your time on: What does your data look like? What kinds of features can you glean from it? What kinds of algorithms are best suited to this problem? What parameters can I tweak? And once I've deployed an application, when is it appropriate to update my model so that it still accurately reflects the kind of data I trained it on during development? You might notice an implicit assumption here that all the infrastructure you need is readily available for you to use. That's not always the case, and some of these questions are much more difficult to answer than others, so we've had some stumbling blocks along the way. In a large enough company, data ownership and access can become quite complicated. Hopefully you've mastered some transferable skills by using open source toolkits such as scikit-learn, TensorFlow, or Jupyter, but depending on the maturity of your development environment they're not always the easiest to install and make operational. And compute power is not unlimited, so sometimes you have to schedule these workflows sanely and not annoy your neighbors too much. From an infrastructure perspective, these are the questions you'd really like to avoid having to answer, but too often they're the things that take far too much of your time, and ultimately the efficiency of your workflows and the accuracy of your models really depend on the infrastructure and tools you have available. 
So once you settle, and occasionally compromise, on a workflow for the problem you're working on, you enter what I refer to as the machine learning model development lifecycle, and it looks a little like this: a big infinite loop. On the left-hand side we have what we refer to as the training process, the offline procedure, and on the right-hand side we have our online process, which is actually serving predictions. On the left, we define our business problem, we gather a bunch of data, we run training processes over a model, maybe some neural networks, maybe some scikit-learn models, we evaluate it, and we go around that loop until we exhaust our resources, or at some point we hit the limits of our model's capacity or hit a deadline and have to go publish something for production. When it's good enough, you move to the right-hand side: you publish something, you release it in some constrained manner, and you monitor it until it's time to go back to the left-hand side to iterate on new ideas or re-evaluate the task at hand. So the question is: how do we enable researchers to focus on those "why" questions from earlier, and not on the "how" questions that take far too much time? I'll leave you with some ideas here, and this is the part where I kick it over to Keith to tell you how we've adapted Kubernetes to achieve this kind of nirvana for these problems. Thanks, Anja. Okay, so I'm going to talk a little about Kubernetes from a high level, and about why we thought it was a really good fit for the machine learning problems that Anja brought up here. 
Just to give you some background: we started this project two and a half or three years ago, before it was really common in the community to run machine learning on Kubernetes the way you see a bunch of ideas in the community converging right now, so it's been cool to follow along and see how everyone is interpreting this kind of solution. Let's start with the obligatory "what is Kubernetes" quote from the website: "Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications." Maybe you've seen this before, maybe you're familiar with it. If you're not, what I want you to take away is that Kubernetes is really good at letting application developers work on their applications and then deploy their software without having to think about infrastructure, by putting their application into a container, sort of a zipped application, which they can ship into the cloud. From a really high level, your usage of Kubernetes might look something like this. You have a user, we'll call her Katie, and she wants to submit some API resource to Kubernetes. She does this using kubectl, the command-line interface, and the request goes to the API server. In the background there's a thing called a controller, which watches for the creation or update of these resource types and kicks off this life cycle. The controller basically asks: what is the new desired state of the world? It compares this to the current state of the world and reconciles the two by sending a list of updates back to the API server. And in the real world there are many controllers, each working on different types of resources in the Kubernetes ecosystem. 
Eventually a pod gets created in Kubernetes. The pod is the lowest-level thing that can get created; it runs on a server and contains the container that should be running, and the API server schedules this pod onto a node, a server, for you. So let's look really quickly at a core Kubernetes type in YAML format. This is a Deployment, for running your application in Kubernetes as an always-up kind of deployment. First we'll look at the type meta, which just defines the API version and the kind of the YAML we're looking at. Next is the specification, which holds the properties of the deployment. Here we're saying we want a replication factor of three, and what we want replicated is this pod spec template, some version of nginx, which will be replicated three times. What's really happening is that when we create this YAML, a controller in the background reacts to its creation and says, okay, I see there should be three pods, then goes and creates three pods and makes sure they're all there. Inherently, YAML is declarative, reproducible, and human-readable, which is great, and this approach gives us a bunch of the benefits of Kubernetes. Hardware indifference: the application developer doesn't need to think about which server it runs on; the API server just schedules it for you. You get free monitoring of CPU and memory metrics from Docker and Kubernetes, and if you add other open source software like Prometheus you can get application metrics too. You get replication, so guaranteed uptime from that replication number; with a minimal amount of work you can add autoscaling; and through other open source projects like Istio you can add a service mesh for service-to-service security. 
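The Deployment being walked through here might be sketched like this; the name and image tag are illustrative, but the shape matches the slide, a replication factor of three over an nginx pod template:

```yaml
apiVersion: apps/v1
kind: Deployment              # type meta: the API version and kind of this object
metadata:
  name: nginx-deployment
spec:
  replicas: 3                 # replication factor of three
  selector:
    matchLabels:
      app: nginx
  template:                   # the pod spec template that gets replicated
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2   # "some version of nginx"
```

Creating this with `kubectl apply -f deployment.yaml` hands it to the API server, and the deployment controller reconciles toward three running nginx pods.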
Right, so I haven't really talked about machine learning at all, just about some benefits of Kubernetes. Wouldn't it be really cool if we could marry the two together and make something that speaks ML but also takes advantage of all the good of Kubernetes? We've done this with something called custom resources, which is built into Kubernetes; earlier these were called third-party resources, so maybe you've heard of them under that name. When I say "resource" I basically mean what I've been talking about: deployments, pods, these core API objects in Kubernetes. A custom resource is just an extension of that, a user-defined resource type, usually with a corresponding controller that does all the magic in the background. So let's look at one of the types we've created at Bloomberg for TensorFlow, which has been a really popular toolkit for ML practitioners. First is our type meta, and you can see that we're now in the Bloomberg DS API group, because this is no longer a Kubernetes-native object, though it works in much the same way as any other Kubernetes object would. The next part also uses some built-in Kubernetes metadata that's available on all types of objects: annotations. If you're scheduling a bunch of jobs together, you might be working on a project or an experiment and want to put some labels on them, so you can throw those in at the top for when you're referring back to them later. Now, more importantly, there's the specification for this TensorFlow job. We've abstracted away the concept of containers for everybody, because you can get into a whole world of craziness there. If you want to use TensorFlow, you choose our TensorFlow 1.7 framework, which makes sure you have all the right binaries, GPU drivers, and everything else you could possibly need. 
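A sketch of how the top of such a custom resource might look; the API group, version, kind, and field names here are hypothetical reconstructions, since the real type is internal to Bloomberg:

```yaml
apiVersion: bloomberg.ds/v1alpha1   # a custom API group, not a Kubernetes-native one
kind: TensorFlowJob                 # hypothetical kind name
metadata:
  name: sentiment-training
  annotations:                      # free-form metadata for grouping related jobs
    project: news-sentiment
    experiment: run-17
spec:
  framework: tensorflow-1.7         # selects platform-built binaries and GPU drivers
```

A custom controller watches for objects of this kind and translates them into pods, just as the deployment controller does for Deployments.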
For data access, we found that setting up your keytabs for security, and your config files for HDFS and the rest of the Hadoop ecosystem, and managing all of that, could be difficult. So we've added first-class data citizenship to our types: you specify it through this identity field, and we make sure all your secrets and config maps are mounted into your running pod. Now, because you're not creating containers anymore, you need some way to specify your runtime dependencies and your entry points, so you can do that like this. The size parameter, I think, is kind of neat: instead of setting specific requests for CPU and memory, you set a T-shirt size of small, medium, or large, and with the flip of a switch you can change it to GPU. Maybe you were running some tests with this YAML on CPUs, but now you want to run it for real on your GPUs, so you just change the size from small to GPU-large; that's the only thing you need to change, and it gets scheduled onto the specialized hardware. This last bit is just the arguments that get passed into the application, and you can see the user here is intending to use HDFS with the Hadoop identity we talked about earlier. So, to bring back the fundamental pieces we wanted for solving this machine learning problem: by adding this size parameter and owning the containers, we can build fundamental building blocks that give people easy access to specialized hardware like GPUs; adding first-class data citizenship to our YAML types makes data much more liquid and easier to get at; the declarative nature of Kubernetes, and encapsulating it all in the driver, makes people's development experience a little easier; and most importantly, with custom resources we can speak ML specifically in Kubernetes. 
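Putting the pieces together, the full job spec being described might look something like this; again, every field name is a hypothetical reconstruction of an internal type, not a public API:

```yaml
apiVersion: bloomberg.ds/v1alpha1    # hypothetical internal API group/version
kind: TensorFlowJob
metadata:
  name: sentiment-training
spec:
  framework: tensorflow-1.7          # platform-owned container with the right binaries
  identity: katie-hadoop             # first-class data access: keytabs and config
                                     # maps get mounted into the running pod
  dependencies:
    - requirements.txt               # runtime dependencies, since users no longer build containers
  entryPoint: train.py
  size: small                        # T-shirt sizing instead of raw CPU/memory requests;
                                     # change to e.g. gpu-large to run on GPUs
  args:
    - --input=hdfs:///user/katie/training-data
```

The appeal is that moving from a CPU test run to a real GPU run is a one-line change to `size`, with scheduling onto the specialized hardware handled by the controller.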
And we've extended this to a few other types at Bloomberg: mainly Spark for distributed ETL; Python more generally, if you're using something like PyTorch or scikit-learn; the same thing for JVM-based runtimes; and if you need something interactive, we have support for Jupyter. We also have a whole separate operator for black-box hyperparameter tuning on top of these other job types. So, to wrap it up: at Bloomberg we have this huge amount of data, and we've needed machine learning for that data, so it's been a great opportunity to leverage machine learning. But to do it efficiently we really need our ML practitioners to think about ML and not have to think about infrastructure, and we do that by introducing Kubernetes, which lets the infrastructure folks think about infrastructure. So thanks for keeping this community strong and being involved in open source, and if you have more questions, please contact my manager, the head of Data and Analytics at Bloomberg; we'll be on a panel with some others tomorrow afternoon at four o'clock.