Can you all hear us? Yeah, mic's working. All right, sounds good. Awesome. So we've talked a lot about challenges in the Rust community, and the one we're going to be talking about today is a pretty harrowing one if you've been part of the Reddit community. And this is our over-engineered solution to it. So my name is Suchin Gururangan. I'm a data scientist at a Seattle startup called Bikuri. We focus on applying machine learning to user retention management. You can catch me on my site, suchin.co, and my GitHub handle, Pekisops1. I'm here with my former co-worker, Colin O'Brien. I'm Colin O'Brien. I'm a software developer at Rapid7. We do computer security work, incident detection and response. You can find me online at InsanityBit on Twitter, or almost any time on the Rust IRC, where I go by staticassert. Cool. So this talk is about our experience building scalable machine learning systems with questionable data, and it follows this outline. We're going to talk about technical debt in data science, what that means, and how it applies to machine learning. Then we're going to talk about a toy machine learning problem that we tried to solve entirely in Rust, and how it demonstrates that Rust is useful in tackling a lot of the technical debt from the first section. And then we're going to talk about our vision for Rust machine learning: where we see it now and where we want it to go. So the basic motivation for this talk is a problem that a lot of data scientists experience day to day: they're trying to work between research and production. These two types of data science are really different in terms of their motivations and goals. Research data science is one-off. You're working on your local machine, maybe loading up Python notebooks. There are no real engineering requirements here. You're usually doing data analysis, maybe building models. 
Production data science is where you're building machine learning models that maybe work in the cloud or on some device. There are real engineering constraints, like with any software product: you need to be thinking about reliability and consistency and performance. There is a lot of technical debt and a lot of difficulty associated with moving from research data science into production-level systems. Now, this is a basic diagram of what service-oriented machine learning looks like. It's pretty stripped down, but this is the basic framework. You first have a data collection phase, in which we're collecting data from maybe an internal repository or an external resource. This data represents some sort of problem that we're trying to solve. It could be really messy. It could have missing data points. It could be really non-standard. It could come from very unstable sources. We don't know. The next stage is data investigation, where we're trying to identify trends and understand our data. We use these trends to extract features from our data. We try to normalize the data, try to get it into a format that is very machine-learning-friendly and friendly to the model. That includes filling in a lot of missing data. The next step is model generation, which you're probably familiar with. We generate a model that does some sort of classification or clustering or pattern recognition, and that model does some sort of predictions on the fly. Now, it seems like the model is the biggest aspect of this pipeline and the most time-intensive aspect. But in reality, this is the life of a data scientist: you're always working with horrible data, especially in fields like security, which I worked in and he currently does. Your data is biased. It can be very dirty, with missing values. You're always wrangling with data. And this means that a lot of data scientists spend 99% of their time doing feature engineering. 
And that's basically the first three steps of this pipeline: data collection, investigation, and feature extraction. Within the feature engineering phase of this pipeline, there's a lot of technical debt that accrues, and it corresponds directly to moving from research into production. First, we have a lot of siloed teams. Researchers and data scientists usually work in organizations that are sectioned off from engineering, and that means that a lot of code that is handed off into production has to be re-implemented to meet engineering requirements. Pipeline jungles can arise from complicated transformations of code that are hard to scale, hard to reason about, and hard to maintain. And there are unscalable experiments, in which you are handed monolithic code, maybe written in a language that is not well-suited to parallelism, that cannot meet the requirements of the organization at a production level. On top of that, you have all the normal engineering debt that you accrue with any software product. So this is a really big problem. And what you end up seeing is that a mature machine learning system might end up being only 5% machine learning code and 95% feature engineering and glue code: code that's really just stitching together the messiness of the process. We got a lot of our motivation from a paper out of a group at Google that discusses this at length ("Hidden Technical Debt in Machine Learning Systems," Sculley et al.), and we cite it here. Long story short: building production-level data science services is really, really hard. And one thing that gets overlooked a lot is that machine learning services are not only dependent on the quality of the model, but also on the quality of feature engineering and data ingestion. Before you can start tweaking the parameters and the complexities of the model, you really need to make sure that your features and your data are reliable, safe, and consistent. So of course, where does Rust fit into this story? 
Rust is a systems programming language. It's focused on being low-level but with high-level abstractions, and what Suchin and I have found is that the things that make Rust a great systems language, the safety and the high-level abstractions, also help us pay down tech debt very quickly in some of the smaller projects that we've done together. But we wanted to see what happens when we take Rust and apply it to something where you accrue this tech debt so rapidly and in these kind of unique ways: essentially, applying Rust to the machine learning data science process. So we spend a lot of time on the Rust subreddit. There are a lot of good conversations that go on there; it's where I get the majority of my Rust news. So let's have a look at some of the posts on that subreddit. We have Vagga, a container tool. Someone's looking for feedback on their Rust program. A blog post on interior mutability in Rust. And then: "I'm streaming Rust for a while. Come watch me die." If you've spent time on the subreddit, you may have seen a post similar to that one, and it's a little confusing when you first see it. The reason for this is that there is a video game called Rust, and they have their own separate subreddit, but once in a while someone who wants to post about the video game ends up on the Rust language subreddit, and we get a post like that one. So our goal: we see this problem and we think this is a great candidate for machine learning. We'll just learn what the different posts look like, and then we can automatically sort and filter them. So why not do this entirely in Rust? We'll build a Rust classifier for Rust posts for the Rust subreddit. So we've got this first stage, right? We've got to actually get our data. We need to investigate it. So we wanna collect it first and store it. This is what our data is gonna look like. It basically just mimics what the Reddit API returns. You can see there's the author, the text of the post, the score, things like that. 
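As a rough sketch of the shape of this data and the pipeline around it, something like the following, where the struct fields mirror the Reddit API response described above but the type names, the `collect`/`extract` helpers, and the feature choices are our illustrative assumptions, not the talk's actual code:

```rust
// Hypothetical types sketching the raw data and the pipeline stages.
#[derive(Debug)]
struct RawPost {
    author: String,
    text: String,
    score: i64,
}

#[derive(Debug)]
struct PostFeatures {
    author_popularity: f64,
    rust_word_count: usize,
}

// Stage 1: data collection (stubbed here; the real version hits the Reddit API).
fn collect() -> Vec<RawPost> {
    vec![RawPost {
        author: "llogiq".into(),
        text: "a blog post on interior mutability in Rust".into(),
        score: 42,
    }]
}

// Stages 2-3: findings from data investigation baked into feature extraction.
fn extract(posts: &[RawPost]) -> Vec<PostFeatures> {
    posts
        .iter()
        .map(|p| PostFeatures {
            author_popularity: p.score as f64, // placeholder popularity signal
            rust_word_count: p.text.matches("Rust").count(),
        })
        .collect()
}

fn main() {
    let features = extract(&collect());
    println!("{:?}", features);
}
```

The typed structs are doing the job a data frame schema would do in Python, which comes up again later in the talk.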
So the first time we ran this, all we really wanted to do was just get some data so we could start investigating quickly. We didn't care about running this 100 times or being very stable or anything like that, and the evidence of that is that we've got these unwraps in our code base. Essentially what those are saying is: yes, this might fail; we're asserting that it doesn't; just crash the program otherwise. When you're collecting thousands of posts, you really don't want code like that. It's okay one or two times, but you don't wanna be collecting thousands of posts and then have your program crash 90% of the way through. It's not a lot of fun. Thankfully, Rust makes it really easy to take code like this and put it into a more production-friendly format. As you can see, we've added these try! macros here. Essentially, all that's saying is: yes, this can fail, return early if it does, and that's made explicit by the type of the function. So when someone calls this code, they know immediately: all right, something can go wrong here, I need to handle this somehow. Essentially, proper error handling by the caller is enforced by the type signature of the function. This is very different from a language that might have unchecked exceptions. In a language like that, everything looks fairly straightforward, right? We make our request, we extract the data, and then we return the value. But what's hidden here is what this function's actually doing. You don't know that your wifi not working great means that this whole program is gonna potentially crash. Knowledge of the implementation is actually critical if you wanna handle the errors. This can become very complicated when you have a bunch of different feature extraction processes that are doing a bunch of different crazy things, and you don't know where errors might start to crop up. Great, so now we have our data. We want to do some sort of investigation. 
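The contrast between the unwrap version and the try!/Result version can be sketched like this; the `parse_score` helpers stand in for our actual HTTP-fetching code, and `?` is the modern successor to the `try!` macro shown on the slide:

```rust
use std::num::ParseIntError;

// Quick-and-dirty version: unwrap() asserts success and crashes the whole
// collection run on the first malformed record.
fn parse_score_unwrap(raw: &str) -> i64 {
    raw.parse().unwrap()
}

// Production-friendly version: the Result in the signature tells the caller
// something can go wrong, and `?` returns early on failure.
fn parse_score(raw: &str) -> Result<i64, ParseIntError> {
    let score = raw.parse::<i64>()?;
    Ok(score)
}

fn main() {
    assert_eq!(parse_score_unwrap("7"), 7);
    assert_eq!(parse_score("42"), Ok(42));
    // The fallible version surfaces the error instead of crashing.
    assert!(parse_score("not a score").is_err());
    println!("both parsers behaved as expected");
}
```

The point is in the signatures: the second function cannot be called without the caller acknowledging the failure path.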
We want to explore the Reddit data to identify the features that differentiate the two subreddits from one another. The important thing to note here is that we couldn't really use Rust for this aspect of the pipeline. This is where dynamic languages like Python and R really dominate, and here's why. The instant feedback and natural investigative feel of the REPL and the interpreter really lend themselves well to exploration. We could make really small tweaks to our code and just visualize and get the results fast. Graphing and visualization is a huge priority in these languages, and it's not as much of one in the Rust community right now, so we had a lot of really out-of-the-box solutions for this type of stuff in these other languages. And in this process, performance and stability don't really matter. We're really just trying to understand our data, and usually our code only needs to run once, so we can iterate faster by just not thinking about language-level details like types. Here's an example of a Python notebook that we ran that just did some basic analysis of our data. You can see it's just really small code snippets that need to run once; we don't really care about performance constraints. The first snippet is basically identifying the most common authors on the Rust subreddit (llogiq is the most common), and the second snippet is just looking at the most common words in Rust posts. So now that we understand our data, we wanna put it into a format that the model can understand. Machine learning models don't do well with strings, with names like llogiq or raw words. You wanna be able to take the information from those words and express it in a way that the machine learning model can work with. Essentially, what we wanna do is take that raw post data that we saw earlier and turn it into these processed post features. So instead of an author, you have author popularity. 
You have word frequencies for interesting words that we found through our data investigation process. In a language like Julia or Python or R, kind of the heavy hitters for data science, you're likely to be interacting with your data through a data frame structure. It's easily queryable; it's got this columnar, tabular format. But it also tends to have a couple of problems: mixed types in your columns and rows can lead to degraded performance or erroneous values in your data set. We're not really big fans of that approach, and in something like Rust, we really like our types. We want to hold on to our types as much as we can; we like the performance and the safety they provide us. So how do we actually interact with our data in Rust? We took a typed approach using structures. As you can see, we have a vector of our raw post data; essentially, that vector corresponds to the data-frame-like approach that you can see at the bottom there. We want to select a column of that vector, so we create an iterator over it and select that field, right? And then we apply our function to it by just mapping over that column. This is fairly ergonomic and familiar if you're coming from an iterator- or data-frame-oriented syntax, but there are a couple of other benefits to it too. The rayon crate makes parallelization really easy, especially using this pattern. Essentially, we've changed none of the code except that we've added this .par_iter(). And now, when we select our column, the same exact thing happens, except we can iterate over our data two or three or four items at a time, using the system resources to their fullest. As an example, we decided to look at label encoding as a simple benchmark. Label encoding is where you take string values and convert them to unique integer values so that the model can understand that they're different things. This is really commonly used in the data science process to handle categorical string variables. 
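The column-selection pattern and a label encoder can be sketched together like this; the struct, the author names, and the hand-rolled `label_encode` are illustrative stand-ins (the real benchmark mirrored scikit-learn's LabelEncoder), and the rayon swap is noted in a comment rather than used, to keep the sketch dependency-free:

```rust
use std::collections::HashMap;

#[derive(Debug)]
struct RawPost {
    author: String,
    score: i64,
}

// A hand-rolled label encoder: each distinct string gets a unique integer,
// assigned in order of first appearance.
fn label_encode(values: &[&str]) -> Vec<usize> {
    let mut table: HashMap<&str, usize> = HashMap::new();
    values
        .iter()
        .map(|&v| {
            let next = table.len();
            *table.entry(v).or_insert(next)
        })
        .collect()
}

fn main() {
    let posts = vec![
        RawPost { author: "llogiq".into(), score: 10 },
        RawPost { author: "steveklabnik".into(), score: 25 },
        RawPost { author: "llogiq".into(), score: 7 },
    ];

    // "Selecting a column" is just iterating over one field; with the rayon
    // crate, swapping posts.iter() for posts.par_iter() parallelizes it.
    let authors: Vec<&str> = posts.iter().map(|p| p.author.as_str()).collect();

    let encoded = label_encode(&authors);
    println!("{:?}", encoded);
}
```

Repeated authors map to the same integer, which is exactly the categorical-to-numeric translation the model needs.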
The Rust code was pretty consistently 15 to 20 times faster in terms of processing speed. But what was really important was that the memory usage of the Python code shot up pretty significantly as we scaled, while the Rust code was very easy to reason about from a performance perspective. We did not have to worry that when we doubled or tripled or increased the data size by an order of magnitude, our performance would suddenly get sporadic and crazy. The Rust code simply performed as we expected, which is really important as you start getting new data sources and collecting new data from different areas. Great, for the last part of the pipeline, we want to generate a model and perform predictions. We want to choose a machine learning algorithm, fit it to our data, and then store it so we can load it somewhere else to do those predictions. We hope we've convinced you that Rust is a good fit for dealing with a lot of the technical debt that we outlined earlier in the talk. So now we want to talk about the state of the Rust machine learning ecosystem. Rust has a pretty nascent but developing machine learning community. There are about 60-plus crates with the tags "machine learning" and "linear algebra" on crates.io; that amounts to about 800K downloads and 500-plus versions published by those authors. You can track the current status of the machine learning community by going to arewelearningyet.com and by checking out the small machine learning IRC channel, which we hope people will join. One pretty exciting area of work is in numerics. Rust linear algebra is pretty cool. There's been a lot of work on benchmarking linear algebra code in Rust, and we've done some benchmarking of our own. Here you can see some benchmarking of the dot product in Rust, comparing it to OpenBLAS, a heavily optimized package that's used in a lot of really popular linear algebra libraries. 
We were able to achieve performance pretty comparable to OpenBLAS on our local machine by easily parallelizing with Rayon. This makes us really optimistic about work in this area, and we really hope that people do more work here. This is the basic code for model generation and prediction. It's really simple: we're just fitting our model to a feature matrix with some ground-truth labels and then performing predictions on novel data. We can serialize the model to disk and then deserialize it for predictions. We ended up using rustlearn as our library of choice for model generation, and we chose the random forest as the basic model for doing the predictions. So what was the final result? How did we do? This is some example output. The basic output of the model is a probability score between zero and one: how likely it thinks it is that a particular post is part of the Rust subreddit. And so you can see "what's your favorite piece of Rust code" is pretty likely to be part of Rust, and "never let your guard down around Naked" is probably not part of Rust. So it's cool that the model's able to pick that up. But the interesting stuff is the more confusing posts, like "ability to copy/duplicate maps." If you just showed me that, I would have no idea what subreddit it was part of, and the model seems to be slightly confused too, but, probably due to the text within it, it accurately chose it as a Play Rust post. Here's the general accuracy of the model. We had about a 99% accuracy, as shown by this graph, and that amounts to about a 98% true positive rate and a less than 1% false positive rate. So that was really good; we're happy with it. Here are the top features that we pulled out of the model, really around word frequencies. You can see things like "io", "rust", "github", the ampersand, which is cool, and "struct", "language", "projects": all words that we commonly associate with the Rust language. 
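The dot-product kernel behind the benchmark above can be sketched as follows. The sequential version is the real shape of the code; the hand-rolled threaded version is our dependency-free stand-in for what Rayon does automatically (with rayon, the parallel version is just `a.par_iter().zip(b).map(|(x, y)| x * y).sum()`):

```rust
use std::thread;

// Sequential dot product over two equal-length slices.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// A hand-rolled parallel version: split both slices into chunks, compute
// partial dot products on scoped threads, and sum the results.
fn dot_parallel(a: &[f64], b: &[f64], n_threads: usize) -> f64 {
    let chunk = (a.len() + n_threads - 1) / n_threads;
    thread::scope(|s| {
        let handles: Vec<_> = a
            .chunks(chunk)
            .zip(b.chunks(chunk))
            .map(|(ca, cb)| s.spawn(move || dot(ca, cb)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let b = vec![5.0, 6.0, 7.0, 8.0];
    println!("sequential = {}", dot(&a, &b));
    println!("parallel   = {}", dot_parallel(&a, &b, 2));
}
```

The appeal of Rayon is that the chunking, spawning, and joining shown here collapse into a one-character change to the iterator chain.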
So we were pretty happy with the results, and hopefully we'll stop those Play Rust posts in our community. There are a couple of other reasons that we really liked Rust but didn't want to delve too deeply into. The first is that Rust has really great tools for testing, benchmarking, and documentation. We did try to improve performance at various times, and having the ability to easily test code and ensure that we didn't make regressions (which did happen, and tests caught) was absolutely critical to maintaining a stable, consistent feature extraction pipeline. We also really enjoyed the high-level abstractions. You saw us using the iterator syntax to kind of emulate a data frame. That same iterator syntax often feels just like a list comprehension or something you may be familiar with from a language like Python. There are multiple examples of this that are very appealing if you're coming from a dynamic language. And then, of course, working with the Rust language community was consistently a pleasure. I mentioned I'm on IRC a lot; I've gotten consistently great responses when I ask questions. It's a very, very helpful community, and it was just great to work with them. So, there were a couple of areas where Rust fell short. The machine learning ecosystem is somewhat fragmented right now. Rust is a young language, and machine learning in Rust is even younger, so this is not surprising. There are multiple different crates with some of the same implementations of different algorithms, so it's a little difficult to know which one you wanna choose. The visualization tools don't exist. Honestly, we really like using Python for visualization, so this wasn't a really big problem for us. There's the closer-to-the-metal aspect of Rust: you have to understand, to some extent, that there's a stack and a heap, that they're two different things, and that you can't just keep references to the stack around, things like that. 
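The kind of regression test that protected the feature extraction pipeline can be sketched like this; the `word_frequency` extractor and the expected counts are illustrative, not the talk's actual code:

```rust
// A toy feature extractor: count exact occurrences of a word.
fn word_frequency(text: &str, word: &str) -> usize {
    text.split_whitespace().filter(|w| *w == word).count()
}

fn main() {
    println!("{}", word_frequency("Rust is fast and Rust is safe", "Rust"));
}

#[cfg(test)]
mod tests {
    use super::*;

    // `cargo test` runs this; a refactor that silently changed the
    // tokenization would fail here before reaching production.
    #[test]
    fn rust_mentions_counted_exactly() {
        assert_eq!(word_frequency("Rust is fast and Rust is safe", "Rust"), 2);
        assert_eq!(word_frequency("", "Rust"), 0);
    }
}
```

Because the test lives next to the code and runs with a built-in command, there is very little friction to keeping it current as the extractor evolves.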
And that can be a little intimidating if you're coming from these higher-level languages. And yeah, this one's not really about Rust so much as static languages in general: they have a hard time with the data exploration phase. Python lends itself so nicely to that because it's dynamic and you have this instant feedback loop where you can work with the interpreter. And the machine learning community is still kind of sparse, which goes back to the ecosystem being a little fragmented. There are multiple great people working on these projects, but to a large extent they're working independently. So, given that, there are a couple of areas we would really like to see pushed forward with Rust machine learning. Yeah, so we wanna see Rust promoted for one specific area: feature extraction. We personally feel that Rust is actually ready to be used in this area. Whether the machine learning ecosystem is ready or not, Rust was absolutely wonderful to work with in terms of extracting features in a meaningful, capable, clean way. We'd love to see Rust used as a teaching tool for machine learning. This is something that Python has with scikit-learn. When I wanna learn how some machine learning model works, essentially, I go to the scikit-learn documentation. It has a great explanation of the model at a high level, and then right there there's Python code showing me exactly how to use it. You end up picking up this Python code and approach as you learn to do machine learning. We would love to see something like that with Rust. We would also really like to see standard implementations of data science tooling. One project that was particularly interesting for us was the ndarray matrix implementation. Right now the machine learning community hasn't really centered on any one matrix implementation in particular, so different models use different ones, and that can be kind of a sticking point when you're trying to test out different areas of code. 
We'd also like to see the community around machine learning start to grow more of a focus: create goals for themselves and take actions towards those goals. I think everything up until this point is something that Rust is perfectly capable of dealing with, but it's about the community getting involved, having buy-in, and starting to talk to each other more. Different crate authors for these machine learning algorithms: we would love to talk to you about how we can start making moves in this direction. And I think that's all we've got, yeah. So, does anyone have questions? Yeah, yeah. Thank you, Steve. So we can probably take questions for a few minutes. Nice, nice. Yeah. "So are we gonna get a bot on /r/rust that automatically posts saying, did you mean to post on Play Rust?" That is the long-term plan, definitely. I've never worked with Reddit bots, but we do have a system where you could query it and get back your prediction score. So it's just a matter of... we need to coordinate with the moderators on that. Yeah. That'd be awesome. Sure. Suchin, you wanna go for it? "I noticed that in your vision for Rust machine learning, you don't mention visualization tools. And you said earlier that it didn't matter to you that much. Why aren't you pushing for more, like, better visualization tools in Rust?" You wanna take this? Yeah, sure. I think any machine learning system will be a hybrid framework of multiple tools that are specialized for specific things. And we weren't able to find any comparable language or framework or tool to Python that was able to achieve the explorative feel and the easiness of the language in this area. So we see that Rust is pretty well optimized for the feature engineering aspects, the processing, and achieving the performance that you need for that area of machine learning. But it doesn't need to solve the entire problem, right? 
And so we think that Python is probably still great for work in this area. Right, I think it comes down to priorities: in the short term, getting that visualization and data investigation aspect into Rust would probably be a significant amount of work, and it would be solving a problem that we don't really have. We felt that Rust solved that tech debt problem so well for us in many, many areas, but that tech debt problem didn't really apply to the investigative phase. Python just did it really well. So we wanna solve the big problems first. Thank you for that question. "You mentioned one of the shortcomings was interactive sessions with the data. I'm curious, because the feature extraction step kind of would lend itself well to having highly interactive sessions: does this algorithm appropriately extract these features, whatever. So I guess it's a two-part question. One: did you ever experiment with using Rusty or Rusti (I don't know how to pronounce that), the interactive Rust REPL?" Oh yeah, so I am familiar with the project. I wasn't aware that it was in a state where you could actually start working with it. So we haven't looked at it yet, but that would certainly be pretty exciting, and something that could maybe alleviate that area a little for us. "And I guess the follow-on to that question is: what would you like to see? Maybe we don't ever get anything, or, I don't wanna say ever, but we currently don't have anything as advanced as some of the data exploration tools that are out there. What would be some of the features you'd like to see immediately when it comes to your initial exploration of the data, to figure out what to extract?" It's interesting, yeah. I think so much of the data investigation phase is kind of ad hoc. You just load up your data, and your first couple of minutes are just gonna be poking at it: seeing what's there, seeing what's missing. 
It's hard to imagine that without a REPL, and that might just be a limitation of our experience, but the REPL lends itself so well to that poking. It might be the case that more abstractions over the language internals, not having to think about a lot of the language details, would help a lot in this phase, right? We just wanna be focusing on our data and not really thinking about the language. So a lot of the abstractions that Niko was talking about in the keynote may help towards this end and make it easier for people to get ramped up. We've got time for one more, and I feel like we should pick somebody in this half of the room since we went over here. Yeah, sure, next to the camera. I just really don't wanna run everywhere, that's okay. "Thanks for the talk. I was wondering what Python data frame implementation you were using in the comparison between Rust and Python performance?" Sure, so we used the pandas data frame, and for the label encoding we just grabbed the scikit-learn code. So it was very idiomatic Python, what you would probably end up choosing if you were writing this in a production setting. The Rust code was essentially a close mirror of that implementation. Thanks so much, guys. Great, thank you. Thank you.