Morning, everybody. How's everybody doing? We get to be here on a Friday, we don't have to be in the office, and we get to talk about Ruby all day. It's a pretty good day, and we're going to start off with something that I think is pretty fun and that I'm very passionate about: one of the hottest topics in the dev space, which is machine learning. There's something I haven't really liked about the machine learning space, and that is that people think that in order to build out a machine learning application, you need to use a tool such as Python or R. I don't like that for a number of different reasons, and we're going to walk through those today, because I'm going to show you how you can use Ruby, and Ruby by itself, to build out a full machine learning application.

Just a little bit of background on myself. My name is Jordan Hudgens. I'm the founder of DevCamp and the CTO of the Bottega Code School. In addition to that, as part of my background, especially as it relates to machine learning, I'm currently a PhD candidate at Texas Tech University, where my focus is on machine learning and how it can apply to adaptive learning systems. What I'm going to do today is show you a real-world product that we've built for DevCamp.

When I was preparing this talk, I was having a hard time picking out how I wanted to start it off. There are a bunch of different ways that you can start any kind of conference talk, and I decided to take the approach of making it like a real project. Whenever I'm building out some type of code project, the very first thing I try to do is pick out the main objective. The main objective for this talk is to demonstrate how you can utilize machine learning to build a recommendation engine, all in Ruby. If, by the end of it, you have a feel for all the steps that are needed, the architecture, and the components that you can use in order to do that, then I have accomplished my goal.

Now, when I'm building out some type of code project, after figuring out and defining the main objective, the next thing I like to do is build out wireframes. This is the architecture: all the components that I need for the system, and those components are what we're going to be discussing. In that diagram, you can see that we have the DevCamp learning management system. That's where all of the recommendation articles are right now, and that's where they're rendered. Now, that is connected to what actually got built out. I didn't want to integrate the machine learning component into the LMS itself, because I felt like I might want to use it for something else, so what I did was create a Rails API-only application, and that's where all of the machine learning components reside. Inside of there, there are a number of other components. We have a content crawler. The content crawler is something that goes out and scrapes websites; it finds content on other coding sites that may be of benefit, that may be able to provide recommended articles and supplement our material for the students. Inside of there, we have source articles, which we're going to talk about, and we're going to see how we can parse those and work with them. Then we have third-party articles. This could be something on, say, Medium or Stack Overflow, or just all the different places you could research additional material.
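To make that source versus third-party split concrete, here is a minimal sketch of how those two article types could be modeled inside the API-only app. The model name, columns, and enum values are my own assumptions for illustration, not the actual DevCamp schema.

```ruby
# Hypothetical model for the recommendation service; column and enum names are
# assumptions, not the production DevCamp schema.
class Article < ApplicationRecord
  # 0 = a source guide from the DevCamp LMS, 1 = content pulled in by the crawler
  enum article_type: { source: 0, third_party: 1 }

  validates :url, :title, :content, presence: true
end

# Article.source       => only internal LMS guides
# Article.third_party  => only crawled, external articles
```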
Now, from there, what I built out was a recommendation model. That's where the magic of the machine learning system really happens. That's where we're going to talk about concepts that you may or may not have heard about, like tokenization, building weights, and associating those weights with different parts of the system. Then there are a number of machine learning algorithms out there; there are hundreds of them. I experimented with a few, but the one that gave me the best results, and the one we're going to discuss, is what's called the naive Bayes algorithm. It's one of the most popular ones out there. If anyone here has ever used Gmail, naive Bayes became very, very famous in our circles because Gmail uses algorithms like that for spam filtering. It's one of the top ways it knows whether an email message is spam or not. But there's a reason why I wanted to show this specific algorithm: all of the case studies you find when you look up naive Bayes are for spam filtering, and I want to show you that you can actually use it for many other use cases.

So far we've talked about the main objective of what we're going to walk through, and we've talked about the basic components and the high-level architecture. Now it's time to get into the higher-resolution view of the system, so you're going to see the end product right now; you're going to get a preview of what was actually built. Here is a shot of the DevCamp LMS, and this is the content for a JavaScript tutorial on async/await, which, if you've never used it before, can be kind of a complex topic. If you scroll all the way down to the bottom there, you see these recommended guides. Now, these guides are not in DevCamp; they're not in the LMS. They're also not hard coded. Whoever the instructor was who created that tutorial, when they created it, they didn't go and say, okay, I want to add these specific recommended articles. We currently have 1,700 articles in the learning management system. If we had to manually pick the recommended articles, it would not be a very dynamic system, and so what we needed to do was utilize machine learning to automate that process. If we wanted to do that manually, we'd have to have an entire team of instructors constantly sourcing the web, going through thousands upon thousands of articles to find the most relevant ones and curate them. But I'm gonna show you how one system does all of that automatically, and how we're able to use Ruby to do it.

What I've talked about so far is the high-level process. Now I wanna get into why this is beneficial, because I feel like the why in this case, why I decided to build it, explains what was built. In order to do that, we're gonna jump into a little bit of a time machine and go back about 20 or so years, when I read one of the most influential books I've ever read, called Carry On, Mr. Bowditch. Now, this is a fictional book, but it taught me a principle related to learning that has stuck with me even to this day, and it's also exactly how I built this system. It's a rags-to-riches kind of story where someone, in this case a guy named Nat Bowditch, started with no formal education but had a passion to learn. He was taught how to read, and he used that to the extreme, to the point where he would go and get every single book he could find.
But instead of doing what many of us do, where we'll read a book, run into a topic that maybe doesn't make perfect sense, kind of forget about it, and just keep going, he took a very different approach. He decided that anytime he ran into a word he didn't understand, he would go and look that word up in other books, read those, and continually source and create almost a network of learning. Then, once he understood it, he would go back. That's what I wanted to do with this recommendation system. I wanted to create kind of a little Mr. Bowditch inside each student's pocket: someone that could be there constantly giving them additional information to help them understand. Because, and I'm not sure about everyone here, I usually don't understand something the first time I read it or it's told to me. Every once in a while I get lucky and it happens, but many times I'll hear something and it doesn't click, so I have to go to two, three, or four different spots to hear it from different perspectives, and then it starts to actually sink in and I understand it. That's the whole goal of this system. I would love to say that every single piece of curriculum that we write makes sense the first time, but it doesn't. Many times you have to hear something from different perspectives in order for it to sink in, and that's what this machine learning system does. It takes articles from all across the web, currently we have a library of somewhere between 10,000 and 20,000 articles, sifts through those, learns from the content, learns from the student, and then gives a recommended article based off of that.

So here we're gonna take another case study, and we're now actually going to start to get into how the system was built out. When I originally wanted to do this, I didn't really know where to start. It's a pretty big project, and I didn't know the right way to build it out, so I did what I always try to do, which is take a complex topic and boil it down to the most dead simple explanation possible. In this case, I didn't really know how to build the system out yet, but I knew that I needed to know what each piece of content was about. If I was gonna give a recommendation, if I was going to say I'm gonna try to put relevant results right next to these posts and it was all gonna be automated, the first thing I knew is that I needed to know what the article was about. So even though I didn't know how to build the system, I knew how to work with strings in Ruby, and Ruby's very good at working with strings. The first thing I did was start parsing all of the content. Each one of the posts has somewhere between 1,500 and 2,000 words, so I started parsing that, just splitting it up, and then started sorting those words. As soon as I started sorting them, I was able to see that all of the top words appeared multiple times, and so I was able to get some context from that. For this example, this is an article on metaprogramming, and because of that it has many references to method missing, to metaprogramming, and to Ruby. All of a sudden I looked at that data and was able to see what that content was about. Then, for the next step, I did this manually. I didn't try to build the machine learning system right away; I first tried to do it by hand, and I think that's a very important caveat.
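Just to make that first manual pass concrete, here is a minimal sketch of the kind of split-and-sort step described above, assuming the guide body is already available as plain text. The file name and the counts in the comment are made up for illustration.

```ruby
# Hypothetical input: the plain-text body of one LMS guide.
content = File.read("metaprogramming_guide.txt")

top_words = content
  .downcase
  .scan(/[a-z_]+/)                     # crude word extraction
  .tally                               # e.g. { "metaprogramming" => 14, ... }
  .sort_by { |_word, count| -count }   # most frequent words first
  .first(10)

p top_words
```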
A lot of people don't get into machine learning because they think they need to have the final system built from day one, or that they have to start with the algorithms. I take a different approach. I try to take a very manual approach, and I ask: what would I do if I had to do this just by myself, or if I hired someone to manually go and do this? So I went and just searched for articles, ones I felt were the most relevant, and then I performed the same type of process there, where I tokenized the words. I was able to take all the content on the page, grab the most popular words on there, and then, from that point, if you think of a Venn diagram, I was able to see the overlapping data points. I could look and see that the most popular words in one of the articles were also the most popular words in these other articles, and once I did that, I had a start. I had my base case for what the system was gonna do, and really, at a high level, that's what it does. Now, automating that is where it starts to get a little bit trickier, and that's what we're gonna walk through next. As you can see in this diagram, we're taking tens of thousands of articles in the library and tokenizing all of those; we're going through that process of being able to see what content is there. Then, in that second circle there, we create what's called a probability index, where we see which articles are the most probable to be associated with the others, and at the end we just combine those.

Now let's get a little closer to the code and talk about the gems that were used. The very first gem that I used was the naive Bayes gem, the nbayes gem. It is just a regular Ruby gem, and there's not a ton of magic involved with it. If you look up the naive Bayes algorithm, it's one of the more basic machine learning algorithms out there. Think back to if you ever took a stats class; maybe it was a little while ago. I took my stats class, I think, about 14 or 15 years ago, and I didn't remember a lot from it, but I did remember having to go through a lot of math problems. When it comes down to machine learning, if you want to get very basic with it, machine learning is really just being able to take data and run that data, in as efficient a way as possible, through these types of formulas. This is part of the reason why I do not like how people feel they have to use tools such as Python or R for every single machine learning system. Let's think about what we needed to do in order to build this out. We needed to run a mathematical computation; Ruby is great at math, and it can run this kind of calculation very efficiently, so right there, Ruby's still good. I needed to parse a lot of contextual data; I needed to create HTTP requests, parse that data, and then bring it into the library, and Ruby does an amazing job at that. Later on we're gonna talk about the background tasks and what those look like; Ruby does great at that too. I needed to create an API application that could connect and send data back and forth between one application and another, and Rails and Ruby do a great job at that. So there's not one reason why you can't use Ruby to build out this type of system.
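For anyone who wants to see the "run the data through a formula" part spelled out, here is a toy sketch of the scoring idea behind naive Bayes. This is my own simplified illustration, not the internals of the nbayes gem, and the numbers in the usage example are made up.

```ruby
# Toy naive Bayes scoring: a category's score is its prior probability times
# the product of per-word probabilities, with Laplace smoothing so words the
# category has never seen don't zero out the whole product.
def naive_bayes_score(tokens, word_counts, prior:, vocab_size:)
  total = word_counts.values.sum.to_f
  tokens.reduce(prior) do |score, token|
    score * ((word_counts.fetch(token, 0) + 1.0) / (total + vocab_size))
  end
end

# Compare a guide's tokens against each candidate article's word counts and
# recommend the articles with the highest scores.
naive_bayes_score(%w[metaprogramming method_missing],
                  { "metaprogramming" => 14, "ruby" => 9 },
                  prior: 0.5, vocab_size: 1000)
```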
So if you leave with anything today, if you remember one thing from this talk, what I want you to remember is that Ruby can do all of these things. One other thing that I feel is important in this space: I've been creating machine learning programs in Ruby for the last five or six years. Every once in a while I will run into a situation where I need to use a different tool. I may need a neural network, or something like Google's machine learning system or AWS's, for some very large, computationally expensive kind of task. But what I've found is that in many other scenarios that's overkill; it's not really needed. What I usually need is something like this, where I run the data through some type of statistical formula, and many gems are already in place for that, and I just get the results I want.

So that's one gem. The next one is the graph rank gem. When I created my initial implementation, I started by just writing this type of behavior myself. This is the tokenization process, and I got into it and it worked, and then, just through researching it, I discovered this gem, which does what I was doing much better than I was. So this handles the tokenization process, and it also incorporates a few other very helpful tools, such as being able to work with stop words; we're gonna look at what those represent and how they can be used. I ended up replacing my own implementation with a call to this gem, and it worked quite well.

Those are the two main ones needed on the machine learning side, but I didn't want to just give you a theoretical walkthrough of how that worked. I want to show you everything, from the beginning to the end. So some of the other tools that were used were things like Active Model Serializers. The reason why I used it was that I ran into a bug very early on with this implementation. I originally got everything working, got it on DevCamp, and I was trying to make Open Graph calls to each one of those sites. The problem is that started crashing the LMS; it was just too resource intensive. So I ended up moving all of that logic, the part where I go and grab the metadata from all of those other sites, into this microservice, and then I wrap it all up into a JSON API with the serializer. I use HTTParty for the API connections. I know for the experienced developers here some of this is just common sense, but I know there's a very wide range of experience in the room, so I wanna make sure I give you the full nuts and bolts, everything that is included. Then, lastly, I included Sidekiq. This is important if you've never built this type of system before. Sidekiq, if you've never used it, is a background task manager. Many of the processes that I'm talking about are resource intensive, so if you tried to simply call these gems and processes inline, make these HTTP requests, crawl these sites, and then perform the tokenizing and all those kinds of things, you're really not going to have a good time, because your server is going to crash many times. So what I ended up doing instead was creating a bunch of background tasks for those long-running processes. They use Sidekiq, they run through a queue, they simply get done whenever they get done, and the results get added to the library.
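To show what "just create a background task for it" looks like in practice, here is a minimal Sidekiq worker sketch. The class name, queue name, and the calls inside `perform` are hypothetical placeholders, not the actual DevCamp workers.

```ruby
require 'sidekiq'

# Hypothetical worker: processes one article off the request cycle so the web
# app never waits on crawling or tokenizing.
class ArticleProcessingJob
  include Sidekiq::Worker
  sidekiq_options queue: :machine_learning, retry: 3

  def perform(article_id)
    article = Article.find(article_id)
    tokens  = KeywordExtractor.call(article.content) # placeholder service object
    article.update!(keywords: tokens)
  end
end

# Enqueue it; Sidekiq works through the queue whenever workers are free.
# ArticleProcessingJob.perform_async(article.id)
```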
As we speak right now, the system is doing this. There are different processes that run all day, every day: they're crawling the web, they're finding these words, they're creating the probability analysis. All of that is happening right now. So that was a high-level overview, but I know that's only part of it; the fun stuff really starts when we get into the code itself, so that's what I want to show now, and I will try to leave some time at the end for questions.

This is the repo right here. It's a private repo because it's part of the DevCamp system, so I'm sorry I can't share the entire thing; this is a real product and we do have to keep some level of proprietary things, or else I'm gonna get in trouble. I added quite a bit to the README, though, so it explains exactly the processes that occur, and I will go through a few of the code files that have some of the important pieces. Part of this is the content crawler, which does a decent amount of the work. It performs pretty basic crawling behavior: we gave it an original set of data to work with, some original sites to go crawl and parse and grab links from, and then it started to do its own work from that point. We'd send it to articles on Medium and Stack Overflow, Ruby documentation sites, the Rails guides, those kinds of things, and from there it started grabbing those links and going out on its own.

Now, to see how that works on a manual kind of basis, you can imagine that you're going out and you're gonna crawl this site. We have this little object here, an article that could potentially be set for recommendation; this is a source article, something on DevCamp. It's going to call a content crawler service, and then it's gonna run what is called a recommendation builder, so it's going to go and set that recommendation. If I come over here, we have a probability builder job. This is going to be a background task, and this is where the tokenization and everything occurs. As you can see down there on line 17, that's where I'm calling that graph rank gem and grabbing the keywords. And then this is a very important thing if you're looking to build anything like this yourself, if you're looking to use contextual analysis: if you think of a giant bag of words, which is really what a page is, there are a lot of words you just want to ignore. If you take one of these articles and try to compare it to any of the others, you're gonna find that words like "the" and "a", all of these words that we need in order to communicate, are really not helpful for building context. That's why on line 18 there you can see that I have a call to stop words. This is pretty standard in pretty much any kind of natural language processing: you want to ignore a bunch of words, and in fact I think we ignore about 800 or so. There's a list right here of all the popular stop words, just a basic Ruby string array, and that was part of the reason why I liked using that graph rank gem, because I'm allowed to simply pass in the stop words I want to ignore, and then it bypasses them.
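As a rough idea of what that keyword step looks like, here is a sketch of stop-word filtering feeding a keyword extraction call. The stop-word list here is a tiny illustrative subset (the real one has hundreds of entries), and the `GraphRank::Keywords.new.run` call reflects my reading of the graph-rank gem's README rather than the actual DevCamp job, so treat the exact interface as an assumption.

```ruby
require 'graph-rank' # assumption: the gem that provides GraphRank::Keywords

# Tiny illustrative subset; the production list is a plain Ruby string array
# with hundreds of entries, including site-specific words like "devcamp".
STOP_WORDS = %w[the a an and or of to in is it that this for on with devcamp].freeze

def keywords_for(content)
  cleaned = content
            .downcase
            .scan(/[a-z_']+/)
            .reject { |word| STOP_WORDS.include?(word) }
            .join(' ')

  # Rank the remaining words and keep the ten that best describe the article.
  GraphRank::Keywords.new.run(cleaned).first(10)
end
```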
If I didn't have that in there, every single article would pretty much look the same: yeah, we have 500 "the"s in here, and so does this other one, so there wouldn't be any real recommendations. Instead, we stripped all of those out. I also customized that list for our own reasons; this is on the DevCamp LMS, so I don't want the word DevCamp in there. You can customize it; it's not just pronouns and conjunctions and those kinds of words, you can ignore anything you want. So one part of the process was building out exactly what we wanted to filter down to, and from there we just called that graph rank run method and it tokenized all of them. We ended up with this really nice set of words, just like that diagram we saw earlier: each one of those articles, I have it boiled down to the ten most popular words, and from there I was able to get really nice context on what that article was about. The results that we get on the system itself have been pretty good, so I'm happy with the way the probability system's working. Now... oh, I don't have internet, so I should have kept that up. No, that's fine, I already had it; I'm just gonna use it from the README.

So from here, moving down, that was one article, a source article, and if you go down a little bit you can see that there are third-party ones, and I filter it so I'm only using third-party content. I'm also using a Rails enum in order to have a really nice and easy way to query which are my source articles and which ones are on third-party sites, because I don't wanna recommend within the LMS; I don't want to point students to other parts of the LMS. At some point we may wanna do that, but for right now I want recommended articles to only give third-party content, so we filter out everything internal and we only point to outside content.

This first entire section is really just what you can imagine the crawler does, except this is the manual process, and when I was starting to build this system out, this is what I did. I did not have the crawler right away; I did not build the crawler until I was happy with the recommendations. Part of the reason I got that idea was from a book I really liked that I read last year, called the Googleplex. It was a story from pretty much the beginning, the genesis of Google, and how they built out the search engine, the company, and everything like that. One story from that book that really stuck with me was how they originally created the PageRank algorithm. You would think, just because you see the final product now with Google, that they had all these crazy algorithms and all these processes that were running and it just flat out worked, but I love the story of how it actually happened. They weren't happy with the results they were getting right away, so instead what they did was say: we wanna get it working for one word. We want the search engine to work for the word "university", and they didn't care about any other word in the world. They just said, let's get this working for "university", and they refined the algorithm, they refined the PageRank system, until, when they typed "university" into the system, it brought back results they were happy with. So I took that same approach here. I didn't build out the tens of thousands of articles; I started with a couple hundred articles that I manually pulled, and I included outliers.
I would pick out certain ones from different languages, like you see here; these are actually the real articles I used in the beginning. I would pick out a React article, then a Rails routing article, and so on, all from different languages and different concepts, to make sure that I was getting the results I wanted in a limited testing kind of phase. That was my first approach, so I just manually built all of these, and then, once I was happy with it, that's when I built the crawler. That's when I said, okay, now we're ready to actually go get content, and then we built that out, and that's how we have the library we have now.

The way that this calls the naive Bayes system is that I have a task that runs, and the gem, and this is one thing I really like about this specific nbayes gem, creates an object. When you instantiate it, it creates this object, and then you train the object. If you're familiar with machine learning, if you've done any studying on it at all, you've heard of the concept of training: it means that you take your entire library of data, everything that you wanna learn from, and you train the model on it. So that's what I've done. I use that Rails enum, I loop over every one of the third-party recommendations, and I train the system. I do this just a few times a day, so I'm continually updating it, and from there I'm able to classify against the system. All of this is really inside of that gem, which is one of the reasons I like it.

But as good and as accurate as all of that is, one of my favorite things is how it allows this entire system to be very performant, because you have to remember this is stateless. When I run this, yeah, it works in the console, it works in my testing, but that's not really helpful in production, because I need to run this process every time a page loads. Whenever I have a guide load, it needs to go check through the entire recommendation system, and it would be a really bad idea to go through tens of thousands of articles and run all of these probabilities on every single guide load. So what this does, as you can see here in this save-file section, is it actually allows me to save the entire library, the entire trained system, into a YAML file. All of my probability indexes, everything like that, is in a YAML file on the server that is continually updated throughout the day. Then, if you look at the very bottom, that's how you can load it. When the API request comes in, and thousands upon thousands of these are happening on an hourly basis, it can just load that single file. It's a decent size, but it's still very small, especially compared to running through all of those data requests, and it performs no real computation; you simply load the file up and it finds out where those probabilities lie. That's where, if you remember that Venn diagram we saw earlier, that's where this happens: we load the file and then run the guide through the classification model. Every time someone goes to one of those pages, like that metaprogramming page, or Rails routing, or a React page, whatever it is they're on, it calls this system, loads the file, and brings back the top three results, the ones that are the most relevant. As you saw earlier, it does a pretty good job of that; it has tens of thousands of articles, and when it goes to a metaprogramming page, it knows exactly what it's talking about.
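Here is a rough sketch of that train, save, load, and classify cycle, based on my reading of the nbayes gem's documented interface. The `Article` model, the `keywords` column, and the file path are hypothetical, and the persistence method names (`dump` / `from`) are how I recall the gem's API, so check them against the gem version in use.

```ruby
require 'nbayes'

# Training task (run a few times a day, in the background):
nbayes = NBayes::Base.new
Article.third_party.find_each do |article|
  # Tokens in, the article's id as the category label.
  nbayes.train(article.keywords, article.id.to_s)
end

# Persist the trained model so page loads never re-run training.
# (dump/from reflect my recollection of the gem's persistence API.)
nbayes.dump('tmp/recommendation_model.yml')

# Inside the API request: load the saved model and classify the current guide.
model  = NBayes::Base.from('tmp/recommendation_model.yml')
result = model.classify(current_guide_keywords)
result.max_class # => id of the single most probable article; the real system
                 #    sorts the full probability set and keeps the top three
```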
Hopefully what you can see is that we haven't really gone through anything too complex. What we've walked through has been pretty basic in regards to the processes; we're only doing about two or three things, and we were able to build out a full machine learning recommendation engine that is also pretty performant.

Just to take a look, if you wanna see what that YAML file looks like, I'm not on the server, I'm just local, but this is an example of what that nbayes gem generates. Every time this runs, the file gets overwritten, and you have just a basic YAML lookup: you have an ID from the database, then it has the tokens, and each one of those tokens is sorted by how many times it was shown on the site. That is what the model looks at; it looks at that, compares it, and then picks out the three articles that are considered to be the most relevant.

Now, one of my favorite things when I see this working, when I'm just playing with the concepts, is when I publish a guide and then go and Google the keywords I think are gonna be the most relevant for that guide, so I just go do it manually, and I find some articles that are obviously very relevant. But I think it's pretty cool when I do the same thing, I see that article pop up, I scroll down to the recommended articles, and I find three articles that are actually better than what I personally searched for manually. That's when I know I have a win, and I know it's working properly: when it's doing a better job than if I were doing this manually myself. Really, at the end of the day, that's what this entire system is about: it's taking something that could be done manually and automating it, and that's the goal of most machine learning processes.

Before I get into questions, just to give a little bit of an idea of where we're looking to take this: right now this is good for recommended articles, and this is something you've seen for years. You go on Amazon, you see a product, and right below it it says, if you like this product, you may like these other products. That's something that's been around for quite a while, and it's helpful to build a system like that; I'm sure many of you are in jobs where you might wanna build a very similar product. What we're looking to do, and this can give you some ideas on what else is out there and what else you can do to extend this, is make the entire learning system dynamic. So instead of simply giving recommended articles, it will take a developer, a student going through the learning management system, and based on their assessments, as they're taking quizzes, as they're pushing up code, it will start to build a student model, very similar to how we built a recommendation model, and then it will dynamically change the content. You could have two students sitting right next to each other: one of them is very good at understanding the topic they're learning, and the one right next to them is maybe a little bit more like me, a little hard headed, and they need to be told two or three times what the concept is. Even though they're going through the same program, the two students would have two completely different experiences; the entire sequence of curriculum for one of them would start to change, all based on these kinds of processes. Now, that would not be possible manually.
But by leveraging machine learning, that type of adaptive learning environment is possible, because the system can learn, it can adjust, and it can do it all on the fly. The real end goal of all of this is to make learning a much more dynamic type of experience. For my own personal goals, that is why I love machine learning, and it is also why I love Ruby, because I think it is a great and a very undervalued tool in the machine learning space. Thank you all so much for your time.