My name is Grant Ingersoll. The title is a bit cheeky: "BM25 is so yesterday." If BM25 means nothing to you, don't worry, I'll explain it, or at least wave my hands at it. What I really want to talk about today is the genesis of what's going on in relevance for search and text. There's a bit of higher-level talk, but I'm going to show you some demos, and we'll walk through a day in the life of a search developer, both the old days and the more modern version. A little background: I'm the CTO and co-founder of Lucidworks, I started the Mahout project, and I've been a Lucene/Solr committer for a long time.

So let's get started. You're building out a search application, maybe e-commerce, and your users come in and say, hey, I want to find iPad cases. You've been at this search thing for a long time, much like me, so you might come up with something like this on the back end: you take the user's query, do some processing, maybe throw in some synonyms, start to muck around with ORs and ANDs and Boolean queries and all the bits that Solr or Lucene or whatever search engine gives you, and from that you go do a search. You hit Solr against this document collection (it's kind of hard to read), and the number one result that comes back is the Cleveland Browns folio case for Apple iPad. Yay, we got "iPad," we got "case," we got all the right words. And then way down the stack, at 15, is pretty much that same case, but branded for the Minnesota Vikings. And you're thinking to yourself, who the hell likes the Browns? In fact, nobody on your site likes the Browns. Well, the Browns fans like the Browns, but everybody else hates the Browns, because we all know the Vikings are the best team. Your marketing team loves the Vikings, and so does your CEO, and it turns out all your customers love the Vikings as well. So you say to yourself, that's not good. We want to sell more Vikings gear. We don't want that result at 15, we want it at one.

So you start to ask, with your boss yelling at you and marketing yelling at you, what do I do? Well, the first thing you do is go read the manual, in our case the Solr Reference Guide, a really good starting place with lots of useful content. Then you dig in a little and discover that Solr and Lucene have these things called boosts: I can put boosts on my documents at index time, or I can add them at query time. So you might say, when a document comes in, if I see it contains "Vikings," I'm going to boost it by 100, so that, if you know anything about the way Lucene does ranking, all of a sudden it's a much more important document. How did I come up with 100? Finger to the wind. How did I come up with 10 here? Eh, I guessed. You start with those kinds of things, and then you say, it's still not good enough; that Vikings document is now at 10, so at least it's on the first page. Marketing's a little happier, and they're like, okay, Grant, keep doing what you're doing.
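A minimal sketch of that kind of hand-tuned boosting, using SolrJ from Scala. The collection name, field names, and weights are all hypothetical, which is kind of the point: the numbers are finger-to-the-wind.

```scala
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

object BoostByHand {
  def main(args: Array[String]): Unit = {
    // Hypothetical Solr endpoint with a "products" collection
    val client = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()

    val q = new SolrQuery("ipad case")
    q.set("defType", "edismax")
    q.set("qf", "title^10 description")  // why 10? eh, I guessed
    q.set("bq", "team_s:Vikings^100")    // query-time boost: finger to the wind

    val results = client.query(q).getResults
    println(s"${results.getNumFound} hits")
    if (!results.isEmpty) println("#1: " + results.get(0).getFieldValue("title"))
    client.close()
  }
}
```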
And so then you start to dig in a little more. This is not meant to be read, but this is what you get if you hit Solr's debugQuery functionality: it spits out all of this math, all of the reasons why each term matched in certain ways, and you can sit and stare at it. I used to get what I called "explain blindness," because there was a while in my career where really deep relevance testing was part of my job, and I would sit and stare at these things for hours on end across all these different documents, tweaking my queries, and eventually you just go mad. So you get tired of looking at those, and then you really go read the manuals: you might get Solr in Action, or my book, or the new book on relevant search. You start to learn a lot more about how this core search engine actually works.

If you were in Yonik's talk, you heard him talk about TF-IDF. This is one of the core things that happens in a search engine; it's one of the things that separates our indexing from the way a traditional relational database does indexing, and the reason is that it's looking at the yin and yang of your actual content. Term frequency basically asks: how important is this term in this document? Document frequency basically asks: how important is this term across all of my documents? So now you start to understand some of the factors that go into calculating that score, and in fact, if you go back and look at your explains, you'll see term frequencies and IDFs in there. Lucene these days throws in a bunch of other scoring factors too. Part of the title here is BM25, and if you really want to get geeky on the math, you can go look at the full formula. I'm not going to go over it in depth; basically it's TF-IDF plus things like document length, because the intuition is that the more a term matches, the better, as long as it doesn't match everywhere else, and all else being equal, I would rather read a shorter document than a longer one. Intuitively it makes sense, and it's based on core information theory that's been around for a long time. If you go back to the '70s, there's Salton, who kind of started all of this, and then work like Okapi built on top of that, and so on.

So you start to understand how Lucene is actually working, and now you can really get into those knobs and dials. And then, being the smart engineer you are, you say: I need data. Do I really care about that Vikings document or not? So you put in place a bunch of metrics. Maybe you work with your marketing team and say, hey, you've got all this stuff over in Omniture and all these pretty pictures, can I have some of that? I want to see what users are actually clicking on. From those things you might start to build out a relevance practice. Go read that book on relevant search; there's a lot of good advice in there about how to think about relevance overall. This is kind of distilling those things down.
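For reference, the BM25 formula I'm waving my hands at looks roughly like this, where $f(q_i, D)$ is term $q_i$'s frequency in document $D$, $|D|$ is the document's length, and $\mathrm{avgdl}$ is the average document length across the collection (Lucene's defaults are $k_1 = 1.2$ and $b = 0.75$):

$$\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

The $k_1$ knob saturates term frequency (more matches help, with diminishing returns), and $b$ controls how hard document-length normalization kicks in, which is exactly that "shorter document, all else equal" intuition.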
You might put in place and learn some additional math around calculating what's relevant and what's not, and lather, rinse, repeat. And as you might sense, this is a lot of work. I've made a whole career off of this; many of you in this room probably have too, all of this really fine-grained Solr tuning, et cetera. But then you have this light-bulb idea and you ask: what would Google do? Not what would God do, what would Google do? And the answer is: learn from the wisdom of the crowds. All my users are inherently telling me what to do with this content.

From that, I might develop this little hamburger diagram that talks at a deeper level about how I start to think about solving the search relevance problem, and it puts things in context a bit more. That core Lucene stuff, BM25, document and query boosts, all of that, is essentially the first half or so of solving the relevance problem, because the reality is there's a lot more that goes into returning what your users actually care about than just what the engine scored as most relevant according to some information-theory formula. So think of it like this: core Lucene brings you that low-level matching; it's really fast, it's really efficient, there are a lot of options, and you can go to town. You can move these layers up and down. Things like Solr come along and add a whole bunch on top, things that help discovery and guide the user better: faceting, spell checking, hit highlighting, all of that. Then you're always going to have a layer that's just "this is my business": I'm in e-commerce, I'm in finance, I'm in healthcare, et cetera; there will always be things only you know about. And these days the really interesting stuff, the last mile, if you will (although you never truly get to the last mile), is around machine learning. I'm going to assume for the rest of this talk that you pretty much know the things at the bottom of the stack and focus on how we integrate and overlay all of these other pieces into an application; I'll give a little demo of that in a minute.

And of course, just to keep things in context, that last sliver is your escape key for when your boss comes in and says, what about this one? You're never going to be perfect at this, so you always have to have some expectation setting going on. That last piece is also where the metrics you put in place save your day when your boss is really mad, because you can come back and say: yeah, I know that query is really important to you, but guess what, nobody else cares about it. If your boss happens to be the one Browns fan in the entire company and says, I want the Browns on top, you can say: maybe the Browns should be at the top, but here's the data. I've measured all this stuff. Nobody cares about the Browns. Everybody likes the Vikings. So that's how I think about relevance at a high level.
Pivoting a little, let's think about how we create a picture around capturing data and interacting with users. I tend to think about this in terms of what I call the three C's: content, collaboration, and context. Content is a lot of your core Solr stuff: text matching, faceting, all of that. Collaboration is the "what would Google do?" aspect: how do we leverage things like clickstreams, feedback, and crowdsourcing? And last, and this is still pretty emerging for a lot of people these days, is context: personalization, recommendations, user and market segmentation, those kinds of things. At the end of the day, relevance becomes this holistic view of not just your data, but the way your users interact with that data, as well as the things you care about per user, per company, et cetera.

So what does that look like in the real world? I'll show you all of this in action in a minute, but as documents come in, there are things you can do on the indexing side to help at query time, in terms of developing a deeper understanding. First, if you've built machine learning models, you might do things like named entity recognition, topic identification, maybe anomaly detection, clustering, word2vec, et cetera. As your content comes in, you're applying labels to it, extracting from it, marking it up, annotating it. That's all pretty straightforward. You might then apply whatever rules you have (maybe you still want that Vikings boost of 100), and it all ends up in Solr. So now in Solr I've got my core catalog, if you will.

That's all great, but what we often see in the real world is that this lather-rinse-repeat cycle applies to your data too. You want to think about how to make the content smarter. So the next step is something offline, or these days you can even do some of it streaming for certain operations. In my world, you load it into Spark or some other engine like that, and you start building your models. There's kind of an inception in here: you saw I had all these models applied across my content; well, I had to build those models too, and I'm often building them off of that core content. Solr is really great at feeding this. As a company we have a project called spark-solr that lets you treat Solr as an RDD or a DataFrame in Spark, so we can use all of the tools in Spark for things like word2vec and PageRank and topic detection. And then, here's the really cool thing about this nice, simple world: you can put all of that output right back into Solr as well. I've short-circuited all of the stuff around Hadoop and HDFS and those other pieces, and I'm just using Solr for all of it. The models just happen to live in other collections. We call them sidecar collections, but you can really do whatever you want, or call them whatever you want.
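Here's a minimal sketch of that loop with the spark-solr connector; the zkhost, the collection names, and the enrichment step itself are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object EnrichCatalog {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("enrich-catalog").getOrCreate()

    // Treat a Solr collection as a DataFrame via spark-solr
    val catalog = spark.read.format("solr")
      .option("zkhost", "localhost:9983")
      .option("collection", "products")
      .load()

    // ... build your word2vec / topic / clustering models over the content
    // here; this placeholder just passes the catalog through unchanged ...
    val enriched = catalog

    // Write the model output into a sidecar collection: still just Solr
    enriched.write.format("solr")
      .option("zkhost", "localhost:9983")
      .option("collection", "products_enriched")
      .save()
  }
}
```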
These sidecar collections are just yet another collection or core, or whatever the nomenclature is in Solr these days. The nice thing is that on the indexing side you now have this model, and especially with the way Solr supports updates now, you can refresh it quickly; you might run that offline job every hour or every day. It really depends on the size of what you're trying to do, how much of the data you're looking at, and so on. So that's the indexing side of the bigger relevance picture: I've done a lot of enhancements, in other words, to the core content at this point.

Then, once this stuff is in production with my Solr catalog behind it, is where you really start to dig in on the query side of relevance. One of the first things you want to do is think about what the user is actually asking you. What is their intent? A core search engine like Solr, if you just hit it with a query, is going to go look at the statistics already built into the indexes; it's not going to take a step back and ask, what is this user actually asking me? I think about query intent at a couple of different layers. At the strategic layer: what is the overall goal of this user's query? Are they actually looking for a product? Are they looking for support on the product? Are they looking for help from you as a company? That strategic intent is the thing that guides my decisions downstream. The tactical part of understanding query intent is essentially attaching labels to different parts of the query so you know what roles they'll play downstream. For "iPad case," the strategic intent you infer might be "find me a product," or really "find me a known item," and tactically, "iPad" would be a brand or type of product and "case" would be an accessory qualifier. Now you can label those things, and if you had a more expansive query, like "Apple 32 gigabyte iPad 4" or whatever the latest version is, you might break the query down even further. You might also look up the semantics, the meaning behind the query, using something like a knowledge base around your products and their relationships to brands and all of that. If you were in Yonik's talk earlier, this is a great place to use some of that relational mapping or graph traversal he was talking about to better understand what's going on between those things. Ultimately what you want here is the ability to shape your queries differently depending on the outcome of that intent analysis.
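As a rough sketch of acting on a predicted intent (and ignoring for a moment the model-serving caveats I'll get to later): load a classifier trained offline, predict a category for the raw query, and attach a filter query when the prediction isn't the catch-all bucket. The model path, field names, and the categoryLabels lookup are all invented:

```scala
import org.apache.solr.client.solrj.SolrQuery
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

object IntentShapedQuery {
  val spark = SparkSession.builder().appName("intent").master("local[*]").getOrCreate()
  import spark.implicits._

  // Classifier trained offline (a sketch of the training job comes later)
  val model = PipelineModel.load("/models/query-intent")
  // Labels in the order the StringIndexer assigned them at training time
  val categoryLabels = Array("other", "tablets", "tvs", "cases")

  def shape(userQuery: String): SolrQuery = {
    val q = new SolrQuery(userQuery)
    val predicted = model.transform(Seq(userQuery).toDF("query_s"))
      .select("prediction").head().getDouble(0)
    val category = categoryLabels(predicted.toInt)
    // Only restrict when the model predicts something other than "other"
    if (category != "other") q.addFilterQuery(s"category_s:\"$category\"")
    q
  }
}
```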
The next thing you probably want to do: remember, you asked the question, what would Google do? What you want here is to take all of your prior knowledge about the users who have asked this question before and use it to inform what you do for this specific user. In my world we call these signals: things like clicks, things like add-to-carts, things like the fact that we're all sitting in this talk right now. Any individual signal by itself isn't all that important, but once you start to look at the aggregate, they get a lot more interesting. In fact, one of the things we do is put all of this stuff back into Solr as well, in yet another sidecar collection. I think you'll see during the demo that this plays really nicely, because everything you know about deploying Solr, you just keep doing, and all the queries just work, on this data just as well as on your core content. Then you might bring in things like your user factors, et cetera, which starts to get at the context piece. So there it is again: content, collaboration, context. And then of course we apply any specific rules we have; that's really up to you and your company's goals.

All of that leads to Solr: we go do an actual query against it. At this point, what you're asking Solr for is a coarse-grained set of documents that are candidates for answering the question. Looking back at that hamburger stack, we're asking Solr to do the core information-theory ranking, although we've also overlaid some other factors into the query; I'll show you the rough shape of what a query looks like in this world in a minute.

And here's where it all gets really cool. There's been a lot of really interesting work on using machine learning for what's called learning to rank. This isn't something that's just done in Solr; Google and Bing and all the big companies do it, and there's a ton of academic research out there. One of the things that I think truly sets Solr apart from the other open source search engines is that we have learning to rank built in, with support for a lot of different models. What you typically see happen is: I've got my coarse-grained candidate set, and then we apply Solr's re-rank function. What that basically says is: my coarse-grained query might return a thousand results; the re-ranking then applies only to those thousand. If you think about it, the first-pass query is an expensive inner-loop calculation because it potentially runs across all of the content that matches; the re-rank is kind of reversed: a much smaller set of content, but a per-document calculation that's often much more expensive, because you're applying machine learning and all of that. Solr has this straightforward thing called the rank query, and the learning-to-rank stuff, which I think Christine here helped build, or some of the people on your team, so you can keep me honest. You can apply a bunch of different machine learning models that you've trained offline, perhaps in that offline task I talked about during indexing, but this time off of your clickstreams and things like that, and basically redo the ordering of those top results. And if you look at the really top-tier search companies, your Googles, your Amazons, et cetera, they often have lots of re-rankers going on.
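Mechanically, that re-rank hook is just a query parameter. A sketch: the cheap first pass retrieves a thousand candidates, and a more expensive query (here a made-up, signals-derived function standing in for a real model) re-scores only those:

```scala
import org.apache.solr.client.solrj.SolrQuery

val q = new SolrQuery("ipad case")
q.set("defType", "edismax")
q.set("qf", "title^10 description")
// Re-rank only the top 1000 first-pass candidates, blending in the
// expensive query's score with a weight of 3
q.set("rq", "{!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=3}")
// Hypothetical popularity_d field, hydrated from the signals collection
q.set("rqq", "{!func}field(popularity_d)")
```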
In fact, I've seen or heard of cases where people have a model built just for position one versus position two. You might have a model for the top 500, another for the top 50, another for the top 10, and then, like I said, one just for one versus two. It really depends, as a business, how much you care about one versus two. Most studies show there's a bias built into people: they click the first result almost regardless. You can twiddle one versus two, and people will choose the first one more often than you would expect, given that they saw both of them. All of those things factor in here, and Solr has some really great functionality around both learning to rank and these re-rankers, so you can build up a much more sophisticated model of your search. Then, last but not least, you might want to transform your results on the way out; Solr has a lot of nice capabilities for that, and you might also do some things downstream.

One key takeaway: before Solr added learning to rank and the re-rank query, in the Solr 3 and Solr 4 days, you would hit this coarse-grained thing and pull a thousand-plus results out of Solr just so you could do your downstream processing on them, and that was super expensive. These days, with learning to rank and the graph traversal stuff, you can apply the security trimming and all of these re-ranking factors much closer to the data, sometimes right in that inner loop. At the end of the day you're shipping less data around, and everybody's happier. So that's the query side.

The last bit here (oops, I've got a little mess-up on my slide) is thinking about what users are doing, and what I like to do is capture all of that right back into Solr. These are those signals: every time a user clicks, I index that into a sidecar collection in Solr. Then I want to look at all of those clicks and things across the whole collection, so I load them up into Spark, build the clickstream models we talked about, run some query analysis, and build things like recommenders. Yonik talked earlier about how you can do a really simple recommender with graph traversal; with Spark ML and the like, you can build world-class recommenders on top of this. You take all of that signal data, join it with your core catalog data, which is also in Solr, so you don't have to move data around outside of your system, and then output those models right back into Solr. As you'll see in the demo, the models are, again, just Solr documents, which is really cool. And that feeds into the query side of all of this. So I think now you can see there's a full cycle: I'm indexing and enriching the content as it comes in, I'm serving it up, and I'm doing a bunch of things to capture and gauge how users interact with it. And by the way, this doesn't just have to be consumer-facing, things like e-commerce or website search; you can do all of this behind the firewall as well.
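Capturing a click signal can be as simple as indexing one more document into the sidecar collection. A sketch with SolrJ; the collection and field names are invented, loosely modeled on the raw signals I'll show at the end:

```scala
import java.util.UUID
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument

val signals = new HttpSolrClient.Builder("http://localhost:8983/solr/signals").build()

val click = new SolrInputDocument()
click.addField("id", UUID.randomUUID().toString)
click.addField("type_s", "click")
click.addField("user_id_s", "u123")
click.addField("query_s", "green lantern")
click.addField("doc_id_s", "9924603")            // what they clicked on
click.addField("filters_ss", "category:Movies")  // filters in effect at the time
click.addField("timestamp_dt", java.time.Instant.now().toString)

signals.add(click)  // in practice, batch these and rely on commitWithin
signals.close()
```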
And then the user side of it is, again, really just using Solr to enrich your understanding of what users are doing. So with that, here is a conceptual view of how I think about queries, with some caveats. First off, Solr's eDismax does a pretty good approximation of most of this; it's not meant to be an exact Solr query so much as a conceptual model. But remember, we're trying to do a couple of things. Usually we care a lot about precision: we want the best documents at the top. In fact, if a user is looking for a specific known item, it had better be number one or number two, or else you're in trouble and your boss is going to be mad at you. Sometimes we also care about recall. So for the traditional, Solr-based, precision-oriented pieces, you do things like exact matches and sloppy phrase queries, all designed to boost the best matches to the top in terms of core information theory, and you might throw in an OR query so you never have zero results. All of these boosts, by the way, are things that in the old days you would sit and futz with yourself; in the machine learning days, you would probably try to learn them.

And then this is where you start to get into all the other factors I'm talking about: you can literally just add things onto the original query. For instance, use word2vec to get synonym expansions or conceptual mappings for your queries. If you're not familiar with word2vec, it essentially takes any given word, looks at all the content around that word, and builds up a model of co-occurrences that tells you which terms commonly co-occur with the term you're looking at. That's a really rough, hand-wavy view; there's obviously a lot of math underneath. But that word2vec model can then be used, for instance, when the user types "iPad" wrong, to detect and expand it; or you might notice that anytime somebody types iPad they also type Apple, so you apply an Apple filter. Then there's the clickstream piece. One thing we typically do is hit that signals collection we built and get back the documents that are most clicked on for this query (more or less; we do other things with it too, like time-decay models) and add that on as a boost query, in Solr parlance.
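Stacking those pieces together, a hand-wavy sketch of what one of these blended requests can look like; every field name, weight, document ID, and expansion term here is invented:

```scala
import org.apache.solr.client.solrj.SolrQuery

val q = new SolrQuery("ipad case")
q.set("defType", "edismax")
q.set("qf", "title^10 description^2") // precision: field-weighted matching
q.set("pf", "title^20")               // sloppy phrase boost on the title
q.set("q.op", "OR")                   // recall: never show a zero-result page
// Boost docs the signals collection says are most clicked for this query
q.add("bq", "id:(9924603^8.2 2339322^5.1)")
// word2vec-style expansion: terms that co-occur with "ipad"
q.add("bq", "all_text:(apple tablet)^0.5")
```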
Another common thing, especially in consumer-facing search, is that your query distribution is often a power-law distribution, so you can do some things with understanding the relationships between the head and the tail. The head is your most popular queries; the tail is the variants. What you often find in your tail queries on consumer-facing sites is that there's a pretty rich relationship back to the head. For instance, you often see in the tail that people don't know how to spell. Or the head query might be "iPad" and the tail query might be "Apple iPad 32 gigabytes with a pink case and streamers on it." You also often see in your tail queries spam, teenagers doing dumb things to your site trying to break it, SQL injections, all of those kinds of things. So you can do work here (and again, this all comes out of the clickstream data, which is the really cool thing) to start to understand what those relationships are.

Then you can also inject personalization. The really cool thing here is that this click data is like gold, because the same data that makes for better search is the same data that makes for recommendations, and for personalization, and for analytics. I can't tell you the number of organizations I go into where that's one big convoluted mess just to understand all of those things. We just put it all in Solr and life is good, and you add Spark on the side to process it. So you can inject whatever personalization you want here. If you flip some of the calculations you do on that clickstream, for instance, you get a personalization recommender: basically, give me weights and boosts for this user instead of just for this query. The caveat is: check your biases at the door. By biases I don't necessarily mean the negative human kind, just the statistical sense: the one-versus-two thing, and a lot of other biases you can read about in the academic literature.

The last bit here is that I maybe also want to inject a learning-to-rank model. I haven't talked about that a lot, but the idea with learning to rank is that instead of going off that statistical view of the tokens, things like TF-IDF and document length, you're training models built on top of features that you've chosen in the core content, or that you've injected into the system. For instance, you might use price as a feature, or the clicks a document has received, or the core content itself. You're basically letting Solr apply a model that can be leveraged even in the absence of things like clickstream, which helps you overcome the cold-start problem and deal with new content as it comes in, because the ranking model looks at the features of the data regardless of whether it has seen that document before. And then you might also inject filters or options: security, preferences, categorical things (if you know a user is in a specific category, filter on it), and so on. So again, that's the your-mileage-may-vary view of the query.

In fact, the reality these days is that if you look at best practices around search, relevance is all about experimentation. And the really cool thing about experimentation, for you as the search engineer building this, is that you get to be wrong. That's a really liberating feeling as an engineer, because now you're being data-driven. Most people don't know this, but the likes of Google and Bing run thousands of experiments all the time.
Most people don't realize that when you go to Google, you're not even in the "best" version of Google; you're in some experiment. The experiment might be that they changed the font, or it might be that they fundamentally changed their entire scoring algorithm, and they're constantly measuring what everybody's doing. To mix all my metaphors and clichés: it's mutual funds versus stock picking. The idea is to drive down the cost of experimentation so that I get an overall gain over the lifetime of the system, as opposed to having to be the one right system. This is where things like A/B tests and multi-armed bandits come in; I'll show you some of what that looks like. Ronny Kohavi, who was the head of experimentation at Bing, has a really nice talk on this stuff if you want to read more. At the end of the day, it's okay to have rules; you often need rules. Sometimes you just need that one result at the top. But you want to think about them much more rigorously. I can't tell you the number of places I've gone into where the company has thousands and thousands of rules, and I ask, what does that rule there do? And they say, oh, Joe wrote that one. Wait, who's Joe? I've never met Joe. Oh, Joe used to work here, like five years ago. And so that rule has just been sitting in the system and nobody even knows or cares what it does. So: you've got your metrics in place, you've got all these cool ways of capturing what users are doing and putting it back into Solr, and now you're going to go run a bunch of experiments.

So, show us, right? We've built this; that's our whole architecture for doing it at Lucidworks. This is not a talk about our Fusion product; it's a talk about Solr and Spark, so what I'm going to do is show you how these things fit together and talk through the process around what I've done. We've built a lot of it into our product, but we build Fusion on top of Solr and Spark and all of this. There are some key pieces at play here. At the end of the day, what you have with Solr is a really awesome ranking engine, and especially with things like streaming expressions now, and all the data analytics workloads I can run, you have an engine that's really good at this neighborhood-style calculation, where the neighborhood can actually be quite large: you might want analytics on a million documents, or ten million. It also has all the pluggability you need for the query-time side of the world: pluggable re-rankers, the learning-to-rank stuff, and it's multi-tenant. Multi-tenancy really helps us, because all those sidecar collections mean we can route content into them without fundamentally changing our operational footprint; we're just deploying more Solr. And then if you take Spark and put it right next to it, and you're smart about when and how you pull data out of Solr up into Spark, you can start to look at the whole big picture. So again: over here I'm putting in my core catalog, I'm capturing all of my signals, et cetera.
I can then pull all of that up into Spark, do joins on it, and so on, and things like Spark ML are available to me, plus lots of other machine learning packages that run on Spark. Unfortunately, Spark ML has some shortcomings in serving up models, but there are tools and capabilities out there for actually bringing Spark ML models into production (we've built some, Flipkart has some, a bunch of other people have them), and there are also other projects like Deeplearning4j and H2O and Mahout that give you the math for building machine learning models.

All right, so: demo. This is the one you can try at home. In fact, this demo was the genesis for our product. Best Buy put out a product catalog plus a month of queries. Some of you may have seen parts of this demo before, but it's much more interesting these days. Three-plus years ago there was this Kaggle competition: hey, can you improve on the relevance of these documents? They gave us their catalog and a month's worth of clicks, and one of our engineers sat down and within three days we were number five in that competition without really doing anything. It was dead-simple, brute-force Solr queries: take those signals, do a little bit of faceting on them, and inject them back into the main query. We thought, ah, we're onto something here. We'd seen bits and pieces of this before; if you're familiar with Trey Grainger at CareerBuilder, he's done a lot of talks on recommenders and signals, and a bunch of other companies have done this too.

The key pieces here: I'm going to show this on what's coming out with Fusion; we're running Spark 2.1 or something along those lines, I forget the exact number, and Solr 6.5, and we built a little UI in front of it. So let me do this while looking over my shoulder here, and I know my session has timed out. I'll start with the UI side, and then you'll see me progressively build out all the little pieces. First up is a very simple three-page site: a landing page, a search results page, and a product detail page. A couple of things to note right off the bat, since we're talking about these personalization factors and A/B tests (and I'm not sure what's going on with my screen size there, let me fix that): one of the things I've built in at the top is an A/B test. If you noticed, the first load was showing tablets; the second is showing TVs. I've got a little multi-armed bandit running here: as users interact with this set of results, I send a reward back into the system, into Solr, that says, ah, somebody voted for TVs, or somebody voted for tablets, and the bandit automatically levels the two while still giving the loser a chance to perform (maybe it's just having a bad day). You can take that quite far. And all I'm doing when the query comes in is looking up in my bandit which properties I should inject, based on its view of the world, and injecting them into the query.
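A toy sketch of that bandit logic: Thompson sampling over the two variants, with clicks as rewards. In the demo the win/loss counts live in Solr; here they're just in memory, and everything else is invented:

```scala
import scala.util.Random

// One arm per variant; wins/losses start at 1 (a uniform Beta prior)
final case class Arm(name: String, var wins: Int = 1, var losses: Int = 1) {
  // Sample from a Beta(wins, losses) posterior using two gamma draws
  def sample(rng: Random): Double = {
    def gamma(shape: Int): Double =
      (0 until shape).map(_ => -math.log(rng.nextDouble())).sum
    val a = gamma(wins)
    val b = gamma(losses)
    a / (a + b)
  }
}

val rng  = new Random()
val arms = Seq(Arm("tablets"), Arm("tvs"))

// Pick whichever arm draws the highest sample; the loser still gets chances
def choose(): Arm = arms.maxBy(_.sample(rng))

// Feed the click (or non-click) back in as the reward
def reward(arm: Arm, clicked: Boolean): Unit =
  if (clicked) arm.wins += 1 else arm.losses += 1
```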
Down here, off that same signals data, I've built a recommender. I took the signals and ran them through alternating least squares (ALS), which is one way of building a recommender, and put those documents back into Solr as well, so I've essentially hydrated all of the recommendations into a Solr collection. Here I'm doing user-to-item recommendations, and again, all this is doing is hitting Solr: the "query" in this case is the user ID, and the results are documents from the product catalog.

If we come up here (let me find my cursor again), I can do a search for iPad. The first thing I'm doing here is hitting bare-bones Solr, and I'll show you in a second what this looks like as an actual query. You're all search people in this room, I think; you know why these match. You can see there are iPads in there (let me make that a little bigger), but things like the actual iPad are down here, and I could run lots of queries and you'd see the same effect. I did some basic tuning: title matches are more important than description matches; this data set has a field called sales rank, which is some offline calculation based on revenue, and I'm using that as part of my query, and so on. But I think you'd agree with me that those results aren't that good.

So let's change to a different way of processing those results. First I select this collection, and I should probably hit save. You can see I've got a lot of different ways of processing the content; that's part of the idea here too, that I'm going to run lots of experiments. I'll come back and, good Lord willing with a live demo, you can see these results are a lot better. That's because of what this pipeline does, and I'll show you what it looks like: the query came in; I didn't do intent classification yet, sorry; I brought some signals in, I applied some rules, and then I issued the query to Solr. So essentially the query here is the clickstream stuff plus core Solr, and you can go from there. A few other things to note: this "related searches" box is just a recommender off that same signal data, where the input to the recommender is the queries against themselves. So I did user-to-item before; now I'm doing query-to-query. And the last bit of the trifecta: if I look at the product detail page, that same data can power your item-to-item recommenders. Again, the same data that makes for better search, those clickstreams, also makes for personalization, recommendations, analytics, et cetera. You put it all into Solr and you can play with it as much as you see fit. I'm showing you this in Fusion just because it makes displaying it a lot easier.

Here's my first pipeline, in our tool called the Query Workbench. At the end of the day, what it does is query Solr: your core content comes in, and it kind of goes from the top down. This, for instance, is the original, first Solr-based query.
You can see, if you're familiar with Dismax, that I'm searching across things like category names and titles and descriptions with various boost factors. I could spend the next month tuning all of that to get those weights where I want them, and the same for facets and so on. Let me make it so I can actually see my full screen, and switch over to the other pipeline I showed you, the one with signals. This pipeline is a bit different. Down here I've still got my core Solr query, kept the same; I don't have to, but for demo purposes it's good. What I've done first, though, is hit my signals, and as you can see, that's just hitting another collection in Solr; I'll show you what it looks like in a second. I feed that down, do some calculations on how much I care about those signals (how important are signals relative to my core content?), then I've got some rules set up (our rules are also stored in Solr), and then I do my Solr query. So that's the next evolution.

And then we can continue out with this. If you want to see the full bit, I think it's this one: a similar pipeline, but a bit more advanced, with the requisite marketing term, machine learning. In this case, what I've built offline is a query intent classifier; I can show you the Spark code in a second if you want. A query comes in, and the classifier tries to predict which category the query applies to, so that I can use that to apply a filter query. The next stage in my pipeline says: if the predicted category is anything other than "other," inject a filter query on the category name. So if I come in here and type iPad, it's now doing that classification and restricting the results not only by "iPad" but also to the category predicted by the model. If I turn that off (if I can actually hit the screen), you can see we're not doing that restriction anymore. Downstream I've still got my signals, my rules, all of that.

And the last bit in here: I do have learning to rank in as well. This is a little hard to read, but basically I check whether learning to rank should be on or off, and then I issue what's called a re-rank query; it's a parameter on the Solr query. I think I have this open somewhere else; it's a bit of an eye chart, but let me find it. This is what the actual query to Solr looks like, and you can see it's a lot richer than your traditional model. There's my LTR query right there, if that's legible in the back of the room. Everybody see that? I'm basically turning on LTR, telling it to use the e-commerce model, which is something I built and trained offline, and saying I only want to apply re-ranking to 100 documents. You could do 1,000, you could do 50; it really depends on how good that model is, how expensive it is, those kinds of things. All good so far?
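That LTR parameter is the same re-rank hook from before, just using the ltr query parser. Roughly, with the model name and sizes as stand-ins:

```scala
import org.apache.solr.client.solrj.SolrQuery

val q = new SolrQuery("ipad")
// Re-rank the top 100 first-pass results with the trained model;
// efi.* passes external feature information through to the model's features
q.set("rq", "{!ltr model=ecommerce_model reRankDocs=100 efi.user_query=ipad}")
q.set("fl", "id,title,score")
```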
In case you don't believe me that this is all just in Solr: here are my rules. For instance, I've got a rule that anytime a user types "customer service," we redirect them to a specific page. I think that should work; let me try it. Customer service... oh, that's not happy, that's a bad demo. Let me try a different one. Ooh, I might need to kill that. Live demos. Let me try another one; I'm pretty sure I have Black Friday in here as a rule, and that puts up this banner. If I look back at my query, one of my documents is a banner that says: anytime somebody types Black Friday, inject that rule. So those are my rules, no big deal there. As my co-founder says, it's just search.

You want to see the recommenders? Here they are: a lot of convoluted IDs and all that, but really simple. Offline, we ran an ALS model and output back into Solr the item ID, the recommendation, and a weight. That then goes into your main query, and with those weights you just sort by similarity. In fact, let me just do that. And there you've got all the necessary bits for a recommender as just a simple Solr search. Here's the really cool thing about this: if you know anything about collaborative filtering, you know it fails to understand the content, and if you know anything about content-based recommendation, you know it fails to understand what users do. With Solr, you now have a very scalable, very fast, multimodal recommender, and you can tune the knob: how much do I care about content-based recommendations, a.k.a. search, versus user factors, a.k.a. collaborative filtering? You can even make that decision per query, especially if you use query intent: the intent model might say, based on what the system has learned, that what other users have done with this data is going to matter more than the content, and you can move that knob automatically.

A little better view of the data, in terms of the recommenders (oops, sorry, let me hit the Query Workbench again), is the related-searches piece. If I come in here and do item ID, colon, iPad, and sort by that similarity score again, you can see that in this case the pivot is: my item ID is the actual query, and the other item ID is the suggested query that goes with it. Again, just search. So you can apply everything you know about Solr to this kind of data as well, your filters, your facets, et cetera, and that's why I say you can also power all of your analytics off of it.

So that's kind of it in a nutshell. If you want to see what these signals look like, the raw ones are here, and they're what you might expect, although a bit stripped down: they contain things like the doc ID, what filters were applied, what the events were, and when they happened. You can see this data set is kind of old; the click here was back in 2011. In this case the user searched for Green Lantern, clicked on this document, and had these filters applied when they did. That's the core of building out all of these learning-to-rank models and so on. The last bit here is the training data we talked about, and we've got that in Solr as well.
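The training job I'm about to walk through is roughly this shape: read the training collection with spark-solr, featurize the query text, fit Spark ML's random forest, and save the model. The collection names, fields, and output path are all made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}
import org.apache.spark.ml.classification.RandomForestClassifier

val spark = SparkSession.builder().appName("query-intent-training").getOrCreate()

// Query -> clicked-category pairs, distilled from the signals collection
val training = spark.read.format("solr")
  .option("zkhost", "localhost:9983")
  .option("collection", "intent_training")
  .load()
  .select("query_s", "category_s")

// Index the label first so the served model only needs the query text
val indexer = new StringIndexer()
  .setInputCol("category_s").setOutputCol("label").fit(training)
val labeled = indexer.transform(training)

val pipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("query_s").setOutputCol("tokens"),
  new HashingTF().setInputCol("tokens").setOutputCol("features"),
  new RandomForestClassifier().setNumTrees(50)
))

pipeline.fit(labeled).write.overwrite().save("/models/query-intent")
```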
For that query intent model I built, I wrote a little bit of Scala code; I'll show you real quick. Basically it maps query to category. I did a bunch of feature selection up front, but that's done in the code, and it outputs the results back to Solr. In fact, here's my create-training job. Let me scroll down and make that a little better; sorry, it's not scrolling very well, it's too hard on these cross screens. Let me just walk through the parts real quick. This first piece says connect to Solr; then I just load it up into Spark as an RDD, do some massaging of it, and at the very end I write it back to Solr. That's all of my training data. Then the random forest classifier I built is just running on Spark ML as well, using its random forest implementation, and you can see I've built it off of that training collection I just created. You say what filters you want, all of that kind of stuff: using Solr to generate your test and training sets for machine learning.

So, that's the demo. I think I covered everything I promised, ranging from basic Solr, the state of the world for most people, tuning Solr with your boosts and all that, through clickstream, learning to rank, personalization, bandits, et cetera. I think you've got a pretty good overview. All of that was built on top of Solr and Spark, plus the machine learning algorithms, and there's lots of good open source out there. I think we've got five minutes left. Any questions? Stunned silence? All right. Well, the Bloomberg folks have a really nice talk on this from Lucene/Solr Revolution two years ago; I think it was Diego, and I forget the other gentleman's name. The Solr docs, the manual (or RTFM), are right there; that's really useful as well. And the spark-solr project is there too, for those who aren't familiar. Up at the top is how to get a hold of me. So, all right, I appreciate the time.