So how many of you have heard about Practo? Awesome. How many have tried online consultation on Practo? Nice. For those who haven't, Consult is an online consultation platform where a user can post a health query, and we as a platform connect it to a doctor who replies with medical advice. In such a platform, if you want to optimize for the users, you would assign all the questions to the set of doctors who are performing well, who are answering a lot. But over a period of time this set of doctors, usually a very small fraction, will get a lot of questions assigned and won't be able to answer all of them. They would feel burned out, while the rest of the doctors would get very few questions and wouldn't stay engaged with the platform. So if you optimize for the users, you end up with a bad experience for the doctors.

On the other hand, if you optimize for the doctors, you would assign the same number of questions to each doctor at random. But not all doctors are equally active; some don't even open their app to see the questions assigned to them. Those questions would obviously end up unanswered and hurt the user experience. So if you optimize for your doctors, you get a bad experience for your users. You come across this kind of trade-off in any marketplace, even on e-commerce sites like Flipkart or Amazon: if you optimize for buyers, you give sellers a bad experience, and vice versa. In this talk, I present how we used bandit algorithms to solve this trade-off problem. The key takeaways from this presentation are to understand how multi-armed bandits work, how you can use them on the problems you work on, and how we used them at Practo Consult.

So let's begin. Say a guy has just come to Bangalore and wants a good dining experience, but he has no idea which restaurants are good and which are bad. He asks a couple of his friends, they suggest a couple of restaurants, he goes there, and he likes them. So he makes two buckets: one with the restaurants he liked, the other with those he has never explored. The next time he wants to eat out, he plays it safe and goes to the restaurants he liked. More often than not, say nine times out of ten, he visits those. But he gets bored and occasionally wants to explore new restaurants, so he picks one from the unexplored bucket, and if he likes it, he adds it to the bucket of liked restaurants.

What we are seeing here is a multi-armed bandit setting. The restaurants on the right side are your multiple arms, and the reward this guy is trying to optimize is a good dining experience. The way he does it is through an explore-exploit strategy: exploiting nine out of ten times and exploring one out of ten times, so that he doesn't miss out on new restaurants. This is the simplest form of multi-armed bandit algorithm, called epsilon-greedy, where epsilon stands for a probability. Here epsilon is 0.1, which means 10% of the time he explores and 90% of the time he exploits. This guy really liked the strategy. After some experimentation he had explored all the restaurants, and he was no longer stuck with the same few places.
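To make that concrete, here is a minimal sketch of epsilon-greedy in Python. The "restaurant" reward rates, the number of trials, and the helper names are illustrative assumptions, not anything from the talk itself:

```python
import random

def epsilon_greedy(values, epsilon=0.1):
    """Pick an arm index: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(values))                  # explore: random arm
    return max(range(len(values)), key=lambda i: values[i])   # exploit: best arm so far

def update(counts, values, arm, reward):
    """Incrementally update the running average reward of the chosen arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

# Hypothetical usage: three "restaurants" with unknown true reward rates.
true_rates = [0.2, 0.5, 0.8]
counts, values = [0, 0, 0], [0.0, 0.0, 0.0]
for _ in range(1000):
    arm = epsilon_greedy(values)
    reward = 1 if random.random() < true_rates[arm] else 0    # did he like the visit?
    update(counts, values, arm, reward)
print(values)  # the estimates approach true_rates, and arm 2 gets pulled most
```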
So coming back, Practo has a lot of use cases for patients: you can search for doctors, clinics, or hospitals; you can book an appointment, consult online, or order medicines. The use case we're looking at here is online consultation. A typical consultation form looks like this: you post your problem, and you also select a problem area, the specialty in which you want the question to be asked, so we only send the question to doctors from that specialty.

Once the question is posted, we have the task of assigning it to one of the doctors on our list. Let's say today we have with us doctors Shyam, Alexa, Rahman, and Bob. Bob is a surgeon and looks busier than the other doctors. Somehow we have figured out their timings, and if you look at the availabilities, Shyam is available on weekends, Alexa in the evenings, Rahman in the mornings, but Bob has a lot of surgeries scheduled for the week; he is simply busy that week. Now say a question comes in on a Wednesday morning. Even if, hypothetically, Shyam is the best-performing doctor, we won't assign it to him because he is not available on weekdays. We assign the question to Rahman instead, because he is available in the mornings. Using the context like this is what we call contextual assignment.

How do we do that? For every question, we build what we call a contextual feature vector. The features can be the number of words, the time the question came in, the age and gender of the user, and so on. We multiply that contextual feature vector with a parameter vector for each doctor to get what we call a predicted reward. The predicted reward is nothing but how likely this doctor is to give you the fastest response: the higher the reward, the faster the response you can expect from that doctor. So it's straightforward: you pick the doctor with the highest predicted reward and assign the question to him. Here, Rahman has the highest reward, so the question goes to him. But the important question is: how do you estimate the parameter vectors? We look at the previous assignments made to each doctor and the true rewards we got back, and then apply simple ridge regression to estimate the parameter vectors.

So all is good, and we made a few assignments. Over a period of time, we observed that Bob got very few assignments compared to the other doctors: Shyam got 30, Alexa 50, Rahman 45, and Bob only 9, because he was not answering a lot of them. But this doesn't mean Bob won't answer questions at all; he was just busy that week, so we should give him more chances. How do we do that? Say a new question comes in on a Sunday afternoon. Using the contextual feature vectors, we calculate the predicted rewards as before. Along with that, we now calculate what we call a regret bound. The regret bound is nothing but a measure of how little information you have about a doctor: the less information, the higher the regret bound. In this case, Bob has very few assignments, so we don't know much about him, and his regret bound is higher. We add this regret bound to the predicted contextual reward, sum it up, and pick the best. Now Bob gets a chance to answer. This is the second algorithm in multi-armed bandits, called upper confidence bound (UCB). If you are familiar with statistics, it's very similar to giving an estimate of a parameter as mu plus or minus sigma.
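This recipe of a ridge-regression prediction plus a regret bound matches the LinUCB algorithm from the paper cited at the end of the talk. Below is a minimal sketch of that idea; the feature dimension, the alpha value, the doctor names, and the example feature vector are all made-up assumptions for illustration:

```python
import numpy as np

class LinUCBArm:
    """One arm (doctor): a ridge-regression estimate plus an upper confidence bound."""
    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(dim)    # accumulates X^T X + I (the ridge regularization)
        self.b = np.zeros(dim)  # accumulates X^T y

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                        # ridge estimate of the parameter vector
        predicted = theta @ x                         # predicted reward for this context
        bound = self.alpha * np.sqrt(x @ A_inv @ x)   # regret bound: large when we know little
        return predicted + bound

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Hypothetical usage: 4 doctors, 5 context features per question.
doctors = {name: LinUCBArm(dim=5) for name in ["Shyam", "Alexa", "Rahman", "Bob"]}
x = np.array([0.3, 1.0, 0.0, 0.45, 1.0])  # e.g. word count, time of day, age, gender, ...
best = max(doctors, key=lambda name: doctors[name].ucb(x))
print(best)  # assign the question to this doctor
# Later, once we observe how quickly the doctor replied:
doctors[best].update(x, reward=1.0)
```

Because a rarely assigned doctor like Bob has a nearly untouched `A` matrix, his bound term stays large, so he keeps getting chances until we have learned enough about him.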
The less information you have, the smaller the sample size, the wider the confidence interval, because you're not sure of the parameter value. Similarly here, you're not sure about Bob, so he has a higher regret bound. We implemented this algorithm, and among the results we saw: with a random initialization of the weights, we got a 60% reduction in response times within 10 days. Interestingly, we also saw a 25% increase in doctor engagement, which means we were able to tap into the doctors who were actually participating in the platform and send more questions to them. That was for Consult.

Let's see some other use cases where you can apply bandit algorithms. One common use case people mention is A/B testing. A user comes to your platform, and you have two versions of your website; say the true conversion rates are 0.3 and 0.2. Let's compare A/B testing with the simplest multi-armed bandit algorithm, epsilon-greedy. Here the blue curve represents the average conversion of epsilon-greedy and the green curve the average conversion of A/B testing. For those who don't know, in A/B testing you run an experiment for X trials, randomly showing one version to each user, and after the X trials you pick the version with the highest conversion rate. Because A/B testing was doing random assignments, its average conversion was lower than epsilon-greedy's. Epsilon-greedy, the dumbest of all the multi-armed bandits, was still able to perform better than A/B testing. But after a thousand trials, A/B testing starts to perform better, because it picks the best version, version A, and shows it to every user, whereas epsilon-greedy is still sending 10% of the traffic randomly to each of the versions.

There are two points to take from this. First, if the behavior of the versions changes, it's good that we keep exploring, and the bandit algorithm will adapt to those changes. Second, there are techniques in bandit algorithms, like annealing, that make you explore less as you collect more data, so you can close this gap with A/B testing (there's a small sketch of this after the references below).

Some other applications: if you want to send user notifications and don't know what time to send them, you can apply bandit algorithms to come up with a personalized time for each user. Similarly, news recommendations. You might read mostly data science journals, so you'd be seeing a lot of data science articles. But what if you're also interested in bitcoin and never come across such articles? Bandit algorithms would occasionally probe your interest in areas you have never seen before, and if you liked them, you'd start exploring those new areas. So if anyone in the audience is from Inshorts, you have a cool new project to try.

These are the two references. For anyone who wants to get into multi-armed bandits, Bandit Algorithms for Website Optimization by John Myles White is a very good starter book. And the paper, "A Contextual-Bandit Approach to Personalized News Article Recommendation", is the one we implemented.
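As promised, here is a minimal sketch of an annealed epsilon-greedy schedule, using the two-version website example from above. The 1/log(t) decay and the conversion rates are illustrative assumptions; it mirrors the "0.1 divided by log of N" idea that comes up again in the Q&A:

```python
import math
import random

def annealed_epsilon(t, epsilon0=0.1):
    """Exploration probability that shrinks as trials accumulate.
    Equals epsilon0 at t=0 and decays roughly like 1/log(t)."""
    return epsilon0 / math.log(t + math.e)

# Hypothetical two-version website test with true conversion rates 0.3 and 0.2.
true_rates = [0.3, 0.2]
counts, values = [0, 0], [0.0, 0.0]
for t in range(10_000):
    if random.random() < annealed_epsilon(t):
        arm = random.randrange(2)                         # explore, less and less often
    else:
        arm = max(range(2), key=lambda i: values[i])      # exploit the better version
    reward = 1 if random.random() < true_rates[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
print(counts, values)  # traffic concentrates on version 0 as exploration decays
```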
So lastly, when to use or not use bandits? If you have a short experiment, like a Facebook campaign or a marketing campaign, with a couple of different versions to test, say a message like "Sign up and get 30% off" versus "Please sign up and get a discount of 30%", and you don't know which version would give you the maximum conversion, you don't have time to run an A/B test and arrive at the better version. If instead you put a bandit algorithm there, it automatically optimizes for the better version and redirects the traffic to it. Also, if you have continuous experiments, as in Consult, where doctors keep joining or moving away from the platform, or their behavior keeps changing, as we saw with Bob, you can use a bandit algorithm that continuously adapts and tries to optimize for the conversion. When not to use bandits: when you want to run a test and infer which version is better than the other. A bandit doesn't care which version it is; it only cares about the conversion. A/B testing should be preferred there, and one example is clinical trials. Yeah, that's it. Questions?

You mentioned the bandit algorithm and A/B testing, where A/B testing started improving after a certain number of trials in one of the slides. So when, or at what cutoff, would you say you'd switch from A/B to bandit? What time frame are we looking at?

Usually a bandit is used as an alternative to A/B testing. All an A/B test does is find the version with the highest conversion and show it to 100% of your traffic. But do we need to show that same version to every user? Maybe the other version is close enough, or maybe with more traffic the second version turns out better than the first. A bandit algorithm tries to adapt to that. So it's an alternative to implement instead of A/B testing; it's not that you use a bandit after A/B. Yes, that was a random example. The number of samples depends on the power of the test you want. If both versions of your website are roughly the same, you need more data to conclude which version is better, so a bandit would be better there, because it automatically tries to maximize the conversions without asking which version comes out on top. But if you want a stable website, where you just want one stable version, then you can choose A/B testing and keep that version. We'll take another question; we can take this one offline. Any questions?

So basically, when the algorithm converges, it's more or less like A/B testing, right? The performance would essentially be like A/B testing.

When the algorithm converges, you would still have some amount of exploration happening in bandit algorithms. You would always have some percentage of traffic assigned a random or less efficient page, just in case the behavior changes. You can actually set the bandit algorithm to explore less as you collect more data. More questions?

Hi. My question is about the exploration probability. Two points. One, how do you determine the rate at which you reduce the exploration probability over time? And second, how does the context of the problem influence the size of the exploration probability?
For instance, if it's about news, people might have more of an inclination to explore, so they'll tolerate it.

I didn't get your second question.

The size of the exploration probability: how do you think about the influence of the problem's context on that size? Take news, for instance. People are probably more open there; one out of five being a random article doesn't hurt me that much, and there's every chance it's something exciting. But take something like getting medical advice. If one out of five times you give me a doctor who is not the best qualified to answer my problem, I'm probably going to be more upset about that; it's going to do more damage. So how do you think about the context influencing the exploration probability?

So if I understand it right: how do we define the exploration probability based on the context, and how does it shrink over time? If you are sure that all the arms have got enough samples, you can use a simple measure, say a log of N, or anything simple like entropy, which gives you some measure of how much data has already been seen. If you already have 1,000 or 2,000 samples collected, that gives you a measure of how much exploration has already been done, and based on that you can reduce the exploration probability. So it's something like 0.1 divided by log of N; it keeps decreasing as N increases. And on the second question: it depends on the context and how critical the conversion is. In a healthcare scenario, the conversion might be far more critical than in other contexts. So it depends on the use case.

Last question, anyone? Cool. I guess that's it then.

Thank you, Santosh. That was a great talk.