Welcome everyone. These are the two words that you'll remember after this talk: exploration and exploitation. They are the main concepts behind the algorithms we are going to cover today, called multi-armed bandit, or just bandit, algorithms. Weird name; you can Google the historical reasons. They are part of the bigger family of reinforcement learning, which has become more popular than ever in recent years thanks to projects like Google DeepMind, or ChatGPT itself, which uses reinforcement learning to improve the quality of its answers. So they appear in many scenarios today. Most of the applications we use, mobile and web, are constantly learning from you, from your habits and your preferences, and they are adapting somehow. For example, Netflix. Netflix is constantly changing the home page, the images associated with the movies, to select the images that are most attractive to you. They want you to click, they want you to watch the movies. So my home page is probably different from yours, and yours, and yours. And everything is publicly documented: on the slides you'll find the study where they explain how they use these algorithms to adapt the home page. This is public, but I assume the other major streaming platforms are doing the same. So wherever you have a home page with thumbnails, they are probably using this approach. Another example is The New York Times, which is constantly changing the titles of its articles to select the most clicked one. You know how important clicks are. So they keep changing titles, looking for the one that makes you click. Clicks mean money. And again, they disclose everything, including the technical details, and I assume that most news websites, at least the major ones, are doing the same. Advertisements: in general, that's one of the main industries, one of the main fields of application of these algorithms, so reinforcement learning and bandit algorithms.
The goal is to show the right image, the right video, the right text to make you click. And there is plenty of research, plenty of articles. These examples are on the front end, on the UI, but the same approach can also be used on the back end. For example, here there is a use case for network routing, but it also applies in fields like clinical trials and financial portfolio optimization. Same approach. So how does it work? Let's take the advertisement scenario. Suppose you have three banners: blue, red and green. You have to show one of them on your website or your mobile application. You know that one of them gets clicked more, and you want clicks, because one click, let's say, brings you one dollar. But you don't know in advance which is the best banner to show. So what do you do? You enter the exploration phase, the first word. You have to learn something, so you use the basic reinforcement learning approach. You have to learn something about the environment, about the world: which banner is the most attractive. You do something, you analyze the feedback, you observe how the environment changes based on your action, and you adapt. A very simple approach, if you think about it; we do it every day, because it's one of our learning processes. It's modelled on biology. These are my kids, my children, different ages, four months and four years, but they are both using reinforcement learning to learn new skills. They try random moves, random actions, to explore and learn, and the next day they discard the actions, the moves, that didn't work and keep what they liked. We learned to survive in this world thanks, not only but also, to reinforcement learning, which is embedded in our brains, in animal brains, and so on. So let's say that we explore the environment and randomly show the banners to the users. After a while we have some performance scores, and we see that the blue banner is the most clicked.
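To make that exploration loop concrete, here is a minimal sketch in Python of what was just described: show a random banner, observe the click, and keep score. The click probabilities are made-up stand-ins for real user behavior, which the algorithm of course would not know.

```python
import random

# Hypothetical click-through rates per banner (unknown to the algorithm;
# in production, clicks come from real users).
TRUE_CTR = {"blue": 0.10, "red": 0.04, "green": 0.05}

clicks = {banner: 0 for banner in TRUE_CTR}
shows = {banner: 0 for banner in TRUE_CTR}

random.seed(42)
for _ in range(3000):
    banner = random.choice(list(TRUE_CTR))  # pure exploration: pick at random
    shows[banner] += 1
    if random.random() < TRUE_CTR[banner]:  # simulated user feedback
        clicks[banner] += 1

# Estimated click rate per banner after the exploration phase
estimates = {b: clicks[b] / shows[b] for b in TRUE_CTR}
best = max(estimates, key=estimates.get)
```

After enough traffic, the estimates converge toward the true rates and the blue banner emerges as the best choice, which is exactly the knowledge the exploitation phase builds on.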
So for every 100 users we get 10 clicks, 10 dollars. At this point we can start the exploitation phase. We explored, we got some knowledge, and now we can exploit that knowledge. The optimal strategy to maximize the profit, if nothing changes in the environment, is to always show the blue banner, right? Is there anyone who thinks that showing the blue banner is not the optimal strategy? If nothing changes, for the few people who raised a hand: you are losing money, because the best banner is fixed. But suppose something does change in the environment. There is news, there is a viral video, you know how it goes on social media. Something suddenly happens and people become interested in something else. So the environment changes, and now the red and the green are better. At this point, same question: who thinks that showing the blue banner is not the optimal strategy? Good, good, good, yeah. You got the point: it is not, now, because the environment changed. But there is a problem. I told you that the environment changed; the algorithm is still in the exploitation phase, so it keeps showing the blue banner, because it was the best action, right? So how do you solve this problem? You solve it, as you can guess, by mixing exploration and exploitation. This is the exploration-exploitation dilemma. You allocate a portion of the traffic to explore, to continuously test the non-optimal choices, because you hope to find something better. It's a risk-opportunity problem: you invest something to find a better opportunity. How much to invest is the challenge in these algorithms. The algorithms have been public for decades, but the parameters are the secret; every company tunes them. In many machine learning algorithms, the parameters are the secret. So usually you start by exploring a lot, because you know nothing, like a baby. And then you decrease: as you learn, you reduce the exploration and exploit your knowledge, but you keep some exploration rate in the background.
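That schedule, start with heavy exploration, shrink it as you learn, but never all the way to zero, can be sketched as an epsilon-greedy loop. The click rates, the decay factor and the epsilon floor below are all illustrative values, not tuned parameters:

```python
import random

TRUE_CTR = {"blue": 0.10, "red": 0.04, "green": 0.05}  # hypothetical, unknown to the agent
banners = list(TRUE_CTR)

clicks = {b: 0 for b in banners}
shows = {b: 0 for b in banners}

epsilon = 1.0        # start by exploring a lot, like a baby
epsilon_min = 0.05   # keep some exploration rate in the background
decay = 0.999

random.seed(0)
for _ in range(10_000):
    if random.random() < epsilon:
        banner = random.choice(banners)  # explore: try a random banner
    else:
        # exploit: show the banner with the best estimated click rate so far
        banner = max(banners, key=lambda b: clicks[b] / shows[b] if shows[b] else 0.0)
    shows[banner] += 1
    if random.random() < TRUE_CTR[banner]:
        clicks[banner] += 1
    epsilon = max(epsilon_min, epsilon * decay)  # decay, but never below the floor
```

Because epsilon never drops below its floor, a small share of the traffic keeps testing the red and green banners, which is what lets the algorithm notice when the environment changes.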
This is what you find in books and papers. In the real world, it works this way: whenever the environment changes, you increase the exploration rate. When I design these systems, I always define, or try to define, a metric, an indicator that tells me whether the environment is changing. When it is, it's time to increase the exploration until I know the new state. If you cannot define one, you simply sample over time: you periodically increase the exploration rate. The challenge is that usually there are not only three banners. There is this matrix, the matrix that was in the description of the session: imagine thousands or millions of possible banners, or actions in general, multiplied by thousands or millions of user profiles. All of us are in a profile. Even if we don't register on a website, from our IP we are at least labeled by country, but they can also guess our age based on patterns, cookies, and so on. The goal of the algorithm is to choose the best actions for every profile; the Netflix home page is different for profile one, two, three, and so on. And there are different strategies. You cannot just sample everything randomly, because you don't have enough traffic, you don't have enough time or resources, and sometimes performing an action, depending on the scenario, for example in clinical trials, is dangerous and expensive. It's not like showing a banner. So there are different strategies, and these are the most popular ones. If you're interested, you can click on the slides; there are some links. If you want to see some code, you can also take a look at my GitHub repo: there is a project that contains implementations of those popular algorithms, and there is also a simulator that compares their performance for the advertisement scenario. That's because I worked in that industry for four years.
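As one hypothetical way to build the change indicator mentioned above: compare the click rate over a recent sliding window against the long-run click rate, and flag drift when the gap exceeds a threshold. The class name, window size and threshold below are invented for this sketch, not taken from any particular system:

```python
from collections import deque

class DriftMonitor:
    """Sketch of an environment-change indicator: a large gap between the
    recent click rate and the long-run click rate suggests the environment
    changed and it's time to explore more."""

    def __init__(self, window=500, threshold=0.03):
        self.recent = deque(maxlen=window)  # last `window` outcomes only
        self.total_clicks = 0
        self.total_shows = 0
        self.threshold = threshold

    def record(self, clicked: bool):
        self.recent.append(1 if clicked else 0)
        self.total_clicks += clicked
        self.total_shows += 1

    def drifted(self) -> bool:
        if self.total_shows < self.recent.maxlen:
            return False  # not enough data yet to judge
        recent_ctr = sum(self.recent) / len(self.recent)
        overall_ctr = self.total_clicks / self.total_shows
        return abs(recent_ctr - overall_ctr) > self.threshold
```

When `drifted()` returns True, you would temporarily raise the exploration rate until the new state of the environment is learned, then let it decay again.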
And if you want to continue and learn more, I can also recommend this book. This one. It's 10 years old, still valid, pretty simple. It contains, in Python, explanations of the other strategies that are available as well. And so we are at the end. This is a complex topic to cover in 15 minutes, but what are the main takeaways? You have seen that we are surrounded by these algorithms. Reinforcement learning is somehow everywhere; it's embedded in us, we use it unconsciously every day. But also technically, in applications and on the web, it's somehow driving the applications. If you want to use it, you have seen that you can use it both on the front end, which is the main field, and on the back end. And even if you never implement these algorithms, it's still useful to know the concepts of exploration and exploitation, because if you pay attention, you'll start noticing them while you're using applications. When the algorithm is in the exploration phase, while you're scrolling, you'll see some new content, sometimes labeled "suggested for you" or "new for you", that is unrelated. If you click, or you stay for three seconds, it will be counted as positive feedback, and you'll start getting more of that. So most of the information we receive today is influenced by bandits and reinforcement learning. It's good to know. Is it bad? Is it good? That's up to you to decide; the important thing is to be aware. And I hope now you are. Thank you. Two questions, or if we're fast, more. Yes, so the question is about the difference between A/B testing and this approach. One difference is that A/B testing usually has only two options, while here we have multiple options, sometimes limitless options. And, not sure, probably A/B testing can also be adapted in some way, but this approach, as you noticed, can be used continuously.
So in theory, the mathematical formulas behind them allow you to retrain on the feedback from every single click you receive. And yes, I have always used these algorithms in a continuous way, let's say in learn-as-you-go mode. You start with nothing, and this is also good: there are no assumptions. You can start with assumptions, but these algorithms are pretty good at discovering things with zero knowledge, like the babies. Out of time, thank you. I hope you enjoyed it.
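The per-click retraining mentioned here boils down to an incremental running-mean update, which needs no stored click history. A minimal sketch (the class name is hypothetical, not from the repo mentioned earlier):

```python
class OnlineBanditArm:
    """Per-click retraining for one banner: update the estimated click
    rate incrementally, without storing the full feedback history."""

    def __init__(self):
        self.n = 0            # number of feedback signals seen so far
        self.estimate = 0.0   # start with zero knowledge, like the babies

    def update(self, reward: float):
        self.n += 1
        # Running-mean formula: Q_n = Q_{n-1} + (r - Q_{n-1}) / n
        self.estimate += (reward - self.estimate) / self.n
```

Each call to `update` is O(1), which is what makes learn-as-you-go mode practical: the estimate after any sequence of rewards equals the plain average of those rewards, yet only two numbers are kept per banner.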