Yes, you can see us, we are here. Yes, we saw a little. It is just a big room. OK. So, Milo will run this session from Paris, since the set-up of this session is on the technical side. So, Milo, I will be here; I will just be the guy carrying the microphone, so I will pass the mic around for the questions. But you are the boss. OK, thank you very much, Jean. Thanks everyone for joining. It is a pity I could not be there with you; I hope I will be able to join you in the coming days, time permitting. OK, so this session will be about learning applications that can be relevant for science in general. The first talk will be given by Jihao Long, from Princeton University. Recording in progress.

OK, OK. Could you help me? Yes. OK, OK. Thank you. Hello everyone. Thanks to the organizers for the invitation. I want to talk about my recent work with Jiequn Han on reinforcement learning in reproducing kernel Hilbert space.

Reinforcement learning is a class of algorithms that has achieved huge success in Go, video games, and robotics. By using powerful function approximations such as kernels or neural networks, practical reinforcement learning can deal with problems whose state space is huge and high-dimensional. However, existing theoretical works mainly focus on the tabular setting, in which both the state space and the action space are finite and the sample complexity is proportional to the number of state-action pairs. To explain the practical success of reinforcement learning and the power of function approximation, we will consider reinforcement learning with kernel function approximation in this talk.

Let us begin with a short introduction to the reinforcement learning problem. Reinforcement learning is concerned with how to maximize the expected cumulative reward in a Markov decision process. This talk will focus on the episodic, time-inhomogeneous case. In this case, the length of the MDP is finite, and the reward function and the transition probability can vary between different time steps. An episodic MDP consists of six components: a state space S, an action space A, an initial distribution mu, the episode length H, the transition probability P, and the reward function R.

Excuse me, may I ask a question about the audio? I am sorry about the connection issue. Do not worry. If we can do better, good, but if it is too complicated, we will manage as it is. Do not worry, for us it is fine. OK, so you can go ahead. Hello, is everything all right? We can hear you well. OK, thank you very much. OK.

So, the transition probability gives the distribution of the next state s_{h+1} given the state s_h, the action a_h, and the time step h. The reward function r gives the expected reward at time step h, state s_h, and action a_h. The exact forms of the transition probability and the reward function are unknown to us, but we can collect samples from them at any given state-action pair. We need to choose a policy, which gives the distribution of the action at any time step h and state s_h, in order to maximize the total reward. We denote the expected cumulative reward of an MDP M under a policy pi as J(M, pi). Here, s_h and a_h are generated by the transition probability P, the initial distribution mu, and the policy pi.
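In symbols, writing r_h and P_h for the reward and the transition probability at step h, this objective can be written as

\[
J(M, \pi) \;=\; \mathbb{E}\Big[\sum_{h=1}^{H} r_h(s_h, a_h)\Big],
\qquad s_1 \sim \mu,\quad a_h \sim \pi_h(\cdot \mid s_h),\quad s_{h+1} \sim P_h(\cdot \mid s_h, a_h).
\]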
We then use rho_h(P, pi, mu) to denote the distribution of (s_h, a_h) generated in this way. Our goal is to find the optimal reward J* and the optimal policy pi* from finitely many samples.

Next, we introduce the reproducing kernel Hilbert space (RKHS), which is closely related to kernel function approximation. Given a positive definite kernel k on S x A, we can introduce the RKHS H_k as the completion of all finite linear combinations of the functions k(x_i, .) under the so-called RKHS norm. With this, our goal is to answer when a reinforcement learning problem can be solved by kernel function approximation. We mainly focus on high-dimensional problems and on the sample complexity; other notions of complexity, such as computational complexity, are not considered here.

Although it is hard to study reinforcement learning with kernel function approximation directly, the theoretical results for supervised learning with kernel function approximation are well known. For any function f lying in an RKHS H and any fixed distribution nu, one can efficiently obtain an estimate f-hat using kernel function approximation so that f-hat and f are close with respect to the L2 norm under the distribution nu. Note that the sample complexity of this supervised learning problem does not suffer from the curse of dimensionality. In reinforcement learning, the reward function plays a role similar to that of the target function in supervised learning. This motivates us to study a more concrete question: if the reward function lies in an RKHS, what conditions on the RKHS and the transition probability ensure that the reinforcement learning problem can be solved efficiently?

Compared to supervised learning, reinforcement learning is more difficult due to the distribution mismatch phenomenon. Here we take Q-value estimation as an example. The optimal Q-value function is the optimal expected reward from time step h to the end, given that the state and action at time step h are s and a. The optimal policy is known to be the greedy policy with respect to the optimal Q-value function. Now, suppose we have an estimate Q-hat of Q* whose error is small under a given distribution nu; what about the performance of the greedy policy with respect to Q-hat? We can use the famous performance difference lemma: the difference between Q* and Q-hat is assumed to be small under the distribution nu, while the gap between the optimal value and the value of the greedy policy of Q-hat is bounded by the error between Q* and Q-hat under the state-action distribution induced by that greedy policy. The issue is that this inequality involves the distribution rho_h(P, pi_{Q-hat}, mu), but we do not know pi_{Q-hat} before we estimate Q-hat, because pi_{Q-hat} is just the greedy policy of Q-hat: we first need Q-hat, and only then pi_{Q-hat}. So it is impossible to make the distribution nu used for estimation equal to the distribution rho_h(P, pi_{Q-hat}, mu). We call this phenomenon distribution mismatch. To analyze the overall difficulty brought by distribution mismatch, we will introduce the perturbational complexity by distribution mismatch. The core idea is that although we cannot access the distribution rho_h(P, pi_{Q-hat}, mu), we know that it is a state-action distribution of the MDP generated by some policy.
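Schematically, the bound from the performance difference lemma has the following shape (only the shape is given here; the exact constants depend on the precise statement):

\[
J(M, \pi^\ast) - J(M, \pi_{\hat Q}) \;\lesssim\; \sum_{h=1}^{H}
\mathbb{E}_{(s,a) \sim \rho_h(P,\, \pi_{\hat Q},\, \mu)} \big| Q^\ast_h(s,a) - \hat Q_h(s,a) \big|,
\]

while the supervised-learning guarantee only controls the error of Q-hat in the L2(nu) norm for the chosen distribution nu. The two distributions need not coincide, and that gap is exactly the distribution mismatch.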
Hence, if we can bound the difference between Q* and Q-hat under all state-action distributions generated by all policies, we can bound the performance difference. To formalize this idea, we introduce the perturbation response by distribution mismatch as follows. First, for any set Pi of probability distributions on S x A, we define a seminorm, called the Pi-norm, on the space of continuous functions on S x A, as the largest expectation (in absolute value) of the function under distributions chosen from Pi. Second, given an RKHS H_k, a positive constant epsilon > 0, and a probability distribution nu on S x A, we define the function set B_{epsilon, nu} as the set of functions whose RKHS norm is bounded by 1 and whose L2(nu) norm is bounded by epsilon. This is motivated by the fact that in supervised learning in an RKHS we can efficiently obtain an estimate of the target function whose error is small in the L2(nu) norm. Finally, the perturbation response by distribution mismatch is just the radius of B_{epsilon, nu} measured in the Pi-norm.

We then give some examples of the perturbation response. First, if Pi consists only of the distribution nu, the perturbation response can be bounded by epsilon. This means that without distribution mismatch, the perturbation response is very small. Second, if Pi consists of all distributions on S x A, then the Pi-norm becomes the L-infinity norm, which is what is used to handle the distribution mismatch phenomenon in tabular RL. However, estimation with respect to the L-infinity norm in an RKHS may suffer from the curse of dimensionality, which is the main difficulty of RL in an RKHS compared to tabular RL. From the definition, we can see that the scale of the perturbation response measures the discrepancy between nu and Pi, and reflects the error due to the fact that we do not know the state-action distribution of the policy in advance.

We then define the reinforcement learning problem by specifying the prior knowledge of the MDP. We want to solve an RL problem whose underlying MDP belongs to a family of MDPs indexed by theta: we know the exact form of the family M, but we do not know the exact value of theta. Within the family, all MDPs share the same state space, action space, episode length, and initial distribution. In this talk, we only consider the case where the index set is the Cartesian product of the index set of the reward function and the index set of the transition probability, which means that the prior knowledge of the transition probability and that of the reward function are independent. In the lower bound part, we assume that the reward function lies in the unit ball of a general function space, while in the upper bound part, we assume that the reward function lies in the unit ball of an RKHS with kernel k. Finally, we can access a generative simulator, which means that for any time step h and any state-action pair (s, a), we can observe a next state sampled from the MDP M_theta conditioned on (h, s, a), together with a noisy reward equal to r(h, s, a) plus standard Gaussian noise. This is called one access to the simulator.

So, a general reinforcement learning algorithm can be viewed as a procedure that outputs a real number as an estimate of the optimal value of M_theta. The following form ensures that this algorithm depends only on its accesses to the simulator: each time, we choose a state-action pair based on the data observed so far, then access the simulator to get the next state and the noisy reward corresponding to this pair, and put them into a dataset.
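In compact form, the objects just introduced are as follows (this restates the definitions as described above; details such as the absolute value in the Pi-norm may differ slightly from the exact statement in the paper):

\[
\|f\|_{\Pi} \;=\; \sup_{\rho \in \Pi} \big|\mathbb{E}_{(s,a) \sim \rho}\, f(s,a)\big|,
\qquad
B_{\epsilon, \nu} \;=\; \big\{\, f \in \mathcal{H}_k : \|f\|_{\mathcal{H}_k} \le 1,\ \|f\|_{L^2(\nu)} \le \epsilon \,\big\},
\qquad
\Delta(\Pi, \epsilon, \nu) \;=\; \sup_{f \in B_{\epsilon,\nu}} \|f\|_{\Pi}.
\]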
Finally, we compute an estimate of the optimal total reward based on the whole dataset. Here, an additional random variable denotes the internal randomness of the algorithm. Our goal is to find the best algorithm, namely the one minimizing the worst-case error over theta.

In our paper, we consider two cases. One is a simple case, the known transition case, in which we know the transition probability, so the index set of the transition probability is a singleton. The other is the general case, in which we do not know the transition probability and its index set is a general set. Due to time limitations, I can only discuss the known transition case today; you can find the results for the general case in our paper.

Here, we use Pi_h(P, mu) to denote the set of all possible distributions of the state-action pair at time step h under the transition probability P and the initial distribution mu, and Pi(P, mu) refers to the union of the Pi_h over the different time steps. We define the perturbational complexity by distribution mismatch of an MDP family M in the known transition case by replacing Pi with Pi(P_0, mu), the set of all possible state-action distributions, and optimizing over the choices of the estimation distribution nu. We show that the perturbational complexity of M is a lower bound on the worst-case error of any reinforcement learning algorithm with n samples. This result therefore shows that the sample complexity of reinforcement learning on the MDP family M is related to the decay rate of the perturbational complexity with respect to epsilon.

To get an upper bound based on the same quantity, we introduce the fitted reward algorithm. In this algorithm, we first find the distribution nu-hat that minimizes the perturbation response, then sample n^2 points from nu-hat at each time step and access the noisy rewards at these points. We then estimate the reward function at each time step by a standard kernel regression. The output of this algorithm is just the optimal value of M-hat_theta, which is M_theta with the reward function replaced by the estimated one. Our result shows that the error can again be bounded by the perturbational complexity. Notice that the sample size in the upper bound is n^2, while the lower bound uses n samples, so our lower and upper bounds do not match and there is still room for improvement.

Finally, we discuss some properties of the perturbation response and the perturbational complexity, and applications of our results. If we can find a distribution nu such that the density ratios of all distributions in Pi with respect to nu are uniformly bounded, then the perturbational complexity, or the perturbation response, decays fast. This corresponds to the concentration coefficient assumption in the reinforcement learning theory literature. Hence, under such an assumption we can recover the common results in the literature through our upper bound. Another common assumption in the literature, used to prove convergence of reinforcement learning algorithms with kernel function approximation, is fast eigenvalue decay of the kernel; the perturbation response with respect to the set of all probability distributions is closely related to this eigenvalue decay. Since the perturbation response with respect to the set of all probability distributions is the largest among all choices of Pi, we can conclude that the perturbation response decays fast for any set Pi whenever the eigenvalues decay fast. On the other hand, when the eigenvalue decay is slow, the right-hand side of this inequality decays as slowly as possible, and one can construct a hard reinforcement learning problem by making the set of all possible state-action distributions of the MDP close to the set of all probability distributions.
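Going back to the fitted reward algorithm described above, here is a minimal sketch of its regression step at a single time step. It is for illustration only: the Gaussian kernel, the synthetic reward, and the uniform sampling distribution standing in for the optimized nu-hat are assumptions made for this sketch, not details taken from the paper.

import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)); assumed kernel for this sketch
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

rng = np.random.default_rng(0)

def true_reward(xa):
    # synthetic reward, used only to simulate the generative model
    return np.sin(xa.sum(axis=1))

n = 30        # number of simulator accesses at this time step
dim = 4       # dimension of the (state, action) representation
lam = 1e-2    # ridge regularization parameter

# Sample state-action pairs from the chosen estimation distribution (uniform here).
XA = rng.uniform(-1.0, 1.0, size=(n, dim))
# One simulator access per pair: expected reward plus standard Gaussian noise.
y = true_reward(XA) + rng.standard_normal(n)

# Kernel ridge regression: alpha = (K + lam * n * I)^{-1} y.
K = gaussian_kernel(XA, XA)
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)

def reward_estimate(xa_new):
    # r_hat(x) = sum_i alpha_i k(x_i, x)
    return gaussian_kernel(xa_new, XA) @ alpha

# Evaluate the estimated reward on fresh state-action pairs.
test = rng.uniform(-1.0, 1.0, size=(5, dim))
print(reward_estimate(test))

In the known transition case, the algorithm then simply returns the optimal value of the MDP in which this estimated reward replaces the true one.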
So, one can conclude that when the eigenvalues decay fast, we can expect favorable performance of reinforcement learning algorithms, because the perturbational complexity decays fast. But when the eigenvalue decay is slow, as for the Laplacian kernel and the neural tangent kernel on a high-dimensional sphere, the lower bound in that inequality decays very slowly. In this case, the knowledge of Pi plays a vital role. A direct implication is that if the state space is a single point and H equals 1, then the reinforcement learning problem essentially reduces to finding the maximum value of a reward function lying in the unit ball of an RKHS from noisy evaluations at n points. Due to the hardness of this question, if RL in an RKHS needs to handle a high-dimensional action space, we must assume fast eigenvalue decay to break the curse of dimensionality.

OK. To conclude: the perturbational complexity by distribution mismatch gives a lower bound for the error of any reinforcement learning algorithm on the corresponding reinforcement learning problem. In the known transition case, the perturbational complexity also gives an upper bound on the error of the fitted reward algorithm. In the unknown transition case, which is also covered in our paper, it gives, under an additional assumption, an upper bound on the error of the so-called fitted Q-iteration algorithm. All results based on the perturbational complexity generalize the existing fast-convergence results that rely on finite concentration coefficients or fast eigenvalue decay. Finally, in our paper we give a concrete example in which the reward function lies in a high-dimensional RKHS and the action space is finite, but the corresponding reinforcement learning problem cannot be solved efficiently. Due to time limitations I cannot go into this example in detail, but you can find it in our paper. OK, thank you for your attention.

Thank you, Jihao. We cannot hear the applause, but we can see it from where we are. So, are there any questions in the audience? I will ask my fellow organizers to relay what is going on in the room. I'm on it. OK, there are no direct... Hello, hello, can you hear me? Thank you very much for the talk. I was wondering if there is any way around the curse of dimensionality, and if you could give us a little intuition about the concrete example that you mentioned.

OK, OK. So, this one, right. The interesting thing is that, as I said, this term can decay very slowly for kernels like the Laplacian kernel or the neural tangent kernel. So what we can do is construct an example in which the set of all possible state-action distributions is very close to the set of all distributions on S times A. In that sense, the lower bound is very close to the worst case over all distributions, and since estimation in the L-infinity norm for the neural tangent kernel or the Laplacian kernel suffers from the curse of dimensionality, in this way we can establish the hardness of that example. Is that clear? It is, thanks. Any other question?
Right, I think on our side there are no more questions. So maybe I can ask a quick question then. I am no expert in reinforcement learning, but I think it is a nice study where you can get some theoretical understanding of those complicated learning frameworks, and I was wondering, from your perspective, how far the assumptions you need to make are from practice, and whether you see any further steps you could take closer to practice while retaining some analytical conclusions.

OK, so basically there are two major gaps in this work. The first is that we assume prior knowledge of the possible transition probabilities and reward functions, which may not be true in practice. The other is that we assume the reward function lies in the unit ball of an RKHS, which means we do not know the reward function; but in practice, in many RL problems the reward function is known, and in that case our results do not apply directly. So I guess these are the major gaps, and we will try to address them.

OK, OK, thank you. Any more questions in the audience? I think we are good. Thanks a lot again. Yeah, thanks again, good job. You're welcome. And so our next speaker will be Mario.