Thank you for the great introduction, and thanks to Avi as well for the fascinating talk. First off, I wanted to thank the organizers for perfectly scheduling my talk, because I do plan to build upon some of the ideas that Avi mentioned, particularly around using theory to guide models, and also quite a bit on Chris's talk this morning, which was about opening up the black box and seeing what's really driving your inputs and how they translate to the output.
So I'm Ashish Kabra. I am an assistant professor at the University of Maryland, in a department called Decisions, Operations and Information Technology, which is a nice overlap of people from operations research, industrial engineering, computer science, economics and statistics. We are housed in the business school there, so the main idea is to work closely with business applications rather than just building theory on its own, which can sometimes happen in academic departments.
What I want to talk about today is using structural estimation methods, which have their roots in economics. More than a concrete method, they are a guiding philosophy or a paradigm, and the idea is to follow the data-generating process more closely: think deeper about what's really going on. It's not too different from the idea now popularized by Elon Musk, which is to think from first principles, and that is what I'm going to do today. There is no single concrete formulation; these models take various forms depending on the exact problem that you're studying. I do have a very concrete application in mind, which is bike sharing systems. In general you would want to apply these methods when you don't just care about making predictions from certain inputs to certain outputs, but also want to use the discovered relationships to make some sort of decision on top of that.
So with that, let me just start. When I'm thinking about solving a particular problem, especially in this kind of context, I want my model to have two main features.
The first thing I would like my model to have is interpretability. That is, I do want to understand how my inputs are driving my outputs; I am not really comfortable using a black box model. There are several advantages to that. One, it's very easy to communicate to managers and stakeholders what's really going on, so that they are more comfortable using the output of the model to make decisions: setting prices, deciding on acquisitions and things like that. In addition, catching errors or inconsistencies in your data becomes much easier if your model is really transparent; otherwise things would just flow through without you ever noticing that something is amiss. Many biases could exist in your data, especially racial or gender-based biases, and if your model is not transparent enough you might just let them pass through, whereas if you have a good understanding of what's driving input to output you will be much more cognizant of what kind of biases exist in your data. And finally, once you have a very transparent view of your model, you are more cognizant of its limitations, and therefore you will probably not apply it to situations where it should not be applied, and that is a good thing.
So that's the first thing I'm looking for: my model should be interpretable. It should make sense to me.
The second thing, which again builds on Avi's talk, is that I would like my model to capture causal relationships. We are all very familiar with the idea that correlation is not causation. There are many funny examples floating around the internet; this is one, where there is a huge correlation between people drowning after falling off a fishing boat and the marriage rate in a certain state. It would be foolish to say that one thing drives the other and to try to influence one in order to make an impact on the other. That's the main idea: you want to establish a causal relationship if you want to use your model for any kind of decision making.
This idea is well understood, except that what really happens in practice is that we get so hung up on throwing thousands of models at the data, chasing some measure of accuracy and seeing what works, that this idea takes a backseat. We never really come around to checking whether our model is a causal model or just an association, and things can go really wrong if we are willing to ignore this. So that will be the focus of the modeling exercise that I'll do next.
So let me get to the application that I have in mind, which is bike share systems. Just to give an overview of what these systems look like: by bikes I mean bicycles, which is a little different from how the term bike is used here. The basic idea is that a bike share system is meant to be used by the public as a mode of public transportation, shared by different people.
So basically you have bikes that are sturdy enough to be used by many people. Typically you have stations with empty docks, and the bikes sit in those docks. The way people use them is that you sign up for some sort of daily, monthly or yearly subscription, you go to a particular station, swipe out a bike, and ride it to any destination of your choice, which doesn't have to be the same place you started from. And that's it, you finish your journey.
That's how these systems typically work. People use them for a lot of short trips from one place to another. Another popular use is first-mile and last-mile connectivity: you are planning a longer journey, so you do the first mile using a bike share system, go to a metro station, take the metro trip, and then take another short bike trip to your work destination.
What I'll talk about today will be in the context of bike share systems, but a lot of it applies much more broadly to similar systems. In particular I'll be focusing on two features. The first is the location aspect: the location of a user and the location of the supply, say the nearest bike, is quite an important factor in determining whether a match really happens. That's the location, or hyperlocal, aspect. The second is the on-demand nature of these systems: you don't typically reserve them well in advance.
You just have a need for, let's say, a bike trip right now, and you open an app, go to the nearest station and check out a bike right then and there. So this temporal, on-demand nature is quite important in these systems.
If you think about other smart mobility systems, be it the dockless bike share systems that are coming up now, or Zoomcar launching the PEDL system or something like it, all of them have both the location and the on-demand features. If you think of Uber or Ola, again it's very similar: the driver has a particular location, I have a particular location, and for the match to happen we have to be close enough at the same time. More broadly, delivery platforms, be it Swiggy, BigBasket and so on, all have very similar features. So a lot of what I'll talk about today applies much more broadly.
The very concrete system that I'll be looking at, and that I worked with, is the Vélib' system, which is a huge bike share system in Paris. It was launched in 2007. It's one of the most popular bike share systems in the world and was also one of the first modern systems of its kind; a lot of other systems have followed it in how they are designed. It has more than 1,200 stations, about 20,000 bikes and hundreds of thousands of subscribers. In terms of trips, it has done over 173 million in the last six years, and it has had a huge impact: a carbon impact, with car trips being replaced by bike share trips, and a lot of health benefits as well, since people bike instead of sitting in a cab.
That's one of the fascinating angles of working with these kinds of systems for me: there is the potential to make an impact on the climate or on the health of the community, in addition to just doing data science.
When I think about these systems, a lot of the concrete problems depend on understanding two main features of how users behave. The first is how users think about the distance they have to walk to reach a nearby station. If the distance to walk is 500 meters rather than 100 meters, how much less likely are they to make that journey and use the system? That plays an important role.
The second aspect, which underlies a lot of the problems we can study, is how people think about the availability of bikes. Say I have a preferred station where I usually check out a bike for my journey, but at this point in time the station does not have any bikes. What do I do next? Do I substitute to a nearby station, in which case I still make the journey using the system, or do I abandon the system and do something else, let's say order an Uber, and not make a bike trip at all? How do users behave with respect to availability?
The reason understanding these two things is important shows up in a lot of questions about the system. Take, for example, where I should locate these stations.
Well, it's important to understand how people think about walking distances, because if people are very sensitive to walking even a few hundred meters, then I need a lot of stations in the city, whereas if people are okay walking five hundred meters or even more, then I don't need as many. So once I understand how people think about walking distances, that will guide how dense I want my system to be.
Or suppose I want to understand how I should manage the inventory of the system: how many bikes do I need, and how should I think about rebalancing, that is, bringing bikes from the full areas to the empty areas? What will guide those algorithms or strategies is how people think about availability. Are people very sensitive to not finding a bike at a station? In that case I'll have to be very careful to make sure the supply is there at each and every point in time. If people are okay not finding a bike once in a while and substitute to nearby stations, then I don't need to be so stringent about my inventory management policies.
So that is basically the agenda of my talk today. I'll try to uncover how people think about walking distances and how people think about bike availability, and then use those user-level primitives to guide a lot of system design questions. It will turn out that these are somewhat non-trivial problems. I'll first illustrate how traditional approaches struggle to capture a lot of the realities of modeling these user features, and then how structural estimation methods can guide us to a better answer.
Okay, so the data set that I'll be using comes primarily from the Vélib' system in Paris. This primary data set looks very simple.
I'll complement it with a few more data sets from other sources, but the basic idea is that I observe each and every station at a very regular interval, every two minutes in this case. So what I have is a station ID, say station 666, and every two minutes I observe the number of bikes that station has. That's it. That's the basic structure of the very simple data set that I'm starting with.
Once I have that, I can uncover the demand at each of these stations in every two-minute interval: one bike was taken out, three bikes were taken out, and so on. That's the number of trips originating at each station. I can also construct from it whether a station has any bikes or not, a binary measure of whether that station is available to users. That will be an important factor in understanding the availability aspect. So that's the basic structure of the data set.
Now, let's come back to the question of how people think about walking distances. Another way to put that question is this. Let's say this is my existing design: I have four stations in my city, a blue station, a yellow station, a pink station and a green station, and I'm considering moving the blue station by 100 meters. Based on the data that I have collected so far on how people are using the stations, that is, the demand at each of them, I want to say something about what the new demand will be once I move the blue station by 100 meters. That is another way to put the accessibility question.
So let's say I start in a very traditional way. The dependent variable that I'm thinking of is the demand at station f at time t. That's what I care about, and I want to build a model of what drives that demand.
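As a rough illustration of the data preparation just described, here is a minimal Python sketch that turns two-minute bike counts at one station into trips taken out and a binary availability flag. The function name and the sample counts are made up for illustration. It also assumes that a net drop in the count equals the number of departures, which undercounts whenever a bike is returned and another is taken within the same interval.

```python
import numpy as np

def departures_and_availability(bike_counts):
    """bike_counts: bikes observed at one station, one value every two minutes."""
    counts = np.asarray(bike_counts, dtype=int)
    change = np.diff(counts)  # net change between consecutive snapshots
    # Assumption: a net drop of k bikes is read as k trips starting here.
    departures = np.where(change < 0, -change, 0)
    # Station counts as available if it had at least one bike at the
    # start of the interval.
    available = counts[:-1] > 0
    return departures, available

snapshots = [5, 4, 4, 1, 0, 0, 2]  # hypothetical counts, not real data
dep, avail = departures_and_availability(snapshots)
print(dep.tolist())    # [1, 0, 3, 1, 0, 0]
print(avail.tolist())  # [True, True, True, True, False, False]
```

Note that the two intervals with no bikes show zero departures by construction, which is exactly why availability has to be tracked alongside demand.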
So what I start doing is building in different features that could impact that demand. Let's say I start with some neighborhood characteristics: is there a metro nearby, is it a commercial location, are there cafes nearby, and so on. All of that should have some impact on whether a station is popular or not. I can include some other characteristics: whether it's raining, whether it's a hot day, the humidity and so on. Those could potentially have an impact on the demand at the station.
But remember, I also want some notion of where exactly the station is located, the distance aspect, and how that drives users to use the system or not. One way to think about accessibility here is to include a measure of the distance to the nearest station. If the nearest station is very close, then typically the users coming to this station do not have to walk all that much, so they might be willing to use it more often. So distance to the nearest station could be one proxy for including the distance effect in my model. I can go a little bit crazy and include the distance to the second nearest station, the third nearest station and so on. All of those would be proxy measures for including the distance effect in my model.
But then I take a step back and ask: have I really captured how the demand is driven by the location of the stations? It turns out not exactly. Let's say I'm focusing on this blue station over here, and my measures say that the nearest station is 100 meters away and the second nearest station is 200 meters away. The design could still look either like this or like this; in either case my features look exactly the same.
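The point about two different designs producing identical proxy features can be checked with a small sketch. The coordinates below are hypothetical and chosen only so that the nearest and second-nearest distances from the blue station are 100 m and 200 m in both layouts.

```python
import numpy as np

def nearest_distance_features(target, others, k=2):
    """Distances from the target station to its k nearest neighbours (metres)."""
    diffs = np.asarray(others, dtype=float) - np.asarray(target, dtype=float)
    return np.sort(np.linalg.norm(diffs, axis=1))[:k]

blue = (0.0, 0.0)
layout_a = [(100.0, 0.0), (200.0, 0.0)]   # both neighbours on the same side
layout_b = [(100.0, 0.0), (-200.0, 0.0)]  # neighbours on opposite sides

feat_a = nearest_distance_features(blue, layout_a)
feat_b = nearest_distance_features(blue, layout_b)
print(feat_a, feat_b)
print(np.array_equal(feat_a, feat_b))  # True
```

In layout A the blue station has no competitor on one whole side, yet a model fed only these two distances cannot tell the layouts apart.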
Yet we know, just by looking at the two designs, that in one of them the blue station would have more users coming to it, simply because there is no other station in that particular area. So the spatial positioning of stations is not really captured by the distance measures that I'm including here. I can then go ahead and try to construct features that capture not just the distances but also some sort of 2D measure of how the other stations are positioned around this one. But essentially what I'm doing is coming up with a lot of proxy measures to capture the single notion of how people think about walking distances. So it's getting not that clean.
Think of some other things I also want to model. There is a metro station which is 300 meters away, so that's the distance part, but it also has a lot of people coming through per hour. Let's say more than 1000 people use it very regularly. To include that kind of feature in this particular model, I have to come up with a lot of interactions built around the distance aspect and the number-of-people aspect. What's really going on is that I'm including a lot of proxy features and a lot of interactions, and my accessibility aspect is getting scattered all over the place. I'm not really getting hold of the thing that I'm after.
So let me start from a slightly different angle. Instead of using the approach we are all familiar with, having an output and putting in some features, let me start thinking about what really drives a single unit of demand at a station: what really happens to make a user use that particular station at a given time.
Think of this as one particular use case. A user gets out of a metro station and is now thinking, how do I get to work? He has a few choices: either order an Ola and not think about the bike share system at all, or, knowing that there are a few stations a certain distance away from him, go and use one of the bikes from those stations. That is what generates a demand at a station.
If we think about the demand generation process like this, then we are still capturing the accessibility effect. Say I move the blue station away by 100 meters and think about this process again. There is this user who got out of the metro station and is now considering the blue station. Well, now the station is further away, so he'll probably be a little less willing to use it. That might drive the demand at the blue station a little lower, whereas there might be other users somewhere over here who now want to use the blue station just because it's very close to them. So by starting from this user-level paradigm, I have captured the accessibility effect in a much more concise and neater way.
I'll dig deeper into whether I have really captured it all: do those problems around the spatial positioning of stations and so on really fit into this way of thinking about the demand generation process? But the main idea is that I am following a user instead of just putting demand at station f and time t on the left and plugging in features to explain it.
So let me formalize this. Actually, before that, one thing I do want to mention is that I don't have user location data. I don't know where the users are when they start thinking, okay, there is a nearby station and I want to go there.
So that will become a bit of a challenge, but I'll get around it. One more thing to note is that a lot of applications these days are based on apps that are able to track user location. So in some cases you might have user location data, and I'll talk a little later about what kind of changes the model requires to incorporate that data if you have it, and whether it really solves the challenges that I'm talking about or whether there are still some challenges to think about.
So let's start with the model building process. It's a fairly simple design; I'm still formalizing that intuition of following the user. In the first step I allow my users to originate at different points in the city. At every single point in the city, a location l, there is a density term that captures the rate at which users originate there, and I can make that a function of a lot of city characteristics. I know that a lot of people will originate where there are metro stations with a lot of traffic, tourist locations, grocery stores, cafes, museums and things like that. I can also include census data: how many people live there, how many people work there, what the demographics are. All of that will guide the number of people originating at different locations in the city. At this stage I'm not yet thinking of these people as using the bike share system; they are just originating, and they could be potential users of my system.
Then there is a second aspect, which is neighborhood bike availability: if people are more confident that they'll be able to find a bike, then more people will originate at those locations. So that can play a bit of a role in the density model.
So at this point these are the people who are originating, and once they've originated at a location they start thinking about whether to use the bike share system or do something else. To formalize that process I'm using a choice model, which is not too complicated. I'll just give you a brief primer on what it means.
What I'm saying is that there is this user, and the way he thinks about the system is that he has a few options, and for each option he writes down the value of using it. That is what I call a utility, and it can be a function of the distance he has to walk and some other station-level characteristics. Once he has the utility of all the different options, including the utility of an outside option, say using an Ola, he chooses the option that gives him the highest value. If that turns out to be this blue station, then you have a demand at that station, and if it turns out to be something else, then you don't.
For the user this is a very deterministic process: choose the option that gives the highest utility. But certain components of the utility are known only to the user and not to us. These are the error terms, and we typically assume a standard distribution for them. So from this choice model we assign a probability that a user originating at a particular location l_i uses a station f at location l_f. What I get from here is the probability of user i using station f at time t, given that he originates at location l_i, and the whole distance aspect is embedded in there.
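To make the primer concrete, here is a minimal sketch of such choice probabilities. The talk does not spell out the functional form, so everything specific here is an assumption: a utility that is linear in walking distance, error terms following the extreme-value (Gumbel) distribution so that probabilities take the multinomial logit form, an outside option with utility normalized to zero, and empty stations dropped from the choice set. The parameter values and names (`beta_dist`, `station_value`) are hypothetical.

```python
import numpy as np

def choice_probabilities(user_xy, station_xy, has_bikes,
                         beta_dist=-0.005, station_value=1.0):
    """Logit probabilities of one user choosing each station or the outside option."""
    user = np.asarray(user_xy, dtype=float)
    stations = np.asarray(station_xy, dtype=float)
    dist = np.linalg.norm(stations - user, axis=1)  # walking distance in metres
    utility = station_value + beta_dist * dist      # assumed linear utility
    # Stations without bikes are removed from the choice set.
    expu = np.where(np.asarray(has_bikes, dtype=bool), np.exp(utility), 0.0)
    denom = 1.0 + expu.sum()                        # 1.0 = exp(0), outside option
    return expu / denom, 1.0 / denom

stations = [(100.0, 0.0), (400.0, 0.0)]
p, p_out = choice_probabilities((0.0, 0.0), stations, [True, True])
p_stock, p_out_stock = choice_probabilities((0.0, 0.0), stations, [False, True])
print(p, p_out)              # nearer station is more likely
print(p_stock, p_out_stock)  # stockout shifts users to the other station and outside
```

When the nearer station runs out of bikes, its probability drops to zero and both the farther station and the outside option pick up the slack, without any explicit substitution rule being coded.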
So what I've done is basically two things. One, I have created a model of where users originate in different places in the city. Two, once they originate, I model how they choose between my system at different locations and something else. My demand model is now very easy: it's just a composite of these two things. If I integrate, over all the locations in the city, where people originate and how they make a choice, then I get a function that describes the demand at each station at a given time.
So let's dig a bit deeper into what this model really means. Has it really captured the distance effect that we are after, or the bike availability aspect that we are after?
Think about a station that is further away from where all my users are. Will that station have a smaller demand? My model would say yes, because there is the utility part of the model, which depends on the distance to the station, and stations that are further away are less attractive to users, so fewer users will show up there.
Is the effect of station density captured in my model? If there are a lot of stations in the city, then typically users have to walk smaller distances to access them. Again, the utility model makes stations more attractive just because users have to walk smaller distances, so a higher density of stations makes users use the system much more. That's embedded directly in there without me having to do anything explicit.
The third important thing is the spatial positioning of stations. You have a station here, a station nearby here and a station nearby there: does that have an impact on the demand at this particular station?
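The composite demand model and the first of these checks can be sketched numerically. This is only an illustration under stated assumptions: the integral over locations is replaced by a sum over a coarse grid, the arrival density is uniform over a hypothetical neighborhood, all stations have bikes, and choices follow the same assumed logit form with made-up parameters (`beta_dist`, `station_value`). To keep the comparison unambiguous, the blue station starts at the western edge of the neighborhood and is moved 100 m further west, so every user ends up farther from it; in a real city some users would end up closer.

```python
import numpy as np

def station_demand(grid_xy, density, station_xy,
                   beta_dist=-0.005, station_value=1.0):
    """Expected demand per station: sum over locations of density x choice probability."""
    grid = np.asarray(grid_xy, dtype=float)        # (n_points, 2)
    stations = np.asarray(station_xy, dtype=float)  # (n_stations, 2)
    dist = np.linalg.norm(grid[:, None, :] - stations[None, :, :], axis=2)
    expu = np.exp(station_value + beta_dist * dist)
    # Logit probabilities with an outside option of utility zero.
    prob = expu / (1.0 + expu.sum(axis=1, keepdims=True))
    return (np.asarray(density, dtype=float)[:, None] * prob).sum(axis=0)

xs = np.arange(500.0, 1001.0, 50.0)  # users originate only where x >= 500
ys = np.arange(0.0, 1001.0, 50.0)
grid = np.array([(x, y) for x in xs for y in ys])
density = np.full(len(grid), 0.1)    # hypothetical uniform arrival rate

before = station_demand(grid, density, [(500.0, 500.0), (800.0, 500.0)])
after = station_demand(grid, density, [(400.0, 500.0), (800.0, 500.0)])
print(before)  # demand at [blue, other] in the original design
print(after)   # blue moved 100 m away: blue demand falls, the other rises
```

The second station's demand rises only because the blue station became less attractive, which is the kind of spatial interaction the proxy-feature approach had to approximate by hand.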
Here too I did not have to do anything explicit. It's just naturally in there, because once there is a station nearby, users over there find that station much more attractive than coming to this particular station. So the utility model drives all of these dynamics very naturally, without us having to sit down and think through each of the corner cases and design proxy feature variables to capture them. That's the beauty of these structural models: once you follow the data generation process, a lot of things are included very naturally.
The same goes for availability. If a station is out of bikes, it does not show up in the user's choice set. Very naturally, users either substitute to a nearby station or use, say, Ola cabs, and so the availability aspects are captured as well.
Now let me come back to the question of location data: if I do have it, how would it play a role in this model? It's great if you have location data, but you have to be a little careful about what that data actually means. Remember, the location we are after is where the user is when he's considering whether to walk a particular distance to use the station. Suppose a user's journey looks like this: he's at home right now, he first takes a metro journey from metro station one to metro station two, and after that he's considering going to this blue bike share station. He can open his phone and check the app at two different places. He can check after finishing the metro trip, or he can be more proactive and check the bike share station status even before taking the metro. If you're not sure whether the location you're capturing is this one or that one, it would be bad data to supply to your model.
So this is something you'll have to be a little careful about. Also, not all users check the app often, so you probably don't have location data on all users, and that can be problematic. Do you ignore those users? What do you do about that missing data? That is something you'll have to think about.
The most important issue is this: you might have location data on your current users, but what about the people who are not currently using the system but would start using it if you were to add more stations or improve the inventory policies? You know nothing about them, because they are not currently users. But to do any sort of counterfactual analysis, that is, if I were to add more stations, who would come to me, you want some understanding of the locations of these non-users, and that is very hard to get. So even with location data you haven't completely solved the problem, and a few more problems come along with it. This is something I'm currently working on, how exactly to use some of the location data, but it doesn't seem to be a straightforward process. That's the main idea.
So I'll go back to assuming that there is no location data, just because that's more concrete to work with. What I've shown you is a very bare-bones model, but you can easily enhance it in any way you'd like. Say you want to include the type of user: some people are commuters who know where they're going, and they may behave differently from, let's say, tourists, who might have very different preferences in terms of what they want to do with the system. You can just tweak your density model or the utility model to have different parameters based on user type.
You can have different user-level parameters based on the intention of the user, be it a food run, an exercise kind of trip or going to the nearest grocery store. You can include pricing data. You can include heterogeneous parameters based on the kind of neighborhood people are coming from or the time of day. All of that can be included very naturally, because now you have a good understanding of what each of the parameters really means. This part is about the user, this part is about where they originate, so whatever you want to tweak has a very clear place in the model. That's the main idea.
Before I get to estimating this model and answering policy design questions, I want to think about my estimates. Am I going to get the correct estimates that I'm after, or might they be misleading in some way? This is what is called endogeneity in economics, and it depends on your data. It's generally a good exercise to understand what kind of variation in your data you are using.
The first thing to notice is that this is an existing system, and you are thinking about how to change it. Whenever this system was built, the stations were not located randomly in the city. There was some logic to placing them. More often than not you would be in a scenario where there are more stations in the more popular areas, say the downtown area with many amenities around. The challenge this poses for the data is that these stations have high demand just because they are in more popular areas, and typically users who access these stations have to walk smaller distances.
So if I naively use this data and fit my model, I might conclude that it is the smaller distances that are causing these stations to have high demand, but that is not the case. Demand is high because these areas are more popular to start with, and that would bias my estimates if I'm not cognizant of what my data really is.
The same goes for bike availability. If you have stations that are really popular, then a lot of bikes are checked out very fast, and that lowers the fraction of time that bikes are available at those stations. So what you'd see in the raw data is that stations with very low availability have high demand, and it would be wrong to conclude that low availability is driving the high demand. You need to correct for this reverse causality, running from the dependent variable to the independent variable.
So we need to correct for these biases, and there are more or less two ways to go about it. The first, ideal way is to do some sort of A/B testing: instead of relying on raw data, you do a controlled manipulation of the system features. In this particular case that is a very expensive exercise. If I'm thinking about station location, then to do any sort of randomized experiment I'll have to put new stations at some truly random places. I'll have to allow some time for users to figure out that there is a station and incorporate it into their commute. I'll have to change all my inventory rebalancing policies based on the change I've made to the system. At least as a first step, that can be a really expensive thing to do.
So what I have instead is some sort of modeling solutions for the kind of variation that you want to use in the data. The first thing is just to control for all the features that might bias your estimates. You saw in the station location example that I gave before that there are some popular areas, and so you want to include what makes them really popular. It is the metro stations, et cetera, so that your model attributes the high demand to the presence of these other amenities rather than to the smaller distances. On top of that, what you can also use is this idea of instrumental variables, where the main idea is basically this: find some variable Z which is outside your model and which affects the independent variable X that you have. Once you have X in the model, there is no place for including Z in the model, just because Z doesn't directly affect Y. But the way it can be useful is, let's say this Z takes two values, Z low and Z high. That can actually make X take the values X low and X high, and you can use that variation as some sort of a pseudo-experiment to figure out how X impacts Y. So that's the main basic idea, and then you can come up with many instrumental variables. In this example, there are bikes coming in at a station, which serve as a shock to the availability at that station, and that can be used as an instrumental variable to figure out the effect of changing availability on the demand at stations.
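As an illustration of the instrumental-variable idea just described, here is a minimal simulated sketch in Python. It is not the estimation from the talk: the data are synthetic, and the names `popularity`, `incoming`, `availability`, `demand`, as well as all coefficients, are assumptions chosen only to reproduce the pattern of a confounded naive estimate and a corrected IV estimate.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def iv_slope(z, x, y):
    """Simple IV (Wald-type) estimator: cov(z, y) / cov(z, x)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

def simulate(n=200_000, true_effect=2.0, seed=0):
    """Synthetic stations; every number here is a hypothetical assumption."""
    rng = np.random.default_rng(seed)
    popularity = rng.normal(size=n)  # unobserved confounder
    incoming = rng.normal(size=n)    # instrument Z: shock from arriving bikes
    # X: popular stations get emptied faster, arrivals raise availability.
    availability = incoming - popularity + rng.normal(size=n)
    # Y: availability truly helps demand, but popularity helps much more.
    demand = true_effect * availability + 8.0 * popularity + rng.normal(size=n)
    return incoming, availability, demand

z, x, y = simulate()
print("naive OLS slope:", ols_slope(x, y))    # confounded by popularity
print("IV slope:       ", iv_slope(z, x, y))  # uses only Z-driven variation
```

In this synthetic setup the naive regression even gets the sign wrong (low availability appears to go with high demand), while the instrument recovers a value close to the assumed `true_effect`, because `incoming` moves availability but has no direct path to demand.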
So this is the kind of optimization process that I used to figure out my estimates. In the interest of time I'll not go too much into it, but basically what I'm using is this idea of generalized method of moments, which is just a generalization of the kind of objective function that you have for linear regressions or maximum likelihood estimates, where the idea is that I'm just imposing this condition that these instrumental variables should be orthogonal to the error terms that I have. So let me skip this a bit. One more thing that you have to be cognizant of when you're using a model like this is that, even though it's great because it's very granular and it's following the user and what's really driving the demand, there could be a lot of computation that you're doing to actually make this model run. And that is happening because of the following. I have this toy example here where I have 11 stations, and these blue and red dots denote, at every two-minute level, whether a station has bikes or does not have bikes. So let me call this a system state: a system state is basically this binary representation of zeros and ones for that particular time. And given that these are fast-moving systems, the system state can be changing very rapidly over time, so no two two-minute time intervals look exactly the same as one another, and you will have to treat them differently. On top of that, you have a lot of users who are originating at different locations in your model, and so the combination can be trillions of computations or even more. And on top of that, if you're optimizing that system to figure out what your optimal parameters are, that can be a whole lot of computations. So what I do for that is basically exploit this idea that there is this location aspect which is important in these systems. If I'm thinking of the station F6 here,
I know that if the station F1 is, say, two kilometers away, then whether that station F1 has any bike available or does not have any bike available should not impact the demand at the station F6. So the way I can exploit that is basically this: as far as the station F6 is concerned, I can just focus on the state of the stations that are near this particular station. Once I do that, what I'll find is that a lot of data points look exactly the same in terms of which stations have a bike available or not. So I basically build on this particular idea in terms of collapsing the data at each station level, and that gives me a huge advantage, like a thousandfold advantage, in terms of the computations that I'm doing to estimate this particular model. I'm not going into many details, but they're present in the paper that is associated with the talk; the main idea is using this location part. All right, so now I am actually ready to estimate my model. What I have is a bunch of different estimates for my parameters. I also do the validation, keeping some test data and some training data, and what I actually see is that not only is this model more intuitive and captures the process very well, it actually performs much better than the other models that I was initially trying to construct: the regression models, choice models, or some variations of that. So that's actually good news. I'm looking here at R-squared or mean squared error kinds of things. So that gives me a little bit more confidence that this is a good approach to use in this particular setting. The main power now is that I've understood this process of how the demand is being generated. What I can now do is take my existing design and come up with a lot of counterfactual designs; say I move stations around in any crazy way I want.
Say I change my inventory management policy, so stations have different bike availabilities. What I can now do is just run my demand model: let my users come in at different places and make choices for this new, perturbed system. And since I know everything about how users are making those choices, since I've figured out all those parameters, I can just predict what the demand will be for this newly configured system. And I'm more confident because I focused on the causality part, so that I know it is not just association, but really, if I change these locations, this is how the system should behave. And so I can exploit that quite a bit. So the first thing I'm showing you is some very first-order understanding of what's really going on in the system. If I focus on a single user, I think about how they think about walking distances. What I figure out is that they of course hate walking, no one likes walking, but the way that function behaves is in a little bit of a convex manner, which is that walking smaller distances is kind of okay. So up to 300 meters the demand does not drop as much, but after 300 meters or so users actually hate walking quite a bit; at least in this system, the demand decays quite rapidly. So that gives some guidance that typically you want to be within the 300-meter range of the stations if you want to capture that particular user's demand.
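The counterfactual exercise described above (let users originate at locations, let them choose among stations that have bikes, then re-run the same choice model on a perturbed design) can be sketched roughly as follows. This is a toy multinomial logit, not the model estimated in the talk: the function name `station_demand`, the linear distance disutility, and the values of `dist_coef` and `intercept` are all hypothetical assumptions for illustration.

```python
import numpy as np

def station_demand(users, stations, has_bike, dist_coef=-4.0, intercept=1.0):
    """Expected trips per station under a toy logit choice model.

    users:    (n_users, 2) origin coordinates in km
    stations: (n_stations, 2) station coordinates in km
    has_bike: (n_stations,) booleans, the binary system state
    Each user picks among stations with bikes or an outside option
    (utility 0, i.e. not using the system). Parameters are hypothetical.
    """
    # Walking distance from every user to every station.
    dist = np.linalg.norm(users[:, None, :] - stations[None, :, :], axis=2)
    # Stations without bikes drop out of the choice set.
    exp_u = np.exp(intercept + dist_coef * dist) * has_bike
    probs = exp_u / (1.0 + exp_u.sum(axis=1, keepdims=True))
    return probs.sum(axis=0)

rng = np.random.default_rng(0)
users = rng.uniform(0.0, 1.0, size=(500, 2))           # simulated origins
baseline = np.array([[0.25, 0.25], [0.75, 0.75]])      # existing design
perturbed = np.vstack([baseline, [[0.25, 0.75]]])      # add one station

base_demand = station_demand(users, baseline, np.array([True, True]))
new_demand = station_demand(users, perturbed, np.array([True, True, True]))
print("baseline total demand: ", base_demand.sum())
print("perturbed total demand:", new_demand.sum())
```

The point of the sketch is only the workflow: once the choice parameters are fixed, any station layout or availability pattern can be evaluated by re-running the same function, which is what makes the counterfactual comparison possible.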
So that's just a very first-pass insight, which I wouldn't have gotten so neatly from some of the other models that I was thinking of. Again, I can just plot how much distance my users are walking, and again note that I did not have this data to start with. I had no idea where my users are, but having this model I have sort of backtraced what kind of distances my users are walking, and I see again that not a lot of them are walking more than 300 meters. I can also trace out where most of my demand is coming from, so it's good to know that a lot of it is coming from residential areas, from where supermarkets are, where cafes are, where metro stations are. That can again guide my design in where I should be locating most stations. So that's just the first-order insight; I have done nothing fancy. But just to exploit this particular model, let me illustrate what else you can do. So I have this use case here where I'm starting with this hypothetical scenario that I have a certain budget, and that budget in this particular system would mean I can purchase some bikes, because that's kind of a huge capital cost in these kinds of systems, and I can purchase some docks where these bikes should be placed. Typically bikes and docks have a very proportional ratio, like the number of docks would be, let's say, 1.5 times the number of bikes. So I can think that I have money for, let's say, 1,000 bikes, and once I have that I can think about distributing those 1,000 bikes in the city in, let's say, these two different ways. One is I can have, say, 20 stations in my city, but each of the stations would have 50 bikes: large stations, but very few stations. That's this design on the left. Or I can have a lot of stations in my city, but each of the stations would be on average smaller in size.
So let's say 100 stations, but each of the stations has only 10 bikes. The advantage of the system on the left is that each of the stations is pooling in a lot of demand, since they are bigger in size. We know that pooling has a lot of economies of scale, and that would cancel out a lot of variability. So basically what would happen is that in the system on the left, each of the stations would run out of bikes much less often, so they would be good in terms of having higher availability of bikes. Whereas the system on the right, we know, would have lower availability, just because each of the stations is smaller in size, but since there are more of them in number, they would be very good in terms of having smaller distances for users. So essentially what is happening is that the system on the right is very good in terms of the accessibility part, and the system on the left is very good for the availability aspect. And now I'm thinking, well, which of the designs is the good design? Because in general it is good for your system to have both higher availability and higher accessibility, but in this particular case it's really a trade-off: you can really have more of one or more of the other. And to figure out what a good design is, well, it depends on how people think about these two aspects, and so far I have actually figured out how people think about these two aspects. So what I can do is basically generate a lot of these scenarios and then use this demand model, where I know how people are coming into the system and how people are making these different choices. And that is exactly what I do: I generate a lot of these system designs and I simulate my demand model, and what I see is that this is where the status quo is right now. If I keep on adding a few more stations, my demand actually goes up, so the system becomes more accessible and a little bit less available in terms of bike availability, but still overall it is good for generating high demand, and after a while just
it doesn't pay off that much. So basically this simple exercise helps me reconfigure my system design by guiding me that I should probably be placing more stations in my city, at least a few more than what the status quo is right now. And so these are the kinds of counterfactual designs you can do right away, just because you've understood a lot about what is driving demand at the stations. So that's just one illustration of it. Let me conclude here, and probably I'll have a few minutes for questions. So just to wrap up, what I've basically shown you here is a little bit of an alternative paradigm to the typical, traditional machine learning model that we have, where I focus much more on having a model that is more interpretable and more causal, and I've shown you that it is more flexible in terms of including a lot of the variations that you'd like in your model. It is good when you don't just want to make predictions but also want to make some sort of prescriptions or decisions on top of the relationships that you've already uncovered. In this particular case it turns out that there are huge improvement opportunities, which this model is able to guide you towards. And there is one more thing which I want to showcase: since you've now understood user primitives, which is how users think about walking distances, you can actually take that estimate and use it elsewhere. So for example, if there is this company which is running this bike share system and it wants to launch, let's say, a motorbike sharing system, there is no data whatsoever, but still people are not going to think about walking differently just because it's a motorbike sharing station, right? So you can take this estimate and plug it into that model. There would still be a few more things to figure out, how much people are willing to pay, et cetera, but at least these basic primitives can flow from one model to another, whereas that would be a hard stretch to do if you have a
random forest model, because there is no understanding of how people think about walking distances. So that's the kind of beauty these models come with. So let me stop here, and if there is time I would be happy to take some questions. Seems like there is. Perfect. Yes, please. Hi, yeah, so you were focusing your model on docked bikes that have docking stations, right? Yes. So how much would the model change, say, if you have a dockless system, and what would be the advantages and disadvantages of that? So you mean advantages and disadvantages of the dockless system in general? Yes. Okay, so first, how would the model apply to those systems? There would be some differences, there would be some similarities. You would still think similarly about users starting from a particular location and then having some distance to walk, so those kinds of features will stay the same. The availability will become a little bit different, because now there is no longer this idea of stations which have bikes or don't have bikes, but just that there could be bikes scattered around and you would have to walk some distance. So it would be more like a continuous rather than a discrete model? Right, exactly, yeah. So instead of having these hubs, the bikes would be more dispersed. So that's the first thing. The second thing, and that goes beyond this model a little bit, is how those dockless systems differ from these dock-based systems. So just for information for everyone: the system that I'm looking at right now has these stations where bikes have to go, and what is happening right now is that a lot of companies like ofo and Mobike are coming up with these designs where a lot of the technology is in the bike itself, so you don't need these docks to park your bikes. You can basically pick up a bike from anywhere by unlocking it through your phone and drop it off literally anywhere, on a footpath, wherever. You don't need a dock to dock them for your trip to end. So the way that would
differ is, a little bit in terms of the model, I said already how it would differ, and those designs I'm looking into quite a bit, working with a few companies. But basically the idea is that the ending part of the trip gets more convenient, just because you can leave your bike anywhere, but the fetching part of this whole thing becomes a little bit more difficult as well, because instead of having this certainty that you go to the station and you'll most certainly find a bike there, now you almost every time have to open an app and look around for where a bike could be near you. So that creates a few differences in terms of how convenient that system is for a user or not, but that remains a little bit of an open question; we have to look at more data more closely. Thanks. Thank you. Yes, we have a mic here. I'll repeat the question, that's it. Can I? Okay, so we have a question here as well. So we just have time for one question, so please take it offline. I'll do that. Yes? Yeah, just a couple of observations. You don't have user location data, but obviously you can superimpose the population density of the city on your model to use it. And the second was, your model should be biased toward availability, I'm sorry, accessibility, because you can always increase availability by adding more bikes. So don't you think accessibility is more important, or is that how your model is configured? Okay, so two questions. First, you're saying I can always model where people are originating just by looking at the population data. So that is one aspect, which is, if people live somewhere, there would be more reason that people would start their journeys from there, and that is part of my model, except that it only accounts for less than 50% of the trips, because people would often start their trips not from where they live, but they would also take a metro trip and start a journey from there, or they would go to a
nearby cafe at some time in the evening and then start a journey from there. So all of those aspects actually play a substantial role in driving the demand for the system. It has to be more than just the population data, and that's basically what's being built into this particular model. And the second question was about the availability part, that you should always emphasize accessibility, whereas availability is something that you can always tweak. Well, it costs money. I mean, these bikes, especially in these dock-based systems, don't come cheap; each of the bikes in this Vélib' system was about $1,000 or so. So increasing the number of bikes, even though it seems like, okay, we should always be able to do it and never allow any station to run out of bikes, doesn't actually happen, because it's just a very expensive investment to make. Does that make sense? Right. Well, thank you so much, and we'll look forward to the next