Welcome, everyone. Thanks for joining. I'm Kellan Betts, a course lead in the MITx MicroMasters Program in Supply Chain Management here at the MIT Center for Transportation and Logistics. Very happy to be co-hosting today with Laura Aleya, also a course lead in the MicroMasters program. Welcome, Laura. And today we're very excited to have Dr. David Correll. He's the co-director of the MIT FreightLab here in the Center for Transportation and Logistics, and he's also a lecturer and research scientist within the Center. So welcome, Dave. Hey, thank you for having me. If you've joined our webinars in the past, you know we like to kick things off with a poll, because we'd like to know a little bit more about our audience. So if you could launch that first poll here, you should see a pop-up within Zoom. And the question, while that's getting loaded, is just why you're here today. Some of the options: I want to learn more about machine learning applications; I'm in supply chain; I'm interested in knowing more about the use of technology to improve supply chain performance. So just a couple of options there. Please take a minute or so to look at that poll and tell us why you're here today. And while we do that, my colleague Laura is going to go through the agenda for today's session. Thank you, Kellan. And thanks, everyone, for joining us today. We're very excited to host Dr. David Correll, so we hope you enjoy it. During the next 30 minutes or so, Dr. Correll will discuss recent work at FreightLab aiming to better understand, and find ways to solve, truck driver turnover. The outcomes they are obtaining at FreightLab will enable companies to implement interventions and design strategies to help reduce truck driver turnover, or even prevent it. Then Kellan and I will have a few minutes to ask some questions we have prepared.
But we also want to save time for your questions at the end, so start thinking about those. Please remember that we will not go through the chat; you need to use the Q&A feature to ask your questions. And when you do, be sure to be logged in with your name, because we are not going to read anonymous questions. So let's check the poll. Let me end it and see why you're here. You're probably seeing the results now. Most of you want to learn about machine learning applications and supply chain in general, which is great, so Dr. Correll is probably going to address that interest. And you also want to know how to use technology to enhance supply chains overall. So let's do this. I'll stop sharing. And with that, Dave, are you ready to kick it off? I'm ready. Thank you very much. Should I start? Yeah, go ahead. Gosh, well, thank you very much for having me, and thank you to the many students who have joined. I've just seen in the chat how many of you are joining and how you're from so many countries of the world. It's really a great honor to get to contribute again to the MicroMasters program. As many of you probably know, we just recently celebrated our MicroMasters graduates, and I got to meet some of them. So many people said that the program had changed their life for the better. And I was thinking about that and how, well, that means that the students gave a lot of time to earning their certificates, but also that the staff, like Kellan and Laura and Ava, gave a lot of time to making a great program. There are not a lot of pursuits in life where you get to know that you're changing people's lives for the better, but the MicroMasters is one where we all get to do that, and it's a great honor to be a part of it again. And I want to say hello to Kuang. I saw my friend in the chat. And maybe there are some new friends in the chat too. OK, so just like Kellan said, my name's Dave Correll.
I'm a research scientist here at CTL, the group that brings you the MicroMasters program. And one of the jobs I do there is I work with your friend and mine, Dr. Chris Caplice, on the FreightLab research program. That's a research program that he started over 10 years ago, and we spend a lot of time looking at issues related to truck driving and how trucks carry freight between point A and point B as part of mostly American supply chains. But we have some global experience as well, and I'll try to bring some of that to today's conversation. I want to start with a motivating question, and I'll share with you what I think the answer is, and maybe we can talk more about it in our discussion. Are there enough truck drivers? I know that's kind of an abstract question, but for years now, certainly in the United States and around the world too, I think people have felt that there are not enough truck drivers available to carry the loads generated by the world's economies. So just as some evidence for you: in the United States, here's an article from the Wall Street Journal, the business newspaper of record here. This is from 2021, so relatively recent, and they ask, where are all the truck drivers? The shortage adds to delivery delays. In this article, they're explaining that the performance of American supply chains is really suffering because of this arguably weak link in the chain: when we need to move something on a truck, there are not enough truck drivers available. Now, that's from 2021. I can go back further. Here's 2019, Bloomberg, another widely read business media outlet in the United States. They say the truck driver shortage is on course to double in a decade. So this is a problem that has been around for a while, and everyone who studies it believes that, without intervention, it might get worse. I can go back even further, to 2019.
This is from the New York Times, probably the United States' most widely read newspaper. And they write: what does a truck driver look like? It's changing amid a big shortage. This was an article about efforts to change the demographics of people who typically try to become truck drivers, to try to have a bigger applicant pool. I bring all this up to say that if you're studying modern supply chains, particularly in the United States, and you're looking for a way to be helpful, one of the problems on everyone's mind is the feeling that there are not enough truck drivers available. Now, as I was getting ready for this call, and knowing how international our audience is in the MicroMasters, I was looking around for examples from around the world, and there are some. This is from the New York Times this year: how a crisis in truck driving could change life in Japan. They were explaining the same thing we were talking about in the United States: that supply chain performance, no matter how well designed, no matter how much money we throw at it, no matter how much we optimize it, cannot live up to its goals if we do not have truck drivers available to move the freight through the network. And it goes even deeper than that. This is a study from the IRU in Europe looking at truck driver shortages in economies around the world, and I was really quite excited to see, well, it's a problem, but it's interesting to me that it's a global problem. They're looking at the number of unfilled truck driver positions: 9% in Mexico, where I know we have some students; 11% in Argentina, where I know we have some students and some excellent course leads. In Europe, we see a similar problem in parts of the former Soviet Union, and even in China, where I know we have a lot of students.
So this idea that there's a shortage of truck drivers relative to supply chain needs is, I think, a global problem, and that really gives us motivation to bring whatever we can to the table as part of the solution. The way that we approached it in FreightLab was: well, could we think about retaining truck drivers? Part of the problem in the United States is that companies that employ truck drivers oftentimes have over 100% turnover; the industry average is in the high 90s. That means that every year, the managers of truck drivers are managing an entirely new group of people, because so many people quit every year. We thought, okay, maybe that's an opportunity for researchers like ourselves, and like you all who are learning all of these quantitative methods for supply chain management, to be helpful in zooming in on this problem: how can we keep the truck drivers that we have? This was an especially good time to get into this, because we started thinking about it in probably 2019 and started working on it more in 2020. Part of the reason it was an opportune time for us was that the laws in the United States changed at exactly that time, in a way that requires all truck drivers to use what's called an electronic logging device, or ELD. Here's a picture of one. This is the truck driver; he's in his truck, and every minute that he works has to be recorded on an electronic log. So when he starts up the truck, he'll have to hit a button that says, okay, I'm driving now. When he takes his break, he'll have to hit a button that says, I'm taking a break. When he's done for the day, he enters that into the system. This machine is also tied into the engine computer, so for the most part, people believe it to be pretty fail-safe against cheating, although that's something we can talk about if there's time in the discussion too.
The reason the truck drivers were made to start using this device was originally a safety-motivated one. Over here, I've plotted the legal working hours for truck drivers in the United States. According to US law, a truck driver can work for 14 hours in one 24-hour day, and of that 14 hours, they can drive for 11 hours. So in any 24-hour cycle, the maximum a truck driver is allowed to do is work for 14 hours, 11 of which can be driving, and then they need to take a 10-hour continuous rest. This was originally implemented because people were worried that truck drivers were profit-maximizing and driving without sleeping, making the roads unsafe. Historically, adherence to these rules was measured by paper logs: the drivers would have to fill out their hours of service, how many hours they were working, on little pieces of paper. This new technology makes it a lot easier to monitor, both for the authorities, the truck police, and for the driver. Here's a dashboard of what an ELD would show the driver. It would say, okay, you have 12 hours and 55 minutes left of legal work time today, you have 10 hours of legal drive time, and you haven't taken a break, for example. We can talk in the discussion about how some people love these things and some people hate them. For a researcher, it's gold, because it means, if I'm starting from the position that I want to understand why truck drivers are quitting their jobs so much, now I have digital data about how they spend their time. I have actual digital measurements of where their time is going. And that turned out to be tremendously valuable in addressing this problem of the truck driver shortage, but maybe not in the way you think. Let me get this over here. Okay, what I'm showing you now is a summary of electronic logging device data, ELD data, for around 1,200 truck drivers over four years. And we took two particular measurements from the logs.
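The hours-of-service caps Dave describes can be sketched as a quick legality check in Python. This is only an illustration of the 14/11/10 rule as he states it, not of any real ELD software; the function name and inputs are my own.

```python
# US hours-of-service caps as described: per 24-hour cycle, at most
# 14 on-duty hours, at most 11 of them driving, and a 10-hour continuous rest.
DUTY_CAP, DRIVE_CAP, REST_MIN = 14, 11, 10

def legal_day(duty_hours, drive_hours, rest_hours):
    """Return True if one 24-hour cycle satisfies all three caps."""
    return (duty_hours <= DUTY_CAP
            and drive_hours <= min(duty_hours, DRIVE_CAP)
            and rest_hours >= REST_MIN)

print(legal_day(14, 11, 10))    # right at the legal limit
print(legal_day(12, 11.5, 10))  # exceeds the 11-hour driving cap
```

An ELD dashboard is essentially running checks like this continuously and showing the driver the remaining legal time.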
By day of the week, how many hours on average did they spend driving? That's in blue. So on Mondays, on average, each driver of the 1,200 got between a little under six and a little over seven hours; that would be the middle 50%, if you're familiar with these box-and-whisker plots, with a high-end 25% reaching almost 10 and a low end at around two and a half. We did it for every driver for every day of the week. So Monday, and I put the numbers here so you can see them too: Mondays, they got 6.25 hours of driving on average; Tuesdays, seven hours; Wednesdays, 7.2; Thursdays, 7.2. You can see it visually too: Friday, Saturday, Sunday. We also measured the standard deviation of those, as a measure of variability by day of week. You can see here, just by visual inspection, that the weekends are a little more variable; those boxes are bigger. And we also looked at what's called on-duty hours. That is, if you were paying attention to the rules, the difference between the 11 and the 14. So what's a truck driver doing if they're working but not driving? Typically, they're sitting somewhere waiting for their truck to be loaded or unloaded. So we measured the average of that time by day of the week too. A few things really shocked me about this chart, and it ended up giving us a lot to share with the world. The first thing: everyone gets around six and a half or seven hours per day of driving, on average, and there seems to be a pattern around the days of the week and the weekends. Let me show you why I think that's so shocking. According to the law, I'll call it the de jure cap, to use the fancy legal term, a driver in the United States carrying freight can drive for 11 hours. But we've looked at the data I just showed you, and in other studies we've looked at other companies and gotten the same results. In reality, de facto, they drive about six and a half hours.
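The day-of-week summary Dave describes is a straightforward groupby in pandas. This is a minimal sketch with made-up numbers and assumed column names, since the real ELD data set is not public:

```python
import pandas as pd

# Hypothetical ELD extract: one row per driver-day (column names are assumptions)
logs = pd.DataFrame({
    "driver_id":   [1, 1, 1, 2, 2, 2],
    "weekday":     ["Mon", "Tue", "Mon", "Mon", "Tue", "Tue"],
    "drive_hours": [6.0, 7.0, 6.5, 6.25, 7.5, 6.75],
})

# Mean and standard deviation of daily driving hours by day of week --
# the two measurements summarized in the box-and-whisker chart
summary = logs.groupby("weekday")["drive_hours"].agg(["mean", "std"])
print(summary)
```

The same pattern with an `on_duty_hours` column would produce the waiting-time panel of the chart.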
The problem there is: how can something be scarce and underutilized at the same time? How can we say that we have a shortage of drivers? The crisis is that we don't have enough drivers, but the drivers that we have are not working all the possible hours. Similarly, if you were to say to me, my town does not have enough restaurants, then I would expect that the restaurants you have are completely busy all the time. Something must be broken if, in the midst of a shortage, the scarce resource is not fully utilized. So we knew that there was something broken, perhaps something even fixable, once we saw this difference. Another way to think about the difference: let's say there are 1.8 million of what we call class 8 truck drivers in the United States. That just means employee truck drivers, people who, when they filled out their taxes, said, I work as a truck driver; around 1.8 million. The current estimate of how many drivers short we are comes from the American Trucking Associations, and they say 80,000. So then I say, okay, well, how short is that? 80,000 into 1.8 million is only about 4.4%. This really reframes the question then: what if we don't need to hire people? What if it's not a shortage of drivers, it's just that we're not getting enough out of the drivers that we have? What if, instead of hiring 4.4% or 5% more drivers, we just gave all of the drivers we have 5% more driving time? That works out to about 4.5% added to the 6.8 hours per day, which is about 0.3 hours, 18 minutes. So now, just to put the problem in a sense of scope and scale, we're really only talking about trying to find 18 to 20 more minutes of driving time every day for the drivers we already have to alleviate this thing that has been in the headlines for years, that we perhaps misdiagnosed as a shortage of drivers. Maybe it's not a driver shortage; it's a utilization crisis amongst the drivers that we have.
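The back-of-the-envelope reframing above is worth checking explicitly. Here it is as arithmetic, using the figures quoted in the talk:

```python
# Reframing the "shortage" as recoverable capacity from existing drivers
drivers = 1_800_000       # estimated US class-8 truck drivers
shortage = 80_000         # American Trucking Associations shortage estimate
avg_drive_hours = 6.8     # observed de facto daily driving time

shortfall_pct = shortage / drivers                      # fraction of workforce "missing"
extra_minutes = shortfall_pct * avg_drive_hours * 60    # equivalent added drive time

print(f"shortfall: {shortfall_pct:.1%}")
print(f"extra driving needed: {extra_minutes:.0f} minutes per driver per day")
```

So closing the stated 80,000-driver gap is equivalent to finding roughly 18 more minutes of daily driving time per existing driver.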
This turned out to be a pretty controversial thing to say, but it was exciting for us to get to share it with the world. I was lucky enough to get invited to the US Congress to share these ideas at a hearing on supply chain malfunctions in the United States. After that happened, much to my delight and great surprise, trucking companies started messaging, saying, we looked at our data and we got the same thing: 6.5; some people got 7.1. This notion that our truck drivers are not utilized fully, to the extent that the law would allow, seems to be true all over. Something that I'd like to point out is that, when we think about how this is happening, and we'll talk about that, we've looked at it a lot of different ways, and I don't think it's the truck drivers' fault. I think a lot of the reason that we are leaving, as this headline from MarketWatch put it, 40% of trucking capacity on the table every day is that we are not efficiently processing drivers at their pickup and drop-off appointments. So we'll get into the machine learning here; I know that that's what many of you are here for. We wanted to use the digital work logs to understand what a truck driver's life is like. It turns out it's a lot of waiting, maybe too much waiting, given that we feel we don't have enough truck drivers. It also turns out that the utilization issue we stepped into really seems to relate to retention. And I'll show you these results, but one of the biggest findings we have from the machine learning application to this data is that if you ask me, how do I keep a truck driver? You need to utilize them more, and I'll show you how we got to that conclusion. So across the top here, I've put the steps of our study, and I'm going to walk through each of these steps for you, because I know that many of you are excited about doing studies like this yourselves. So the first thing that has to happen is the drivers collect the data.
So here's our driver inputting the data. Now, one of the unique things about this study is that I was very lucky to get access to this data. People are very hesitant to share ELD data. I was lucky enough to get some, but it was recorded independent of my direction. The way it was recorded was: they recorded from September 2016 to November 2016, then stopped; started up again February 2017 to March 2017, then stopped; started up again March 2018 to April 2018, stopped; started up again June 2018 to August 2018. A somewhat strange batching of times, but that was the way it was done. So then I started to think, okay, what can I do with this if I want to help the world understand how to retain valuable truck drivers? Well, we could do two experiments. Who was here in time one, September to November 2016, but then doesn't show up in the data set in February to March 2017? That's experiment one. Experiment two: who was here in March of 2018 but wasn't here in June of 2018? Now I can use, and I think you'll learn about this in SC4x, labels: I can label a driver as someone who stayed across T1 and T2 or someone who left between T1 and T2, and label someone who stayed across T3 to T4 or someone who left before T4. And I can start to train a machine learning algorithm to identify, from the electronic logging device data, the people who are likely to be gone in time T plus one. As we started to look through the data and think about the kinds of features that might be helpful, what really stands out, and I alluded to this before, is how many hours they work. So what you can see here is a histogram of drivers, and this is in the second experiment: present in T plus one in blue, absent in T plus one in orange. So blue means they stayed with the company; orange means they quit. If you've never seen a histogram before, the way I like to teach it is that it's just a pile of bodies.
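The labeling step Dave describes can be sketched in a few lines: a driver's label is simply whether their ID reappears in the next recording window. The driver IDs here are invented for illustration:

```python
# Hypothetical driver IDs seen in each recording window
t1_drivers = {"A", "B", "C", "D"}   # present Sep-Nov 2016 (T1)
t2_drivers = {"A", "C"}             # present Feb-Mar 2017 (T2)

# Label for experiment one: 1 = stayed (seen again in T2), 0 = left
labels = {d: int(d in t2_drivers) for d in t1_drivers}
print(labels)
```

Experiment two repeats the same construction with the T3 and T4 windows, giving the supervised targets the classifiers are trained on.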
So this is the pile of bodies at 40: people who worked 40 hours a week, in orange. In blue is the pile of bodies of drivers who worked closer to 50 hours per week. The first thing we notice is that if you stayed with the company, you worked more hours. On average, you drove more hours that week if you stayed with the company than if you left. Now, I know this is of interest to this group, so I'll be a little bit technical: we compared these distributions using a Mann-Whitney test, a non-parametric comparison. You can see it visually; the center of mass is not the same here. There's a statistically significant shift to the right. Drivers who stayed with the company got more driving hours. These results are available in this working paper from 2021 as well. Let me look at it even more. If I do the same thing, but now by day of the week, we see the same results. The blue is taller than the orange. The drivers the company was able to keep got more hours on Mondays, more hours on Tuesdays, and more hours on Wednesdays, Thursdays, and Fridays. Every day was statistically significant except Sunday: drivers who drove more on the days of the week, particularly on these high-hour days, stayed with the company; drivers who drove less were the ones who left. Okay, so we have our data cleaned. We're engineering our features around average driving hours by day of the week; we're also going to use the standard deviation of those hours by day of the week. We needed to do a little bit of cleaning. Our first step: we looked only at what are called solo long-haul drivers. That's a driver who works by themselves and drives long distances. We deleted any suspicious records; if the log read that a person worked 300 hours in a day, we knew something was wrong and deleted it. And we only looked at driving and on-duty hours.
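The Mann-Whitney comparison Dave mentions is available in SciPy. This sketch uses simulated weekly hours (the real distributions belong to the study's proprietary data) just to show the mechanics of testing for a rightward shift:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Simulated weekly driving hours, shaped like the talk's histograms:
# drivers who stayed cluster near 50 h/week, drivers who left near 40 h/week
stayed = rng.normal(50, 5, 200)
left = rng.normal(40, 5, 200)

# One-sided Mann-Whitney U: are "stayed" hours stochastically greater?
stat, p = mannwhitneyu(stayed, left, alternative="greater")
print(f"U = {stat:.0f}, p = {p:.2e}")
```

A small p-value here plays the role of the "statistically significant shift to the right" in the paper's comparison.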
And as a result, we had 1,298 clean and complete drivers to study. I put these logos up here just because I know these are the tools you're working with: Python and pandas and MySQL. They really work; it's worth the investment of your time to get comfortable with them. Okay, so we've got our data cleaned, we've got our features engineered, and we're ready to move forward. There's a lot on this slide, but I feel like for those of you who are interested in the nuts and bolts of applying machine learning to supply chain questions, it's useful to be this specific. So I'm combining a lot here, but the first thing we have to do, and I'm sure this is covered for you in SC4x, is think about training and test sets. So we have all our data, and we're going to say, okay, some of this is for training my model; some of it I will reserve, and I will see how well my model predicts on data it has never seen before. This image shows an 80-20 split: all the data gets divided, 80% of it is for training the model, and we don't let the computer see the other 20%. At the end, we test the model on that 20% and see how accurate it was. That's the idea. So we start there, but then I got a little bit fast and loose, and I'm happy to talk with you all about this decision; it ultimately got published, so it can't be that bad. It was a time-saving step where, in essence, I did this twice in this particular study. One of the things that happens with machine learning algorithms is that they need to be tuned. Let me show you what I mean. We've got our data in training and test sets. We decided to use three classification algorithms to predict whether a driver will be present in T plus one or not, whether a driver is going to quit or not. We chose logistic regression, random forest, and support vector machine, based on past literature that tried to do this kind of work. Those were all strong-performing algorithms.
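The 80-20 split described here is one call in scikit-learn. The feature matrix below is random stand-in data with the study's shape (1,298 drivers, mean and standard deviation for each of seven weekdays for two duty statuses would give more columns; 14 is an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1298, 14))   # stand-in for the engineered ELD features
y = rng.integers(0, 2, 1298)      # 1 = present in T+1 (stayed), 0 = left

# Reserve 20% the model never sees; fit and tune only on the other 80%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(X_train.shape, X_test.shape)
```

`stratify=y` keeps the stay/leave proportions the same in both halves, which matters when the classes are imbalanced.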
Logistic regression you'll probably have some familiarity with. It's essentially a regression-type tool, in the spirit of ordinary least squares, where the output is not just a likelihood but a label. So it builds, from the likelihood, a binary classification: quit or not. Random forest is more from the machine learning side, a completely different approach, where the algorithm will essentially try to slice your data up into a series of decision trees. Did the driver work more than four hours on Tuesday? Yes. Did they work less than seven hours on Saturday? No. And it will find different ways to climb through the data with those yes/no tree statements to make a prediction. Support vector machine we tried because I saw some research showing that it was a pretty powerful algorithm for this type of application; it was okay for ours. It's very computing-intensive, so it takes a lot longer to run, but it essentially, and I know this will sound technical, plots all of your data in n-dimensional space and finds a hyperplane that divides that data most cleanly. All that means is, if this is all the people who quit and this is all the people who didn't quit, the support vector machine finds a way to slice right through the middle of that data to make a clean prediction. The thing about all of these algorithms is that each one has little dials and knobs, if you will, that allow us to tune it. This is called tuning the algorithm, or, to sound really fancy, tuning the hyperparameters. For random forest, they have names like maximum depth and maximum leaf nodes. For support vector machines, it's the kernel, the C regularization, the degree, the gamma value. Honestly, these are things that are better left for a dedicated course on machine learning. So that was a long buildup to why I did it the way I did. I don't care about those things; I just want the best one.
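All three classifiers the study used are available in scikit-learn with a shared fit/score interface. This sketch trains them on synthetic data built to echo the study's central finding, that drivers who stayed drove more hours; everything about the data itself is invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the ELD feature matrix: seven weekday driving means,
# with stayers (y=1) shifted about one hour higher than leavers (y=0)
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, n)
X = rng.normal(6.5, 1.5, size=(n, 7)) + y[:, None]

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "support vector machine": SVC(),
}
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
print(scores)
```

In the real study these would be fit on the training split only and scored on the reserved test set, not on the training data as in this shortcut.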
And I don't have any reason to believe that a certain maximum number of leaf nodes per tree is any better than any other in my random forest. Similarly, I don't have any reason to believe that certain hyperparameters of my support vector machine are better than others. So what I did was use a package called Optuna. All of these algorithms are available to you through the Python package scikit-learn. Optuna is another package where you can say, okay, let's divide the training data again into a series of experiments, and I will tell the computer to experiment with all the hyperparameters and just give me back the best ones, in a new five-fold cross-validated breakup of the original training data. A little bit intense, but what that means is: I'm going to take this training data, divide it up again into new training and validation sets, and ask the computer to experiment with all of the hyperparameters and just give me back the best. So then, when I get to my final test, I am only testing a tuned prediction algorithm on the fully reserved test set. There's a lot there. For those of you who are in this world, maybe that made sense; hopefully I didn't jumble it too badly. But the bottom line is: you always need a training and a test set. What people sometimes forget is that all of your classification algorithms can be tuned to further improve performance. For my purposes, I just want whatever the best tuning parameters are; I don't have a theoretical reason to adjust them manually. So I use a package like Optuna, where I can say, run 1,000 experiments, try every possible tuning, and just tell me what's best. Okay, we do that. So now we have our data cleaned and our features engineered. We've set up our classifiers, we have our training and test sets, and I've tuned them. So now I have the tuned algorithms, and I will apply them to the test data. We need to know how we did.
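The study used Optuna for this search; as a dependency-light sketch of the same idea, scikit-learn's `GridSearchCV` does cross-validated hyperparameter search over an explicit grid (Optuna samples the space more cleverly and handles continuous parameters, but the training-data-only, five-fold logic is the same). The data and grid here are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 14))                          # stand-in training features
y = (X[:, 0] + rng.normal(0, 0.5, 300) > 0).astype(int)  # noisy binary target

# Five-fold cross-validated search over two random-forest hyperparameters;
# run this on the training split only, keeping the final test set untouched
grid = {"max_depth": [2, 5, None], "max_leaf_nodes": [4, 16, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is then the tuned model to evaluate once on the reserved test set.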
I wanted to call out a couple of measures that we used. You can see here what's called a confusion matrix, as you may have seen before; my candidate for the worst-named thing in mathematics and analysis. We need it, though, because it helps us understand how we are evaluating our classifier's performance. So for accuracy, we have A11 plus A22 over A11 plus A12 plus A21 plus A22. What does that mean? A11: what did I get right? What did I say would turn over and did turn over? A22: what did I get right? What did I say would not turn over and did not turn over? So that's the numerator here, over everything. You can think of it as the fraction, the percentage, of right classifications over all opportunities to classify. That's a pretty popular way to understand these things. I also wanted sensitivity, which is a little bit different as a measure. The numerator is: what did I say would turn over and did turn over? And the denominator is: what did I say would turn over and did turn over, plus what actually turned over but I didn't catch. You can call it either sensitivity or recall; to my mind, sensitivity is the more intuitive word. It's how many of the real true positives did I actually catch. You can think of this as especially important in, for example, disease prediction, where we really don't want to miss a lot of true positives. So we looked at those two, and let me show you how we did. For single experiments, I've got the accuracy and the sensitivity for logistic regression, random forest, and support vector machine for experiment one; that's, could I predict who was in T1 but not in T2? And experiment two: who was in T3 but not in T4? Accuracy-wise: mid-sixties, upper-sixties, almost 70%. Sensitivity: around 50%. So you're probably reading it the same way I did, with some sadness and some disappointment.
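The two formulas fall straight out of a 2x2 confusion matrix. The counts below are invented but chosen to land near the study's reported scores (accuracy in the mid-sixties, sensitivity around 50%):

```python
import numpy as np

# A11/A12/A21/A22 layout as described: rows = actual, columns = predicted,
# with "turnover" as the positive (first) class
A = np.array([[25, 25],   # actual turnover: 25 caught (A11), 25 missed (A12)
              [10, 40]])  # actual stayed: 10 false alarms (A21), 40 right (A22)

accuracy = (A[0, 0] + A[1, 1]) / A.sum()     # right calls over all calls
sensitivity = A[0, 0] / (A[0, 0] + A[0, 1])  # true positives actually caught

print(f"accuracy = {accuracy:.2f}, sensitivity = {sensitivity:.2f}")
```

With these counts, accuracy is 0.65 and sensitivity 0.50, which is roughly where the single-experiment results landed.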
Of course we wanted higher scores, but I think we can speak to why these scores were at least enough for us to get published as a first step in this effort. Let me show you these results one more way, because I think you'll see them in SC4x. This is what's called an ROC curve. You have your true positive rate and your false positive rate: true positive on the y-axis, false positive on the x-axis. Essentially, this is a way of seeing whether your prediction is any better than a random guess, because with a random guess, if I'm just randomly guessing heads or tails on a coin, I have an equal likelihood of getting it right or wrong every time. So any time your ROC curve bends above that diagonal line, you've done a little bit better than random. And I would honestly say that's about how well we did: a little bit better than random. Not great, but certainly not a perfect prediction either. A perfect ROC curve would come right up here. The red, the blue, and the green are the different classifiers, and they all perform about the same; we can talk about that. This is all just single-run experimental results. One of the things we also did, as a robustness check, was another cross-validation, where we said, okay, let's try the tuned parameters on five different runs and see if our accuracy and our sensitivity, our recall, are about the same in five-fold repetition. And they really are. So here we can see logistic regression, random forest, and support vector machine for accuracy across five-fold cross-validation; we pile pretty neatly around that 65%. For our recall, same basic story. This Friedman test down here is just a test of whether one algorithm is outperforming the others, and I really don't think so. So we've evaluated the model for its predictive power. We have one more thing we can do: what we call post-hoc analysis. Can we understand how our algorithms got to the classifications they got?
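Both ideas on this slide, the ROC comparison against a random guess and the five-fold robustness check, can be sketched together with scikit-learn's cross-validated ROC AUC, where 0.5 is the random-guess diagonal and anything above it means the curve bends over the line. The data is synthetic with a deliberate planted signal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 14))
# Outcome driven by the first three features plus noise, so the signal is real
# but imperfect -- "a little bit better than random" territory
y = (X[:, :3].sum(axis=1) + rng.normal(0, 2, 500) > 0).astype(int)

# Five-fold cross-validated area under the ROC curve
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="roc_auc")
print(auc.round(3), "mean:", auc.mean().round(3))
```

Similar per-fold scores across the five folds are the kind of stability the robustness check on the slide is looking for.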
The tool that we used for that is something called a Shapley value. It's a really cool tool, again, available as an open library in Python, that takes a game-theoretic approach to understanding what each feature's contribution to the classification was. One way to think of it: if the basketball team won the game, how many of those points can you attribute to each player? It's easy if you only count the buckets, the guy who shot the ball. But everyone knows that basketball, like most sports, is more complicated than that. There are people passing the ball; there are people defending. The thing about machine learning algorithms, and why they're powerful, is that they can pick up on the interactions between different features, which means it's actually quite difficult to understand each feature's unique contribution, because features could be working in unknown cooperation with one another towards the prediction. Think about the random forest: did Dave drive more than four hours on Thursday but less than three hours on Saturday? Thursday and Saturday are working together there. How do we say which is more important? It's a thorny problem. So Shapley values, and the people who created them have won all sorts of awards for it, are a mathematical way to estimate each feature's unique contribution to model output. I'm going to make ours bigger here. What you can see, this is called a waterfall Shapley plot, is each feature ranked from top to bottom, most important to least. It's a really helpful way to see when all of your prediction power came down to one feature. In our case, it wasn't really that way; it's a pretty even waterfall down, across the three algorithms. And what we can infer from it, at least at this point, I think is a little bit limited.
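The study used the `shap` Python package on the real models; to make the underlying idea concrete without that dependency, here is an exact Shapley attribution for a toy two-feature model, averaging each feature's marginal contribution over every order in which the features could be "added." The model, feature names, and values are all hypothetical:

```python
from itertools import permutations
from math import factorial

def model(x):
    # Hypothetical linear "quit-risk" score from two ELD-style features
    return 0.5 * x["drive_mean"] - 0.3 * x["wait_std"]

baseline = {"drive_mean": 7.0, "wait_std": 1.0}  # assumed population averages
instance = {"drive_mean": 5.0, "wait_std": 3.0}  # one at-risk driver

features = list(instance)
n_orderings = factorial(len(features))
phi = {f: 0.0 for f in features}
for order in permutations(features):
    x = dict(baseline)          # start every ordering from the baseline driver
    prev = model(x)
    for f in order:             # swap in this driver's values one at a time
        x[f] = instance[f]
        phi[f] += (model(x) - prev) / n_orderings
        prev = model(x)

print(phi)  # each feature's average marginal contribution to the score
```

By construction the contributions sum to the difference between the instance's score and the baseline's, which is the property that makes a Shapley waterfall plot add up.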
So if I look at my random forest, most important was the mean of how much time a driver was made to wait on Mondays. Next, the mean of how much time a driver drove on Wednesdays. Next, the standard deviation of how much time a driver drove on Saturdays, for the random forest. If I come over to the support vector machine, Tuesday's wait standard deviation was most important, followed very closely by Friday's mean time driving, followed very closely by Tuesday's mean wait time. If I look at logistic regression, Tuesday's wait standard deviation, Monday's drive standard deviation, Friday's wait standard deviation. A couple of things to take from this. You know, one, we don't have a clear culprit. We can't say, you know, it's all about Wednesdays. It seems all the days of the week matter. But I think also, you know, there was some internal conversation about whether we should use the standard deviations or just the means, and the standard deviations play a role. So we can take from this that not only is the amount of time that a driver gets to drive clearly important to predicting if the driver is going to leave, but also the variability around those experiences for the driver contributes to our understanding of if they're going to leave or not. Okay, a few conclusions, and I'm excited to talk about all this with you all. We got 60 to approaching 70% accuracy predicting turnover at the individual driver level by applying machine learning to these digital driver work logs. Believe it or not, that really hasn't been done before. There's lots of people who have tried to predict aggregate turnover. So they could say your company will experience 65% turnover next year, but they can't tell you who the people are. Our model actually pinpoints, these are the drivers at risk to leave, which gives managers the opportunity to try to talk to that person, to try to intervene, to try to keep them.
The other thing is that, if people didn't know, I feel like we have a growing library of resources to say that if you want to keep your truck drivers, you have to keep them moving. The surest way to lose them, to achieve that 100%-plus turnover, is to keep them sitting still. They took this job, they know it's a hard job. They took it to keep their wheels turning and be making money. They clearly don't like to be stuck sitting around very much. Couple of broader implications. This whole tool of the ELD, it was a regulatory burden. It was something the government forced on the truck drivers. I was at a bar in Boston when the law came out, and a trucking CEO basically just said to me, Dave, they're gonna make me buy these things. Can someone make them useful to me? And it turns out they are really useful, but they weren't intended for this purpose. We were able to find that use later. But the bigger one, and I've been trying to share this with as many people as I can, so I appreciate this opportunity to share it with you: I think we've misdiagnosed the problems in our supply chains. Our supply chains are sick. They went to the doctor. That's right, there are problems. But to diagnose it as a headcount shortage is a misdiagnosis. I think it's really a utilization crisis. We have enough truck drivers. We are just squandering too much of their time, and that's why it feels like we don't have enough. So with that, thank you very much. It's just an honor to be a part of this program again. And I hope we have some time to chat a little bit. Thank you all. Awesome. Well, thank you, Dave, for the insightful presentation, for sure. I have a couple of comments before jumping to questions here, but I wanted to just kind of pick up on the idea of sitting in a vehicle, not just a truck, but any vehicle. I think maybe that's part of human nature.
It's super tangential, but I know I don't like sitting in traffic, so maybe there's something in our human nature where, when we're in a vehicle, we want to keep moving. We don't want to sit around and wait, sit in traffic, sit at, you know, a warehouse waiting to be unloaded, whatever it happens to be. I don't know, maybe there's something there in our human nature. And I also want to thank you for the detailed, but also very easy to understand, presentation on some of these really kind of complicated topics. You're going into things like cross-validation and random forests and Shapley values, and, echoing some words here that are in the chat, even from some of our CTAs, you definitely explain some of these really complicated topics in a way that makes it very approachable. So I definitely appreciate that for sure. A lot of powerful tools there that you're leveraging. So awesome. So we have a bunch of great questions in the Q&A. I want to start with one that's related to something we had pre-prepared, but expanded on a little bit, and then maybe we'll jump into the audience questions. And definitely keep thinking of those questions and put those in the Q&A feature there, and we'll try to get to those as well. But just to kick things off, I want to start with a focus a little bit on the tools, you know, some machine learning there. Obviously there's other approaches. You did logistic regression, which is a little bit less of a machine learning type of approach. There's just kind of basic statistics that you could have approached this problem with. So part of the question is, you know, why machine learning? What did machine learning bring to the table?
And another part of the question is, what do you think was maybe the constraint on the accuracy with those machine learning models? You know, was it the features you had there in your data set? Was it the size of your data set? I know, for example, we're all familiar with ChatGPT and some of these machine learning tools which need massive data sets in order to really be accurate. And so I'm wondering what your initial thoughts are on what constrained the accuracy of those machine learning models you built. Well, gosh, thank you. Thank you. Great question. I think it was the data we chose to use. So I say that because we've done a couple of these studies now and our data didn't capture some of the important features. So it turns out from another study, and hopefully I'll get to come back with you guys once that one's done, that drivers who just started quit more often and for different reasons than drivers who have been there for a while. And our data didn't have how long that driver had been with the company yet, which I think, if we would have had that feature by itself, you know, we could have got another 15% accuracy. Another one that we don't have, and I just need to work harder to get it in there, is that some places for truck drivers to visit are more annoying than others. I think if we could have counted that, we would have had some more accuracy too. So that was a long way of saying the biggest hurdle to breaking 60% accuracy, I think, was that we were working with a very limited feature set. You know, how many hours did you drive and what was the standard deviation of those hours? If we had more features, I think we could have got higher. Oh, and then you asked about logistic regression versus the machine learning tools. You know, usually I find, and I feel like many people do, that the machine learning is going to be more accurate but harder to interpret. And so we kind of thought, all right, well, let's try logistic regression.
Even though maybe we'll be less accurate, maybe there'll be some interpretation bonus there. And interestingly, for this one, logistic regression was really equally as accurate as the other ones. Thank you. Thank you, Dave, for sharing the technical details. And there are so many questions about the predictors and why you did or did not consider certain things. So thank you for bringing this additional note on the kind of information that would increase accuracy, or probably be an extension of your research. I want to take a minute to go to the more managerial insight, because actually, as you just said when you got to the conclusions, your model is very precise in identifying who is more likely to leave the company. You were super clear that this is not an overall rate, it's per person. So we got a couple of questions I would like to bring here to the floor, just in case you have gone there, or if you're planning on exploring that in the future. One of the questions was from Renan, about whether anyone has actually had a conversation with truck drivers. Is there any qualitative portion of the analysis in terms of why this non-driving time is happening? Would they want to drive more? What's different? What about the idle time? What do they see or feel needs to change? And the other question, and sorry for bringing two, but they're both on the managerial insight side, is about the ethical practice in terms of how to treat the information, because you're actually identifying someone and telling the manager, this person is likely to leave the company. So how do you treat that, if you have gotten there yet, and what is the best practice you have noticed within the companies? Oh gosh, thank you. These are really good questions. So on the sense of, do we talk with the truck drivers? That is a great question. And we do for sure, as much as we possibly can. So we do have some sense, and just to give a short answer, two ways to look at that, I guess.
One, the short answer is, I was worried that drivers would not like what we were doing because we were sort of poking around records of their lives. I've not found that to be the case, but the opposite. A lot of the drivers I talked to say, thank God someone else is talking about how much my time is being wasted. And also, we have insights into, and you all do now too with your data mining skills, you can look at thousands of observations of real people's lives. And that gives so much more power to say this is a problem than if someone just says, hey, I was made to wait this long. I recently published another piece in Supply Chain Quarterly, that if we're in contact we can share with you all, that's free online, that features interviews with the truck drivers as well as some perspectives on how dwell happens. So I encourage you to reach out to us if you can't find that. It's called Are You Your Trucker's Keeper, in the Q1 edition of Supply Chain Quarterly. The quote that I'd like to bring to you from that, if you don't get a chance to read it, is from a truck driver I love talking to, Long Haul Paul, if he's on this. And he said, the worst part about my job is it makes a liar out of you when you didn't wanna lie. And he tells the story, I'll tell my daughter, I'll be home for your birthday party next week. And then he says, some receiver, some knucklehead in Kenosha, doesn't have the cheese on the pallet in time, and you don't make it home. And so we do hear from the drivers, and they talk about the detention too. On the managerial side, and I guess I'd like to add to that question, this relates to another project. So I have another project I hope I can bring back to you soon, specifically for beer delivery drivers. And it's been very interesting to do. But in both of those, we're trying to identify drivers who might quit, and then you bring up the ethical components of that.
The other thing that came out of that, and kind of getting to the ethical side of it, is that I thought I was helping truck drivers, but I think we're really helping middle managers. And I say that because, if I'm someone who manages 10 to 20 truck drivers and maybe I'm new to managing people, I don't know who I need to put extra time into keeping, and I don't know how to start the conversation if I'm a new manager of people. Our model will basically say, here are the people you need to talk to this week because they might quit, and here are the reasons we think they might quit. So it really sets up a conversation that's of assistance to middle managers. It gives a voice to truck drivers, but it helps the people managing them too. On the ethical side, you bring up a great issue, and I wanna point out a book by a professor, Karen Levy, called Drive Time, where she recently published a whole book on ethical issues. To that end, I would say, particularly in this study, we didn't know anything about demographics. In many cases, well, I never know their real names. I know only their employee code, and some companies, when they give me data, actually translate that code into something unreadable. So my code exists only for me, and I could never trace that back to a real person. So on the ethical side for us, I haven't encountered anything too concerning, because we're so deliberate to make sure no actual identifying information is available. Awesome, thank you for sharing the insights. I definitely have to check out that book too. Drive Time, you said? Yeah, let me make sure I got it right. Levy is the author. If you guys didn't know, Kellan and I are always talking about books. Oh, sorry, Data Driven. Data Driven by Karen Levy, L-E-V-Y. Awesome, awesome.
Well, thank you again for the insights. Kind of bringing your voice to truck drivers is very interesting, and also just the insight of how you're supporting those middle managers in supporting their drivers, keeping their drivers, and trying to solve this issue. There's, again, tons of awesome questions here in the Q&A. So thank you for all those questions. There's probably way too many for me to even sort through all the great ones, but I'm just gonna pull one here from Amrindir. He has kind of an interesting question on an intersection. So you build these machine learning models to make this prediction on turnover. On the other side, there's lots of tools in transportation that are meant to optimize things. So optimize allocation, optimize routes, and these kinds of things. And his question kind of generalizes: what is your experience with the interface of these two different approaches, these two different tools, predicting things and then optimizing things? Do you think your models, or this idea of predicting turnover, could be integrated into some of these more automated tools? You talked about supporting those middle managers, but there are also sometimes these automated tools that are doing the scheduling or whatever. So maybe your experience on that interface there and the opportunity there. Oh, that's a great insight, thank you for that. My belief is that it doesn't exist yet, but I agree with the questioner that it should, and I'm thinking about a few particular results. So from another study that we did with similar data, we were just looking at what predicts how long a driver has to wait at a facility to be loaded or unloaded. So that directly relates to the utilization that we've talked about so far. One of the strongest predictors was how many times has that driver been to that facility before?
So if it was their first time, they might expect a three-hour wait, second time, two and a half, and then it really fell off with five or more visits. And so there's a lot of ways to try to understand how that could be, but some of them are really practical. Like the first time you get there, you don't know which entrance is the right one. You don't know where you need to pull up to and who you need to speak to to let them know you're there for your freight appointment. There's some practical solutions. And then the softer-side hypotheses about that observation are, maybe the more you go, the better relationship you have with the people there, and you're allowed to get yourself in more quickly. We don't know which one it is, but we know it's true. And I bring that up in that I don't think freight scheduling systems ever sort of add the constraint, or add the incentive, to not send a driver to all different places. Or, to say it differently, I've not seen it where it's built into the optimization to try to send the driver to the same places over and over. It's really more about minimizing overall network cost, not maximizing driver visits to familiar locations. So that's part of it. And I think the questioner has a great insight, if we could build that in. And frankly, I think companies would be interested in it, because turnover is such a big problem. The other way I'd like to see it integrated is if you think about Long Haul Paul's quote about not wanting to be a liar. You know, he didn't say he didn't mind waiting. He just said he wanted to be able to tell his family, I can be home by this date.
So I think sometimes, if facilities do take a long time, of course we want them to be faster, but if they would be open about that, if they would share that data, you know, if you come on a Thursday, it might take six hours, then the network can be optimized around that real wait, not a fictitious estimate of how long it will take. And it's really the same thing as if you try to make a dinner reservation on Google, or you look up a restaurant on Google, and it will say this restaurant is about this busy at this time. I think if our network planning had that level of accuracy, we could really lower the standard deviations on drive time and wait time in our model, keep more drivers, and get better predictions. So I think the questioner is right on. Thanks, David. There are so many questions coming in, so we are sure we're not making it through all of them, but for sure they can contact us, or you, through the website probably. So Emma just shared the Freight Lab website in the chat, just in case. In the meantime, there are so many questions about extending your research to Europe. So I was wondering if you have considered such options and what are the things that could change? And they are already guessing, because everyone here is jumping in and sharing their thoughts. Maybe the fact that compensation in the US is by miles, and then in some other countries regulation forces like monthly or weekly or bi-weekly payments. So what are your thoughts on that? Oh, my first thought is I hope I can read that part of the chat. It sounds like there's some interesting conversation there. I would love to do this in Europe. It's really that we only recently learned that the problem is equally as big there, so we really weren't thinking about it as much. So yeah, the way truck drivers are paid here, I think it's very strange. It is like Laura said, it's per mile.
So there's usually a negotiated rate, or sometimes there's a market rate, but it's something like $2.50 per mile driven. So you can really understand then why drivers want to drive more here, because if they're not driving, no one's paying them for their time away from home. You know, I'm not sure what the rules are in Europe around payments or around hours of service. So that might be different. I have had European audiences, when I've shared this, say, and maybe this is what's already in the chat, that it's a little more complicated in Europe too, because you can have drivers whose home is in one country and whose company is based in another country, and they might have the rules of that country, taking freight across borders where the rules are different and the reporting is different. So that could be hard, but I'm up for it. Let's do it, that'd be fun. So thank you. I know we're kind of getting close to our time here. I don't know if we want to launch maybe our last poll. We'd like to query just to see what you all learned from today's session, what you found insightful from today's session. Some of the options there are understanding how machine learning gains insights, and real-life considerations in transportation. So if you could give a couple of seconds there to take a look at our poll. And while we do that, we might have time for one or two more questions, but I'll pull another question here. This is kind of a forward-looking question. It kind of extends this idea of the driver to, now, at least some companies are trying, and I know some companies are struggling with, this idea of autonomous driving, and how this might impact this situation. I know it's been proposed as a solution for this truck driver shortage, but it also seems like there's really more of a utilization crisis. And so how does this idea of autonomous trucks play into the mix here?
Oh gosh, thank you for that question. So two thoughts come to mind first, but probably we could do a whole other hour. The first is, part of the way our supply chain networks are set up in our country has to do with the hours-of-service rules on human drivers. So, you know, a driver can only drive 11 hours and then they have to stop. And so if you have your factory 14 hours from your warehouse, that has to be a two-day trip. So there's incentive to keep things within a one-day trip for a human driver. And a lot of the footprint of our supply chain networks is built around that reality. So the first thing that I think happens with autonomous driving is, if the rules are relaxed for robot drivers, which you could argue they should be, the robot won't get tired, then the footprint of our supply network should change, because the trucks can go further. They don't need to stop at that 11-hour break. So that's one sort of maybe exciting thing, particularly for those of you in SC2X learning about supply chain design. There's a heartbreaking side to it too, though. I think autonomous trucks will be hugely expensive. Well, I shouldn't say that; they will be expensive and they will be very, very sophisticated, state-of-the-art robots. I think that the waiting that we force on our human drivers will not be forced upon the robots, because they are so expensive, and companies would see you can't let something that valuable sit idle for so long. So why I say that's heartbreaking is that I think we'll treat our robots better than our people in that case, because the robot's time will just be considered by management to be worth so much more. Thank you, thank you for bringing that up. I think it's insightful and probably brings to the table some things we don't usually have in mind. We usually are analyzing data, so it's great to have that perspective too. Thanks for sharing that.
You got a lot of great comments from our audience, and just a final thought before we go to the poll results. As you know, the audience here is a very broad one. Most people come from our courses, but there's also people out there on LinkedIn that joined us, and we were wondering, what's your piece of advice for those that are trying to implement tools like machine learning, but are just starting their SCM journey or trying to gather experience? They probably are not there yet with programming, and they are just using Orange. So what would be your advice to those that are trying to get closer to your way of researching? Oh gosh, so for anyone who's interested: I think a good question is worth more than a good programmer, so it's really important to start at the beginning, and if you can ask an interesting question and find a really important way to be helpful, the rest of it will come more easily. So don't overindex or overinvest your time in programming skills. I think you need to be a sharp critical thinker and come up with good research questions. But then, of course, you need the programming skills, and the only things I would bring to you are: one, everyone fails all the time. I certainly spent days trying to figure out why my code wouldn't run over the course of this project, and I think everyone does. So if it's not working for you, it doesn't mean you're bad at it. It just means you're in the arena with the rest of us. And the other piece of advice, and I need to take this myself: my current students here at MIT are moving at maybe five to 10 times the speed of students years ago by using ChatGPT as a coding assistant. I'm not that good at it yet myself, but from what I've seen, you know, if you know the basics, ChatGPT can help you execute the smaller pieces, and that can save you a lot of time. So it's worth giving some time to learning how to make that your virtual coding assistant. Awesome, thank you, Dave.
Those are great insights and suggestions for many of our audience who are learners in our SCx courses. And so I think with that, we're a little bit over time, but I wanna kind of wrap things up here and share some of our poll results very briefly. It looks like, you know, lots of positive outcomes, with much of our audience interested in understanding machine learning insights and real-life considerations in transportation, and I think that's great. I think that brought a lot to the table. Dave, do you have any final thoughts on that poll result? Oh, it's just that I think you're interested in the right space. I mean, there's a lot of opportunity here. So excited to see so many people equally excited about it. Awesome. So with that, I wanna thank you all in the audience for bringing all your great questions. There's way more questions than we would have time for. We'd need another hour just for those questions, and also many more hours to discuss some of these fascinating topics, like autonomous driving and optimization with machine learning integration and all those different topics. And so hopefully we'll get to bring you back here, Dave. Laura, it's always a pleasure to co-host with you. And Dave, I wanna thank you very much for your time today. I really appreciate your insightful presentation and all the questions. It's been a pleasure to have you back here. Oh, honored to be back. Thank you both for having me. Thank you. Thank you, David. Thank you, Kellen. And to everyone, stay tuned. You know, this is just the first webinar in a series. So join us for the upcoming one. Thank you, everyone. Thank you, everyone.