Hello everyone, and welcome to our next EW session, called "A Journey Using Machine Learning to Predict Student Populations," which will be presented by Tripp Ray and Art John. Tripp is the data architect and analytics team manager at Chesterfield County, and Art is a data analyst, also at Chesterfield County. All audience members are muted during these sessions, so please submit your questions in the Q&A window on the right-hand side of your screen, and our speakers will respond to as many questions as possible at the end of the talk. Please note there is a linked form at the bottom of the page titled ADW conference session survey. This is where you can submit session feedback, and we encourage you to do so. So let's begin our presentation now. Thank you and welcome, Tripp and Art. Tripp, hello. Thank you very much, Ann. Well, I am Tripp Ray, in fact, and I manage the data and analytics team for Chesterfield County. It's a group of developers, engineers and analysts that curate data from most of the systems of Chesterfield County into a large data warehouse. So a couple of years ago, we were given the opportunity to see if we could leverage some of the data that we've been collecting to help predict where student populations might grow in the county in the next 5 to 10 years. Pretty exciting project. The only challenge that we had, for the most part, was that none of us had any experience with data science. So this presentation is a chronicle of the steps that we went through, some of the lessons that we learned along the way, and maybe a couple of aha moments. So I hope you find it useful. And I guess we'll just keep on proceeding here. So the next thing I need to talk about is a little bit about our county. We are a fairly diverse county. It's 426 square miles of land just south of the capital of Virginia. About 60% of this county is zoned as agricultural.
But there are areas near the capital city, along that border, that are very densely populated and very urban in nature. We've got about 60,000 students in the county. We've got a median income of 82,000, so it's a fairly affluent area. But the point behind the zoning that I'd like to make is that when you're trying to come up with some sort of rules or formulas to figure anything out about our county, it's so diverse that one size fits all doesn't really fit the bill here. So let's talk a little bit about our school system. It's a well-accredited school system. We're recognized by several national agencies, both for some of our individual schools as well as the entire school system. Now the charts I'm showing are some of the indicators for change from year to year that our school planners will use to try to compensate for upcoming changes in student enrollment. The upper right-hand corner is a map of our county. The areas that are on that map are the 2019 elementary school districts. And the color coding is a ratio of the current student populations in those districts compared to the capacity of the school buildings. The school buildings typically have a capacity of 750 to 950 students. So the areas that are redder are cases where the number of students currently enrolled in that school building is approaching, and in some cases has exceeded, the capacity of the building. The greener areas are where the number of students is well under the limit and in fact, in some cases, is decreasing. The chart in the lower left-hand corner is an indication of our typical housing sales over the past 20 years. The red line along the bottom is all of the new home sales, and you can see that's pretty steady. We typically add 500 to 1,500 new homes in the county every year. The yellow line is land sales, so that is more erratic. That's a pretty good indication of developers that come in and buy large tracts of land to convert them into neighborhoods.
So you'll see that kind of come and go. And then the green line on the top is existing home sales. So again, fairly consistent, certainly up years and down years, but you've got anywhere from 3,500 to 6,000 existing homes turning over every year. And when you look at the household types in the county, we've certainly got a mix of residential homes, apartment buildings, mobile homes, condos and townhouses, but the predominant housing type in this county is single-family homes. We're currently supporting 66 schools in our system, and that includes a couple of technical centers. This county tends to be attractive to young families for a couple of reasons. One is the high rankings of the schools in this county. Two, it tends to be a bedroom community, since we're so close to the capital of Virginia. And then lastly, our property taxes are very favorable to young families. So many people will come in just because of the school system. So let's talk about the housing sales for a minute. This chart again is a map of our county, and overlaid are the home sales for the past year. The green dots are existing homes and the red dots are brand new homes. And you can easily see where all the clusters are, which is an indication of denser populations. Across the northern section is that urban-like area that borders the city of Richmond. Then down towards the east is a pretty heavily populated corridor that runs north and south down the East Coast. But that section that's just west of Pocahontas State Park, which is the green area in the center, is a cluster of relatively new neighborhoods that have been developed from farmland over the years. And you can easily see that's where most of the new homes have been for the past couple of years, as a matter of fact, not just last year. And then all of that is in stark contrast to the far west and southern areas, which are very sparsely populated.
Those areas do have students, but the properties there are much, much larger plots, and there are fewer homes and fewer students, of course. Another big factor is the number of newborns in Chesterfield County. That tends to be pretty consistent as well, anywhere from 2,500 to 3,500 a year. And that's the charts and buying. So our school planners have a couple of tools they can use to try to adapt to changing student populations. The lowest on the scale would be to create special programs, so they will set up math or science programs for gifted students in underpopulated schools to try to draw kids away from the highly populated schools. So that is one method they can use. Second might be to redistrict. Now that is not a terribly expensive option, but it has a high political cost. Moving kids around from one year to the next makes a lot of parents angry. So they do that cautiously and make sure it's a well-thought-out process, and then brace themselves for a lot of phone calls and all of that. Third would be adding trailers to schools. We've certainly got a small army of trailers that we could pull in at the last minute to park on a school campus. But then by far the most expensive options are remodeling schools, adding buildings to schools, or even building a brand new school. All of those usually require some sort of funding option that needs to be justified to the residents and to the board of supervisors. So it's not taken lightly. And building new schools, of course, takes years. So trying to project out that far needs to have a lot of credibility, to be able to justify the $25 or $30 million it's going to take to commit to that. So today, the most commonly used student forecasting methodology here, just by the way, is something called the Enrollment Forecast for Excel, which was developed by a Dr. Kirby Smith. And essentially, and I don't want to oversimplify this, it is an Excel spreadsheet that has many formulas in it.
And they can use that to plug in the enrollment numbers from the past. And it uses cohort survival ratios as a basis for predicting enrollment into the future for the same schools. It is very popular and has been used for years. It's somewhat accurate. I mean, the cohort survival ratio is based on the notion that most kids that attended a school level in one year are going to move on to the next school level next year. But it is very limited in the kind of information you can bring into it. So our planners will start with this. Then they consult with our community development people to see where new developments are going up and where assessment values are changing, and then they'll adjust their estimates based on other input that is not available to the spreadsheet. This also tends to lean towards a one-size-fits-all methodology, which, as diverse as our county is, isn't always useful. Now, when I say it's accurate, it usually is, much of the time. But if we are off by as much as 10% per school, and that's not uncommon, that's an entire classroom of kids that you have to deal with during that first week of school. So, you know, if the first week isn't chaotic enough, now we're trying to figure out what to do with this extra classroom of kids that showed up, and bring in trailers and move teachers around or what have you. So where did we start? First of all, we had no experience with data science. So we hired a consulting firm. And we brought in Capital Consulting, I mean, excuse me, Catapult Consulting, which is a Microsoft Gold Partner, and they do have practical data science experience. And being a data team, we figured the first place to begin was a data set that was based on school enrollment. It seemed like an easy thing to do. So we had a file that had one row per school, per school year, with the number of students in that particular year. But being data people, we wanted to make sure we had more fields in it.
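As an aside on the cohort survival idea mentioned in this part of the talk: a minimal sketch of how such a ratio could be computed and applied is given here. It is not the spreadsheet's actual formula set, which the talk does not describe. The function names, the grade keys and all enrollment figures are made up for illustration, and the entering class is simply supplied by hand.

```python
from statistics import mean

def survival_ratios(history):
    """Average ratio of grade g+1 enrollment to the prior year's grade g.

    history is a list of {grade: enrollment} dicts, one per consecutive year.
    """
    grades = sorted(history[0])
    ratios = {}
    for lower, upper in zip(grades, grades[1:]):
        yearly = [
            later[upper] / earlier[lower]
            for earlier, later in zip(history, history[1:])
        ]
        ratios[upper] = mean(yearly)
    return ratios

def project_next_year(current, ratios, entering_class):
    """Advance each cohort one grade using its historical survival ratio."""
    grades = sorted(current)
    projected = {grades[0]: entering_class}  # lowest grade is an outside estimate
    for lower, upper in zip(grades, grades[1:]):
        projected[upper] = current[lower] * ratios[upper]
    return projected

# Hypothetical enrollment by grade for two consecutive years at one school.
history = [
    {0: 100, 1: 100, 2: 100},
    {0: 110, 1: 105, 2: 90},
]
ratios = survival_ratios(history)
projection = project_next_year(history[-1], ratios, entering_class=120)
```

The sketch also shows the limitation raised in the talk: the only input is past enrollment, so new developments or boundary changes have to be handled outside the calculation.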
So we added other characteristics about the school districts in those particular years. So we added the total number of households, the total number of households by type, average assessed values of those buildings, a variety of fields. So we curated a data set that has 70 columns in it, with about 1,200 rows. It's one row per school, per school year. We also learned along the way that most of our models tended to prefer numeric values over character values. So where we started with values like maybe APT for apartment and HSE for house, for example, we converted all of this to numeric values to make the model happier. So we started with a tree-based model. And given that we had 20 years' worth of history, we held the last three as our control years. So we'd feed the first 17 into the model and use it to predict years 18, 19 and 20. Now, a tree-based model gives you a few dials and knobs that you can turn, like the number of estimators, the learning rate, the max depth and the random seed. And we iterated through this, giving various values to those knobs, until we got the model moving towards the control set as close as we could get it. We were never able to make it as high as the control set, as you can see in this particular graph. But we did end up with the predictions for the control years at least in parallel with, and moving in the same direction as, the actual control years. Now when we used that model to extend out into the future, it actually predicted a decrease in student population. So even though we felt good about the accuracy rate of the model, and even the model error rate was fairly low, nobody felt good that we were going to have a decrease in student population in the next five years. So that didn't fly. So what we concluded was that the tree-based model was probably not the best approach for projecting growth.
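To make the control-year setup concrete, here is a small sketch of a chronological holdout: train on the first 17 years and compare against the last 3. It also illustrates a general property of tree-based regressors, namely that their predictions are averages of training targets and so cannot exceed the largest value seen in training. That property is offered only as one possible reason a tree model could sit below a rising control set; the talk does not diagnose the cause. The depth-1 "stump" stands in for a real boosted model, and the data is synthetic.

```python
def chronological_split(rows, n_control):
    """Hold out the most recent n_control years as the control set."""
    years = sorted({r["year"] for r in rows})
    control_years = set(years[-n_control:])
    train = [r for r in rows if r["year"] not in control_years]
    control = [r for r in rows if r["year"] in control_years]
    return train, control

def fit_stump(train):
    """Depth-1 regression tree on year: pick the split with least squared error."""
    best = None
    years = sorted(r["year"] for r in train)
    for threshold in years[1:]:
        left = [r["students"] for r in train if r["year"] < threshold]
        right = [r["students"] for r in train if r["year"] >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((v - left_mean) ** 2 for v in left)
        sse += sum((v - right_mean) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    # Each leaf predicts the mean of the training targets that fell into it.
    return lambda year: left_mean if year < threshold else right_mean

# Synthetic school with steadily growing enrollment over 20 years.
rows = [{"year": 2000 + i, "students": 500 + 10 * i} for i in range(20)]
train, control = chronological_split(rows, n_control=3)
predict = fit_stump(train)
predictions = [predict(r["year"]) for r in control]
highest_seen = max(r["students"] for r in train)
```

On this synthetic trend every control-year prediction stays at or below the highest enrollment in the training years, while the actual control values keep rising.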
But before we went there, we thought we should look back at our data and see if we could find anything about it that might be sending us in a direction that we didn't need to go. So we did a couple of correlation and SHAP plots on our input data set. And then we realized pretty quickly that the predominant field was still student enrollment. So all those other fields that we added, you know, number of households and sales and bedroom counts, none of those appeared to have any particular influence over the outcome of the model. So that didn't fly with us. The next thing that we wanted to explore is the effect that district changes have on our results. So in this chart, what you see is the number of students, looking at them through two different lenses. This is for one particular elementary school, and this represents the elementary school district as it is today. So if you can imagine, that green line represents the students as if the district had never changed in 20 years. And for the green line, those students that are in the current district, you can see it's fairly consistent. They had growth in 2003 and 2004, a little bit of adjustment. But since then, it's been fairly consistent for the past 20 years, with maybe a little bit of drop-off towards the end. But the red line is the actual enrollment for that school. And what that's reflecting are some severe district changes that that school went through over the course of 20 years. So this erratic behavior in that enrollment number, we knew, was going to have some kind of effect on our machine learning. So to underscore that even further, we tried to do a scatter plot to see if there was any sort of strong relationship between district changes and enrollment changes. And as you can see, not so much. And the chart in the upper right-hand side is telling you the number of students in the entire county over a 20-year period. That's the blue line, and that's fairly consistent.
But the red line is the number of acres that changed districts over the 20-year period. So you can see these huge spikes where entire neighborhoods moved from one district into another. And that is going to have a dramatic effect on enrollment. So the conclusion that we drew from this is that our model was seeing that adjustments were being made in the past and was therefore predicting that adjustments would be made in the future. So we decided at this point that what we needed was a data set that took the human element out of the equation as much as we could. So we created a different data set. This one is based on land. We have 20 years' worth of history on our land parcels, so we began with that. And Chesterfield County has had between 100,000 and 110,000 parcels in the past 20 years. So now we've got a data set that has one row per parcel per year. And on each of those rows, we have as much detail about that parcel as makes sense to have: whether it had a house on it at all, whether it sold that year, what its assessed value was, what the sale price was, how many bedrooms it had, and maybe more importantly, how many students live there. And we're grateful to say murder. So now we have a data set that has about 50 columns in it, but two and a half million rows. Now, what we'd like to do is run 100,000 models, so we could tell what each parcel might do, right? But that wasn't reasonable. So we had to come up with an approach for how we might group this data together in some way, to be able to manage it and make it more reasonable to handle. So before we did that, we had to come up with a way to see if we were even close. So we did some analysis on the data that we built, to see if there were any particular correlations between the number of households and the number of students. And what we found is that there are some areas where an increase in the number of households seems to coincide with an increase in students.
And that's the green you see. However, the neutral areas are where there doesn't seem to be any correlation between the number of students and the number of households at all. And the red is almost the opposite. So there are cases where the number of households has increased dramatically and the number of students has decreased. So we're not feeling real good that the number of households has a direct relationship with the number of students. But we just wanted to go through this exercise to kind of get our mindset away from that assumption, which has been a pretty common basis for schools. So back to our parcels. Here we've got all this data at the parcel level. We had to come up with some way to group the parcels into some kind of categories that we could then do a reasonable number of models on and come up with some reasonable numbers, and yet keep it at a low enough level of granularity that we felt was going to give us an accurate result. Now, being a county, you know, our land is all over the place. Many cities have the notion of sectors that they can develop, because most cities have straight lines, so they can put blocks together that have similar characteristics. For the county, because so much of it is agricultural area, we don't have the opportunity to do that. So we developed algorithms that start with the assessed value, the number of bedrooms and the lot sizes. And we came up with this notion of geographic clusters, if you will, which are smaller than school districts but much, much larger than a parcel. And that's what we fed into our model. And that's how we came up with our numbers by geographic cluster. So this time we used a linear regression model, feeding in these geographic clusters. And we were very surprised at the results. We've got a very favorable mean absolute error. We've also got the growth/decline accuracy rate, which is the share of areas where we were projecting in the right direction.
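The two evaluation measures named here can be sketched in a few lines. The mean absolute error is standard. The growth/decline accuracy is written the way the talk describes it, as the share of clusters where the predicted change has the same sign as the actual change, but the exact definition the team used is an assumption, as are the function names and the sample numbers.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the miss, in students, ignoring direction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def direction(change):
    """+1 for growth, -1 for decline, 0 for no change."""
    return (change > 0) - (change < 0)

def growth_decline_accuracy(baseline, actual, predicted):
    """Share of clusters where predicted and actual change point the same way."""
    hits = sum(
        direction(a - b) == direction(p - b)
        for b, a, p in zip(baseline, actual, predicted)
    )
    return hits / len(baseline)

# Hypothetical student counts for four geographic clusters.
baseline = [100, 200, 150, 80]    # last training year
actual = [110, 190, 160, 80]      # control year, observed
predicted = [105, 195, 140, 85]   # control year, model output

mae = mean_absolute_error(actual, predicted)
accuracy = growth_decline_accuracy(baseline, actual, predicted)
```

With these made-up numbers the error is 8.75 students per cluster and the direction is right in two of four clusters. The two measures answer different questions, which is why a model can look good on one and poor on the other.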
So if there was a general increase in students and our model was predicting in that direction, that was good, or if there was a decrease in students and our model predicted a decrease. That's what that's telling us. So we felt like this model has good value. Now the thing is, going back to our control years, we haven't been able to get it to be parallel with our three control years. But it is closer to the actual values. And once we projected five years out, it looks like we're getting a little bit of an increase in student population, and certainly not a decrease. So it still tends to under-predict student populations, at least that's the way we feel about it. But we do feel like it has better predictive value, certainly as a data set. And we will probably add more values and then try a different modeling approach. And then coming back to our SHAP plots, now we can see that many more of our elements are starting to have an influence here. So it's not just about student enrollment anymore, although that is large. But now we see other factors coming into play, like new construction, building counts, and various things. So just looking at the number of students at the land level, we feel like we're on the right track here. So let's look at this from a map perspective. These two maps you see on the right are, again, back to the county. The first one is the 2019 elementary school districts. The bottom one is our geographic clusters. So all the data that we predicted is at that geographic cluster layer. The next thing we were trying to figure out is, now that we have these numbers, what do we do with them? So we decided to take these predicted numbers of students and layer them into a GIS map using these geographic clusters. We did that by taking each of the parcels that was a residential parcel in that geographic cluster and assigning it its percentage of that forecasted student count.
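The allocation step just described, spreading a cluster's forecast evenly across its residential parcels, might look like the following in outline. This is a sketch under stated assumptions: the field names, the cluster and district labels and the parcel counts are hypothetical, and a plain dictionary lookup stands in for the spatial overlay a real GIS would perform. The second function sums parcel shares inside whatever boundary each parcel is assigned to.

```python
from collections import defaultdict

def parcel_shares(cluster_forecast, parcels):
    """Give each residential parcel an equal share of its cluster's forecast."""
    homes_per_cluster = defaultdict(int)
    for p in parcels:
        if p["residential"]:
            homes_per_cluster[p["cluster"]] += 1
    shares = {}
    for p in parcels:
        if p["residential"]:
            total = cluster_forecast[p["cluster"]]
            shares[p["id"]] = total / homes_per_cluster[p["cluster"]]
        else:
            shares[p["id"]] = 0.0  # non-residential parcels carry no students
    return shares

def boundary_totals(shares, parcel_to_boundary):
    """Add parcel shares back up inside any set of boundaries."""
    totals = defaultdict(float)
    for parcel_id, share in shares.items():
        totals[parcel_to_boundary[parcel_id]] += share
    return dict(totals)

# Hypothetical cluster "A": 1000 homes, 100 forecast students, plus one park.
parcels = [{"id": f"A-{i}", "cluster": "A", "residential": True} for i in range(1000)]
parcels.append({"id": "A-park", "cluster": "A", "residential": False})
shares = parcel_shares({"A": 100}, parcels)

# Hypothetical boundary that puts the first 600 homes in "North".
assignment = {p["id"]: ("North" if i < 600 else "South") for i, p in enumerate(parcels)}
totals = boundary_totals(shares, assignment)
```

Because the shares live at the parcel level, changing the `assignment` mapping and summing again is all it takes to see how a redrawn boundary would shift the forecast.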
So if the geographic cluster had 1,000 homes in it and is predicting 100 students, then each of those homes got a 0.1 as its contribution towards those future students. So next, we overlay those numbers onto the actual parcels, at the parcel level, in that map. And what that enabled us to do is to come back with the elementary school districts, wrap those districts around those parcels, and add the numbers back up. And now the school planners have a tool where they can glance and see what the model thinks is going to be the number of students in that school district five years from now. Now, maybe even more importantly, they can use GIS to adjust the school district in a what-if kind of analysis and have it recalculate the number of students, to see how that's going to affect the number of students in that district. And because all of this is in a GIS layer, we can bring in other attributes, whether they are planned developments, or zoning cases, or parks. Whatever we want to bring in, we can add to this layer and bring additional information to school planners, so they can think about what it is they're trying to do and take that into consideration. And again, because this is a geographic map, you can drill down to the parcel level. So you can zoom all the way in to see street levels, parks, whatever you need, to make adjustments to the school boundaries using natural boundaries or whatever you need to make a decision on. So now I realize I have moved much faster through this presentation than I did in all the times that I rehearsed it. So we've got plenty of time for questions and answers. So I think, even though we are ahead of schedule, now I'm going to open it up to any questions that have come in through the Q&A. So what are that? Hi, Chris. Thanks so much for this great presentation. If you have questions for Tripp and for Art, you can submit them in the Q&A on the right-hand side of your screen. One came in early.
What do the white regions account for in the map? Which map was that? It was really early on, one of the first ones you showed. It might have been the student map, perhaps, or it would have been the direction of our prediction models. Let's see. I think the prediction map. Yes. Let's try that. And there you go. First map. All right. So the color coding in this is a range of color. At one extreme, the dark reds are where the number of students was approaching the capacity, and the dark greens were where they were the most relaxed. Anything that's a lighter color is what's in between those two, so fairly neutral in terms of number of students for the building capacity. I love it. So I don't see any additional questions right now. So what's been your biggest lesson from all of this? So two things stand out to me. One is removing the human factor from the equation, coming down to modeling data at the parcel level. So we're just trying to figure out where students have lived in the past and look for patterns that lead into where students are going to be in the future. That's from a geographic point of view. And taking districts away from the equation is much more valuable to us. And then being able to come up with those numbers in a way that they can visualize them, if you will, using school districts any way they choose to. So that was big, because what we're finding is that there are some areas in our county that are newer areas, with younger families and a higher concentration of students. Other areas tend to mature. You have families that live there longer, after the kids have gone off to college and left the nest, so you have a lower concentration of students. But then they tend to go through a period where the houses start flipping again and younger families come in. So the number of students in each area can ebb and flow over time depending on the maturity of the area. So that was one big lesson.
The second is, what do you do with the number when you're done? So just coming up with a spreadsheet, which we did, of course, that has a list of schools and numbers on it was one thing, but being able to take those numbers and lay them into a map that planners could use to make changes with, that was pretty big. So I would recommend, one, as you're modeling, remove the human element from the equation as much as you can. And then second, think about how you can take the results and present them back so that they can be leveraged. Yes, just for this company. Right? I mean, yeah, just. And again, right, they got to do that. It's so difficult to, you know, keep up with the number of kids and where they are. But yeah. And have you considered extending the model to forecast the number of students in grades by school? We have not. We talked about it. But no, we haven't. And we do have a couple of blind spots that we're working out. The only information that we have is about kids in the public school system. What we don't have is information about kids in our county that are either going to private schools or are homeschooled. So we don't feel like we have the complete picture of students on the ground. And in fact, this year, because of the pandemic, we have had a drop in student enrollment, because, I don't know, 5% of parents decided to homeschool. So we expect them to come back, right? Or maybe not. But there are definitely some blind spots. Another question here. So one problem is that that is not a lot of data. I wonder if there are ways to get around the small sample size to increase the accuracy of the model. Any cases with similar demographic use? Good question. Great question. And if we hadn't had 20 years' worth of parcel history, you know, I don't know what we would have done. We did try to compensate in the enrollment model by adding numbers about the number of acres that each of those districts shifted in the given year.
So we fed that into the model as well, but it didn't have any particular influence on the results. We tried coming up with as many variables about each of those district changes as we could think of, to see if we could influence the model to pick up on that and use it in sort of a, you know, leading-indicator way, but it didn't pan out. At least it didn't with our tree-based model. Now, what we'd like to go back and do is maybe take our new data set to a tree-based model, or try the enrollment data with a linear regression model, and maybe it would. Yes, an additional comment: accounting for charter and private schools would be so important in Pennsylvania. It is, but it's private information, so it's hard to get. Yeah. So is there any final takeaway or something, you know, hindsight is 20/20, anything that you would have changed? Sorry, I was reading the questions. And in this presentation, the last slide is, of course, our credits. We couldn't have done any of this without all the data we've got now in the county and the information that we've learned from our Chesterfield County Public School system, or the support we have had from Catapult. So kudos to all of them. We wouldn't have made the good progress that we have without all the support from our partners. But also, there is my name and email address. If you have any questions, or if you're also working on trying to come up with some sort of models for school enrollment, I'd love to hear from you. We're always open to new ideas. So our next step here will be to see if we can start predicting emergency events. So we're going to start collecting information about home fires, accidents, you know, whatever, to see if we can start looking for information that could predict where our emergency services need to be ready and for what types of events in the future. So, next step. Anyway, I don't think that's the question you asked me, though. So what was that again?
I just said, you know, hindsight is 20/20, is there something that you would have changed? Yes. Yes, indeed. So I think anybody working in the public sector would realize that there is a thing called politics. And everyone has a different point of view about what their role is with the county. So getting buy-in from your school planners up front is very important, because you need their information, you need their input, and you need to have their guidance going forward. You need to be able to look at what you're getting ready to produce from their point of view, because they're the ones that are on the hook for coming up with some sort of reliable prediction to justify bond referendums to build schools. So being next to them is critical to being able to understand their life and be able to produce something not only that they completely understand and support, but that you're doing in a way that's consistent with the way they think. So yeah, that's a step we kind of fumbled into. You know, we tended to start putting things together and pulling people in after the fact. And so we did, I think, spend some time trying to get to know each other and relearn all this along the way. Some of that could have been avoided, I think, if we had developed a relationship with our school planners up front. I like it. And a question came in here: you held aside recent years for test data, so how does the model change if you hold aside randomly selected years for test data? Ask that again? You held aside recent years for test data. So how does the model change if you hold aside randomly selected years for test data? Well, that's a good question.
And one of the variables that we've encountered is that when you look at the number of students we've had in our schools over 20 years, back in the financial crisis that we had in 2009, I believe it was, there was actually a dip in school enrollment. It wasn't much, but there has not been a continuous increase in students. There has been this general increase, and then a dip in the middle, and another increase. We found that dip in the middle was starting to influence the model. So we tried some different ways to compensate for that dip to see if it would influence the outcome of the model. We weren't as comfortable reducing the number of control years we had, because we wanted to make sure we had strong predictive value going forward. But we did start playing with how many years we used to feed it and trying to skip around, if you will, that dip in the middle. But at the end of the day, I'm not sure that it really made a difference to us. And was the school district an active participant in this project? Are they going to use these models as justification for planning? Not so much. We certainly don't want them to take these numbers alone. It's just another set of numbers that they can use, just as they have to collect information about developments and zoning, economic conditions, and everything else. So it's just another set of numbers that they can use to help influence their decision and make a more intelligent guess. So at the end of the day, that's what you're doing, you're guessing, right? Indeed. Well, I'll give everyone a little more time for any additional questions that they have. There's a couple of comments in here, you know, that the model should also consider cases where new schools opened and, conversely, schools that closed. You know, anything you want to add to that? I'm sorry, you broke up a little bit? Oh, yeah, so there was another, and I'll give everyone a moment here to ask any additional questions.
But there was a comment in here that the models should also consider cases where new schools are opened, as that impacts the nearby districts. Anything additional? You're exactly right. And that is reflected in the enrollment model. You know, new schools tend to attract more kids, right? And there's this cause and effect there that is not real clear to us. And not only new schools, but even shopping centers. So this growth of urban areas and suburban areas, along with new shopping and new schools, all seem to feed on each other. And it's not clear to us that one is the cause and the other the effect. But yeah, that was another factor on top of the district changes. We had new schools over the years. All of a sudden, you've got this new school pop up in the middle of your model, full of students, right? So that's another reason to try to take the districts out of the equation. All right. I love it. Well, yeah. Anything that you want to add at the end here? That is all the questions we have for now. Think I'm good. And again, you've got an email address. So if you think of something after the fact or have ideas you'd like to share with us, feel free to reach out. I love it. And you can find Tripp and Art in the speaker section as well. If you want to connect with them, you can in the SpotMe app. I thank you both for a wonderful presentation. We just want to note again that there's a linked form. This is where you can submit session feedback. And that wraps it up. You're welcome to continue networking with the other attendees within the SpotMe app as we take a quick break between sessions. We look forward to seeing you, and don't forget to check out the sponsor section for more information about the tools you need to support data management programs. Thank you, everybody. Thank you, Chris. Appreciate it. Have a good day.