...about the company that I work for, which is Gojek. Not many people know about it. Gojek is a technology startup based in Jakarta, Indonesia. It is referred to as a super app in Indonesia, providing 18-plus services, which you can think of as an aggregation of multiple companies in India, like Swiggy, Ola, Uber, Housejoy, Justine, et cetera.

Who am I? As I said, I work as a data scientist at Gojek. I have around four years of experience in data science. Previously, I have worked with three startups in the field of logistics and mobility.

Before we go ahead and talk about building an automated daily run sheet, I want to talk about what a daily run sheet, or DRS, actually is. People from the logistics industry would know that in the last-mile part of logistics, there is a sheet you prepare every morning and hand over to the field executives when they go out with the shipments to deliver. That sheet is called the daily run sheet, and it has details like serial numbers, all the waybill and order numbers assigned to that executive, along with the customer name, customer address, contact number, the cash to be collected for that particular order, the type of the order (whether it is just a pickup or it is to be delivered), the customer's signature, and other details.

Usually, if you look at logistics operations, be it in India or outside, this part of creating the daily run sheet is manual. Most of the time there are dispatch centers, which are the centers from where the field executives leave with the shipments. These dispatch centers have center managers who make the sheet and sort the shipments early in the morning, spending around three to three and a half hours depending on the size of the dispatch center, so that all the FEs can take them.

Now, there are multiple problems associated with a manual DRS. Efficiency obviously goes for a toss when somebody is spending three and a half hours sorting manually. Apart from that, there is a heavy dependence on human knowledge, because when you hire a center manager for a dispatch center, you expect him or her to know the entire locality; only then can they use that knowledge to sort the shipments. So replacing that person becomes a challenge for the company. And it does not help you at all to efficiently utilize your FEs, your center manager, or even the facility, for that matter.

So what's the solution to this? Well, automate the creation of the daily run sheet, right? That's the solution, and we will see how we can do that. The first question, if I have to automate something, is: what should be the source of data for it? Many people are surprised that we can use the customer address to automate the creation of the daily run sheet. You don't actually need lat-long points or actual geospatial data, nor do you need to map the center manager's knowledge of the locality into the algorithm; it's just the customer address data that you need. We'll see how.

Now, this is the exact "how" of creating an automated daily run sheet, and these are the steps involved. The first is obviously getting the address.
The second is structuring the address; we will look at this step in detail as we go ahead. The third step is NER, named entity recognition; again, we'll discuss it. The fourth step is geocoding, and the fifth is clustering.

The first step is getting the customer address data. Address data can come in different forms. As you know, addresses in India are not structured: all across India, people have different ways of writing addresses, and there is no set format you can enforce, even if you are collecting addresses in the most efficient manner you can think of. The challenge becomes more complex, I would say, for a third-party logistics company, because then you are dealing with address data collected from different clients who each have different ways of collecting addresses. This challenge can be met by putting in a logic that brings all the customer address data from the different clients into one format, and that format should be just text. Wherever there are occurrences of a particular client's formatting, you remove them, and then you put all the addresses in one place; that becomes your corpus of customer address data.

The second step is address structuring, which is, I would say, the heart of it. If you get it right, you get the automated daily run sheet right. And let me tell you, none of the algorithms will work with the efficiency you expect if you don't have structured customer address data.

Why do we need to structure the address data? As I mentioned before, there is no consistent address format. There is no spell check applied when an address is being input; it is free-flowing text in most cases. And there are combined words, because the customer is typing and can write anything: they can run words together, and they can write things which are not meaningful. We had instances of addresses which said, in the address field, "you cannot find me, just call me on this number." You can have addresses like that, which carry no address information at all, and you can just leave them out. Then there are abbreviations: a few that are genuinely needed, and a few that are just the customer's way of writing things. So how do you take care of all of this? It is very much required if you are trying to make a machine learn something from unstructured data, that too address data.

Now, how do we do the address structuring? Knowing all the problems that exist, the first thing to talk about is creating a vocabulary. Before you even think of doing a spell check or separating the combined words, you need ground truth: some source of data which will help you do that spell check or separation. The way we attained it is a reverse geocoding approach: for a particular dispatch center, you create a grid of lat-longs, decide on the distance you want between them, and then reverse geocode them (there is a small sketch of this below). Now, Google has a particular way of storing addresses, which is mostly formatted: if it has the address, it will be formatted; if it does not, it will give you no address. How it is formatted is not necessarily the exact form I have shown here, but it is generally separated by commas. In your original address, there is no separator.
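To make the grid step concrete, here is a minimal sketch using the official googlemaps Python client. The bounding box, the grid spacing, and the API key are placeholder assumptions; the talk does not say which client library or spacing was actually used.

```python
import googlemaps
import numpy as np

# Hypothetical bounding box around one dispatch center's service area,
# with a grid spacing of ~0.001 degrees (~100 m); both are assumptions.
LAT_MIN, LAT_MAX = 12.925, 12.945
LNG_MIN, LNG_MAX = 77.610, 77.640
STEP = 0.001

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

def grid_points():
    """Yield a regular lat-long grid covering the dispatch center's area."""
    for lat in np.arange(LAT_MIN, LAT_MAX, STEP):
        for lng in np.arange(LNG_MIN, LNG_MAX, STEP):
            yield float(lat), float(lng)

def build_vocabulary():
    """Reverse geocode every grid point and collect the comma-separated
    segments (road, sublocality, locality, pin code, ...) as ground truth."""
    vocab = set()
    for lat, lng in grid_points():
        for result in gmaps.reverse_geocode((lat, lng)):
            # formatted_address is comma separated, e.g.
            # "4th Block, Koramangala, Bengaluru, Karnataka 560034, India"
            for segment in result["formatted_address"].split(","):
                vocab.add(segment.strip().lower())
    return vocab
```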
Trust me, there is no separator you can use universally across your addresses to separate the words. But when you look at a Google Maps reverse geocoded address, you see a comma-separated address, and it becomes very clear that the segment at the end is actually a pin code, the second-to-last is the country, then you have the city, and so on. So you have an idea of what these tags can possibly be, what each segment of the address is. From there, you can create a base for each of these tags, be it locality, sublocality, or roads, for that matter (in Bangalore, we have mains and crosses), and build over it. That becomes my source of truth: I know these are the sublocalities which exist for Koramangala. So suppose I am dealing with the Koramangala DC; I know that Koramangala first block, fourth block, fifth, sixth, and some layouts are the localities which will show up in my addresses. That becomes my ground truth.

I built a spell corrector over this data. When you are building a spell corrector here, not just any similarity algorithm will help you. You cannot go ahead and use NLTK's spell-correction utilities, because NLTK is not built for this purpose; it has not used address data for generalizing its library. In most cases, you cannot use a predefined library, so you have to build things from scratch, and they are not difficult. You can develop a similarity score for each word that occurs in your address against this base vocabulary, and then simply compare and replace; I'll show how you can do that. So what you are doing with this step is cleaning the address, in the sense of spell checking and splitting combined words.

The third step is when you get into named entity recognition, because it encompasses natural language processing, and you'll have to go through the pre-processing of the data first. So the first step was cleaning: getting rid of spelling mistakes and combined words. The second step is pre-processing the data as textual data, where you tokenize the words in the data. When I say data, it is each address, so you are tokenizing each cleaned address into the words it contains. Then you remove the stop words. In general, addresses do not have many stop words, because an address is not, and should not be, a flowing English sentence, but there are stop words of addresses themselves. People might write anything, so just to be extra cautious, remove all the English stop words, and also remove the address stop words you see which do not add any value to your algorithm.

The third important thing you have to do for named entity recognition is build a supervised training data set. What you have at this point, following the steps so far, is each address with all its words cleaned, separated, and pre-processed. Now comes the training. Why am I training it? What is the goal? The goal of the training is to identify the entities, and the entities here are tags. So the first step is to create the tags you want in your address data. What we did is we went ahead and found around 43 tags that define addresses.
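Before moving on to the tags, here is a minimal sketch of the spell-correction and combined-word handling just described. The talk does not name the similarity measure it used; this version leans on the standard library's difflib, with the vocabulary coming from the reverse geocoding step, and the cutoff value is an assumption.

```python
import difflib

def correct_word(word, vocab, cutoff=0.8):
    """Replace a possibly misspelled address token with its closest match
    from the reverse-geocoded vocabulary, if one is similar enough."""
    if word in vocab:
        return word
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def split_combined(token, vocab, min_len=3):
    """Try to split a combined token like '4thblock' by scanning for a
    split point where both halves are known vocabulary words."""
    for i in range(min_len, len(token) - min_len + 1):
        left, right = token[:i], token[i:]
        if left in vocab and right in vocab:
            return [left, right]
    return [token]

vocab = {"koramangala", "4th", "block", "layout", "bengaluru"}
print(correct_word("kormangala", vocab))  # -> 'koramangala'
print(split_combined("4thblock", vocab))  # -> ['4th', 'block']
```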
When I say tags, they are something like what you can see in the example here: there is "house number 36, wire strangle apartment, S.T. Bed Layout", and then the address continues. To show how the tags are formed: for "house number 36" I have given the tags FNS, FNC, FNC. FN is the tag, which is flat number; FNS is the start of the tag, and FNC means it is continued. So the moment I extract FN from this text, I get the flat number. Similarly, BNS is building name start and BNC is building name continued, and the moment I extract BN from this text, I get the building name. So these are the tags: FN, BN, sublocality, and so on. Based on the addresses you see, you tag them. This is actually a manual process, where you have to go through the addresses to understand, for each dispatch center, the ways in which the addresses are written. Does this dispatch center have many roads around it? Then it will have multiple road-level tags; maybe addresses there are identified by roads rather than by sublocality. So that depends, again, on the kind of data you are dealing with and the locality a particular dispatch center is in.

Once you have these tags finalized, you go ahead and train on the data; what we did was around 8,500 addresses for a particular dispatch center. And the training is not supposed to be heavily manual. Once you have decided on the tags, you can pre-tag them through a macro, basically, or write a function to do it: whenever you see "apartment", tag it as BNC. There are multiple things you can automate like this, and then you go through the automatically generated training set to verify it manually.

The next step is building the model. Now that you have a training data set which gives you tags for each word, what you want is a model that, given an address, gives you the tag for each word in the address. The model we tried was maximum entropy. What maximum entropy does is, if you have multiple pieces of information about a particular class, it helps you weigh them based on the different data points it sees and gives you the class accordingly. The maximum entropy model basically works on features and allows you to put your own features into its processing. So you can build features on data points such as n-grams. What are n-grams? Basically combinations of words or tokens. So suppose I build features on my address data set and say that whenever the last word of a trigram is "apartment", "APT", "APART", or whatever ways you want to put it, the tag which comes before it is BNC or BNS. That becomes one feature. You can also put in features like landmark indicators; we had one tag called LMI, which is landmark indicator. These become really important, because you can put in conditions like: if you see a word in a bigram which is "opposite", "near", "above", or "below", then say the tag is LMI. So these are the features which go into MaxEnt, and it learns from the data along with the features you have passed to it, and gives you the tag for each word of a given address. And the accuracy of this that we saw was 98%. So MaxEnt does a really good job of giving you the tags for the words of an address, but that's not it. That's not the thing we are building.
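As a sketch of how such a model can be set up: NLTK ships a maximum entropy classifier that accepts hand-built feature dictionaries like the ones described. The indicator word lists, the toy training pairs, and the tag names beyond FNS/FNC/BNS/BNC/LMI (SLS/SLC follow the same start/continued convention, but are my assumption) are illustrative, not the talk's actual data.

```python
from nltk.classify import MaxentClassifier

# Indicator lists; illustrative stand-ins, not the talk's exact lists.
BUILDING_WORDS = {"apartment", "apt", "apart", "residency", "towers"}
LANDMARK_WORDS = {"opposite", "near", "above", "below", "behind"}

def token_features(tokens, i):
    """Feature set for the i-th token of a cleaned, tokenized address."""
    word = tokens[i].lower()
    return {
        "word": word,
        "is_numeric": word.isdigit(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
        # n-gram style feature from the talk: if the word after me is an
        # apartment indicator, I am likely part of a building name.
        "next_is_building_word": i + 1 < len(tokens)
                                 and tokens[i + 1].lower() in BUILDING_WORDS,
        "is_landmark_indicator": word in LANDMARK_WORDS,
    }

# Tiny hand-tagged sample in the FNS/FNC/BNS/BNC scheme; the real
# training set was around 8,500 addresses per dispatch center.
tagged = [
    (["house", "number", "36", "sunshine", "apartment"],
     ["FNS", "FNC", "FNC", "BNS", "BNC"]),
    (["near", "koramangala", "4th", "block"],
     ["LMI", "SLS", "SLC", "SLC"]),
]

train_toks = [(token_features(toks, i), tag)
              for toks, tags in tagged
              for i, tag in enumerate(tags)]

model = MaxentClassifier.train(train_toks, algorithm="IIS",
                               trace=0, max_iter=20)

toks = ["flat", "12", "sunrise", "apartment"]
print([model.classify(token_features(toks, i)) for i in range(len(toks))])
```

In practice you would generate train_toks from the macro-tagged and manually verified addresses rather than hand-written pairs.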
That's just the first step towards what we are building. The next step is geocoding. Now that you have the tags for the addresses, how do I geocode them to get the exact point where the delivery boy has to go and deliver the shipment? In the geocoding step, what you do is take the tags for all the addresses and form a priority over the sequence of tags you see in each address. Because an address can be as lengthy as 45 or 50 words, not all the tags are important, and not all the tags can get you the exact pinpoint of where the FE should go. So you have to form a priority sequence of tags and then create a geocoding layer over it.

When I say a priority sequence of tags, what do I mean? Basically, give priority to all the tags which help you pinpoint the customer address. And what are those tags? Generally tags like building names or sets of building names, the buildings being tech parks or smaller organizations, and then apartment names. All of these have much higher priority than a sublocality or a locality, because when you geocode a locality or a sublocality, what you get is the centroid of that locality, which does not help the FE reach the customer; what you want is exact points. So all the apartments and buildings have higher importance. Definitive landmarks have higher importance than non-definitive landmarks. What are definitive landmarks? Landmark indicators with words like "above", "below", or "opposite" are definitive: they say the address is right there. There can also be landmark indicators like "near" or "around this many kilometers away", which are not definitive, and they get very low priority. That is how you define the priority of the tags.

And then you geocode. When you geocode, you have to put a logic in place for what you pass to the Google geocoding API: you pass combinations of the different tags you have extracted. You can pass the apartment name along with the locality, or the apartment name with the sublocality, or the apartment name with the road, and then check which one gives you the highest accuracy, because the accuracy of the lat-long you get back is very important. It should not be a kilometer off, with the FE just roaming around. (There is a small sketch of this combination logic below.)

Now, the next and final step is to cluster. At this point I have the addresses and the lat-long for each address: I have identified the tags, used the tags to geocode, and I have lat-longs with around 200-meter accuracy. But my DRS is still not automated. I want to automate the creation of the DRS, and how can I do that? For every dispatch center, early in the morning, you can run a job which does all of these things in the background, and once you have the lat-longs, you cluster them for that particular dispatch center. When you cluster, you get a few clusters, defined mostly, I would say, by the sublocalities that this particular DC deals with. If it is just Koramangala fourth block, each cluster can be one layout: S.T. Bed Layout, Chandra Reddy Layout, or the other layouts, because each FE is assigned to one of these sublocalities. What we used here is a modified K-means, because business logic has to be put into the clustering as well: what if one of the clusters has 15 shipments and another has 85?
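Here is the promised sketch of the geocoding layer: build candidate queries from high-priority tag combinations, geocode each one, and keep the candidate closest to a sanity anchor such as the dispatch center (the talk mentions comparing distances from the DC or a fixed point). The tag keys, the combination order, and the 2 km acceptance radius are assumptions for illustration.

```python
import googlemaps
from math import radians, sin, cos, asin, sqrt

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) points."""
    lat1, lng1, lat2, lng2 = map(radians, (*p, *q))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def best_geocode(tags, dc_latlng):
    """tags: extracted entities, e.g. {'BN': 'sunshine apartment',
    'SL': '4th block koramangala', 'RD': '5th main', 'CT': 'bengaluru'}
    (hypothetical keys). Try combinations in priority order, building
    name first, and keep the candidate closest to the dispatch center."""
    combos = [
        ("BN", "SL", "CT"),   # building + sublocality
        ("BN", "RD", "CT"),   # building + road
        ("SL", "CT"),         # sublocality fallback (centroid only)
    ]
    best, best_dist = None, float("inf")
    for combo in combos:
        parts = [tags[t] for t in combo if t in tags]
        if not parts:
            continue
        for result in gmaps.geocode(", ".join(parts)):
            loc = result["geometry"]["location"]
            cand = (loc["lat"], loc["lng"])
            d = haversine_km(cand, dc_latlng)
            if d < best_dist:
                best, best_dist = cand, d
        if best is not None and best_dist < 2.0:  # within 2 km of the DC
            break
    return best
```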
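And for the cluster-size question just raised: the talk doesn't spell out exactly how the K-means was modified, but one simple sketch of a size-constrained variant, assuming scikit-learn, is to run plain K-means with K set from that day's FE attendance and then dissolve clusters that fall below the per-FE shipment threshold described next.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_shipments(latlngs, n_fes, min_size=35):
    """latlngs: (n, 2) array of shipment lat-longs for one dispatch center.
    n_fes: FEs in attendance that day -> number of routes (K).
    min_size: minimum shipments per route, per the company norm."""
    X = np.asarray(latlngs)
    km = KMeans(n_clusters=n_fes, n_init=10, random_state=0).fit(X)
    labels = km.labels_.copy()

    # Business-logic pass: dissolve clusters below the threshold and move
    # their shipments to the nearest surviving cluster centroid.
    counts = np.bincount(labels, minlength=n_fes)
    small = np.where(counts < min_size)[0]
    big = np.where(counts >= min_size)[0]
    if len(small) and len(big):
        for i in np.where(np.isin(labels, small))[0]:
            d = np.linalg.norm(km.cluster_centers_[big] - X[i], axis=1)
            labels[i] = big[np.argmin(d)]
    return labels  # route number per shipment, printed on the DRS

# Usage sketch: 10 FEs in attendance today instead of the usual 15.
# routes = cluster_shipments(todays_latlngs, n_fes=10)
```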
So you have to put in a logic where you set a threshold of each cluster containing at least 35 shipments, or whatever your company's norm is for how much each FE should deliver per day. And you can also customize it: if on a particular day you see that FE attendance is going to be 10, when the usual is, say, 15, then you run K-means with K as 10 and you get 10 clusters. And in most cases the FEs know the locality well, so they can go to each of these clusters and deliver.

Now, how does it function? What this clustering has done is attach a cluster label to each of the addresses for that particular dispatch center on that particular day. When the center manager walks into the center in the morning, all he does is scan each shipment, and what he sees is the route number, or say the shipment pile number. He sees that this belongs to route zero, this to route one, this to route two, and he keeps piling the shipments based on whatever route number is displayed. And when the FEs walk in, it just takes 15 minutes. We reduced the time from three and a half hours to 15 minutes. The FEs don't have to wait while the center manager shuffles and sorts; they just pick up their pile and move out. We also automated the printing of the DRS: every route knows all the shipments that belong to it, and you just print the DRS and hand it over to the FEs. So that's all I have; this is how you can automate it. If you have any questions, I can take them. No questions? Oh, a couple of questions.

Q: What is the advantage of doing all the natural language processing before using a service like Google? You could just throw the address itself at Google geocoding and get the lat-long.

A: Correct, and we did that. It is the first thing that comes to mind when you are doing anything around geocoding or reverse geocoding. But trust me, it does not work, for multiple reasons. When you throw an address at Google for geocoding, it needs the address in a structured format. It needs you to give it the apartment name, or whatever the POI is, along with a sublocality or something it can map. If you pass it a random address, it will give you the lat-long of the centroid of whatever it can identify in the address. Sometimes it identifies the state and gives you the lat-long of the state's center; sometimes it identifies just the locality and gives you the centroid of the locality. But that is not what we wished for. What we wanted was the exact point where the FE needs to go, and we could not achieve that this way.

Q: Hi, thanks for the talk, it was very interesting. How do you ensure that the process you described for determining the lat-long is always within 200 meters?

A: Yeah, that's why I put it in the recommendation aspect. It's not always possible; it depends on a lot of things, I would tell you. It depends on the previous steps you have followed: how good your tags were in the first place when you fitted the MaxEnt model, and how good your MaxEnt is.
Given the tags identified by the MaxEnt, it depends on how good the priority of the tags is, the one I described, and how good the combinations of prioritized tags are that you pass to the Google API. As I said, the Google API works in a certain manner, and you have to exploit it in the best manner possible. That's why we were forming combinations of the important tags and passing them to the Google API. For one particular address, I would tell you, we were hitting the Google API 15 or 20 times, getting lat-longs for that one address, and then comparing, within that address, which result has the maximum accuracy by looking at the distance from the DC or from some fixed point. So it is a process you will have to fine-tune and curate on your own; there is no set way I can give you, but one way is what I have explained.

Q: Hi Divya, thanks for such insights. I was wondering, have you taken care of language input as well, before the NLP?

A: No, because most of the addresses coming from the clients were in English, so we did not take care of the language aspect. We did not run into it, I would say. Had it been there, then yes, one step ahead would have been conversion or translation of the languages.

Q: Hi. If you put in, say, 100 addresses, how long does it take to do the geocoding, from the address to the geocode? What percentage of those 100 do you have to manually correct? And do you have to keep hitting Google over a period of time, or can you shortcut Google after a while?

A: Okay, to answer your first question, how much time it takes: generally we were running it for each DC, which has a load of around 20K at a peak time of the season, and it took seven or eight minutes for the entire algorithm to run. So it is not much of a time-consuming process. Your second question was how many of them we have to manually intervene on. Once the entire process and flow are in place, you don't have to go and manually check them. But yes, we were collecting feedback: initially, when you roll it out, you collect feedback from the FEs on how well you have been able to identify the lat-longs. That process continued for fifteen days, two weeks or so, where we took the feedback and fed it into the system. The manual part comes only on the training side, where I showed you the supervised training happening, and even that you can automate to an extent. As for how many times you have to hit the Google API...

Q: Well, Google does not allow you to store the addresses or the details, so ideally you should not be storing them; you would hit the API every time you use it.

A: Correct, correct. But then you have used Google's entire resource to build it, so I would say it is the company's call whether they want to store it. We were not storing it; we were hitting the API every time, because it did not take much time, and we were paying for the API. But yeah, effectively you are building a database of your own with the support of Google.

Q: Hi.
So, once you have the geocode, wouldn't a vehicle routing problem be more useful for your use case than a clustering formulation?

A: Correct, yeah. I did not explain it here, but within each cluster we did vehicle routing as well. You don't necessarily need it, though, because of a phenomenon that is very common here: the FEs know the entire area, they have it at their fingertips. So you don't have to tell them "go here, then there, then there". But yes, in case you want to get rid of that manual human intelligence as well, you can just go ahead and route within each cluster, and that's what we did. I think routing within each cluster makes more sense than routing across the clusters.

Q: So how do you define these clusters? Are the areas already assigned to cluster A or B, or is it real time? For example, if I were to pick up from Koramangala, would it be picked up with the HSR pickups or with the Domino's pickups?

A: Okay. When you are doing this entire process, it happens for one particular dispatch center, and a dispatch center in logistics is basically the end point from where the delivery happens. So this will be one dispatch center only, say Koramangala, and it will deal only with shipments around Koramangala, around that center. You will never hit a case where you have a shipment from HSR coming into it. And how do you define the clusters? It is real time; it happens based on the shipments received that day. You might have more shipments for a particular sublocality on a given day, so it will be a richer cluster compared to the others.

Q: Okay, and how do we include the time slot part in this?

A: Yeah. For time slotting, we had to modify the clustering logic. If you are supporting time slots, where a few customers want delivery done within a particular window, then within each cluster you have to create subclusters, and then see whether that makes sense. If there is just one shipment that has to happen from three to five, you include it in your routing within the cluster and put it at the end. So it was a combination of both clustering and routing: in the routing logic, you have to take the time bucket into consideration, because in our case not all shipments had a time constraint on them.

Any more questions? I think we're out of time. Thank you, Divya. So, the next speaker is...