So, SMSs are a pretty rich source of personalized data in India, and they're applicable to a lot of use cases, especially in fintech and personal finance. If you're a data scientist or engineer interested in this kind of data, then hopefully this talk is interesting. But even if you never touch SMSs, if you're interested in complex problems where you're layering models on top of each other, hopefully this is interesting to you too.

I want two broad takeaways to come out through the course of the talk. The first is an illustration of the simple but important concept of breaking a problem into pieces and solving the pieces. I like the way the physicist Max Tegmark puts it: if you have a tough question that you can't answer, first tackle a simpler question that you can't answer.

The second takeaway has to do with the architecture and design of these systems. When you're applying machine learning to a problem, at least in my experience, you almost never fully understand what you're building when you start, which means you'll need to do things 25%, 50%, and 75% of the way into the project that you didn't realize you'd have to do at the beginning. Unfortunately, you also face path dependency: early choices constrain later choices. So a key quality to design for in the early stages is extensibility. In my experience, a lot of engineers at startups make a big issue of scalability while forgetting about extensibility, and I think that's a shame, because we can only hope to face the scalability problem, the massive numbers. We will almost surely face the challenge of extensibility: how do we extend the system to do things we didn't know we would have to do?

A little bit about me, to give you some context for how I think about this talk. My professional experience has oscillated between two key themes. One is statistical inference: how do you actually ask questions about causality from data in order to inform decision making? The other is engineering and software, because you can develop really complex and interesting models that help you apply causal inference, but until they go into production, until you can deal with all the dirty problems of getting data from one place to another, or making a decision and then recording what happened and whether it was really the right one, it stays pretty academic. In my graduate work in university and in government I was thinking more about the statistical inference, and as I've moved into the private sector it's become more about how you engineer the system so that it actually works.
Most recently I co-founded a startup called PaySense, a mobile lending startup here in India, based in Mumbai, and that's where this problem set came from for me. This talk represents some of the work we did there and some of the work I've continued to do afterwards. Now I'm a data scientist in residence at a venture capital firm called Mountain Ventures. A quick plug: one of the things I like to think about when investing is really understanding the product and technology being built, which a lot of entrepreneurs care about and which you miss out on if you're not from that space. We're really into the problems that engineers and entrepreneurs think about: not just "is it a good business?", but "are you actually building a technology that's interesting and widely applicable?" If that kind of thing interests you, come talk to me.

Okay, so this is what we're here to talk about today. If you have a phone in India, you know what these look like; you have them on your phone. SMS is a worldwide technology, but it plays out a little differently in India than in Singapore, the US, and other places I've lived and worked. Because India requires two-factor authentication for banking and other financial transactions that happen online, approval has to involve the phone, so we have things like OTPs. A side effect is that, since the phone is such an important part of everyday life anyway, a lot of the services we interact with as consumers have some element of the phone in them. You'll get notifications from Uber and Amazon as well as from your bank, and so, as I said at the beginning, it becomes a very rich source of data.

The technical problem, ultimately, when you think about a use case like giving out a loan, is that what you care about is how much this person spends and what their typical bank balance looks like. But the raw form of the data is an SMS that looks like this. The machine doesn't know that you spent 4,000 on the credit card; it just sees raw text. So the statistical, machine-learning problem we're going to talk about today is moving from this to this in an automated way, where we can say every message is composed of a template structure plus variable information. This isn't your typical NLP problem, not what people usually mean when they talk about NLP. In a lot of ways it's not as hard as something like Twitter, where there's no structure at all and anyone can say whatever they want. There's latent structure here; it's just the variable amounts that change. We're all receiving the same message. But it's not so simple either, because there's a lot of variation in the templates themselves. That's the general problem we're looking at. And why does it matter?
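To make the template-plus-variables framing concrete, here's a toy sketch in Python. The message, the pattern, and the field names are all invented for illustration; the point is just that if you already knew the template, extraction would be a single regex:

```python
import re

# A made-up bank SMS: the fixed wording is the "template", the amounts,
# account digits, and date are the "variables".
sms = "INR 4,000.00 spent on your card ending 1234 on 05-03-18. Avl bal: INR 49.46"

# If the template were known in advance, one regex recovers the variables.
pattern = re.compile(
    r"INR (?P<debit>[\d,]+\.\d{2}) spent on your card ending (?P<card>\d{4}) "
    r"on (?P<date>[\d-]+)\. Avl bal: INR (?P<balance>[\d,.]+)"
)

match = pattern.search(sms)
print(match.groupdict())
# {'debit': '4,000.00', 'card': '1234', 'date': '05-03-18', 'balance': '49.46'}
```

The whole difficulty, as we'll see, is that you don't know the templates in advance, and there are far too many of them to write by hand.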
Maybe you do care about the use cases and not just the technical issues. This is a FinTech market map of India, and I'd say most of these companies are at least asking permission (which you may not read when you click OK and install the app) to access your SMSs, and trying to use them in different ways. This isn't a cutting-edge problem, but I don't think the solutions have been done especially well either. I often hear people say, for instance, that this is a commoditized problem. That's a bit like saying in 2000 that search was commoditized: there were definitely search systems around, but the extent of their development and evolution wasn't there yet. In personal finance, for example, we have companies like Walnut, and they'll often get things wrong. Say you have two bank accounts and you move money between them, so you receive a notice of a debit or withdrawal on one account and a credit on the other. Recognizing that the money is actually moving within your own accounts, and isn't a net outflow plus an inflow, is something they frequently miss.

Okay, so before we jump into the system, I want to point something out. I've talked about this problem before, and a lot of times people come up and ask how they can access this kind of data, because it's interesting. While you won't necessarily be dealing with the problems at scale as an independent person, this is not a problem you can't grab hold of and try your hand at just because you don't have an app out there collecting other people's data. Let me walk through a couple of ways to get started.

Say you have an iPhone, which is the kind of phone I have. If you take an image backup of the phone, you'll find a very obscurely labeled file which actually turns out to be a SQLite database, and that's where your messages are stored. If you have an Android, it's a fairly similar process; I haven't tried it many times because I don't have an Android phone, but it's worked once. That'll give you your messages, and maybe the messages of family or friends who are willing to help seed your dataset.

There are other ways to get even more variety into your sample. There are a lot of bulk SMS service providers out there who publish templates and message structures for apps that want to send bulk SMS. You can Google those and pull them down, and all of a sudden you've got a whole new set of templates.

And don't stop there, because that still won't give you enough variety. The next step is simulation. As a side note, I think simulation is useful even when you do have the data, because creating this kind of data forces you to really understand the data generation process: where does it actually come from? So what we're doing here is a basic Python script using regexes and strings. We have a template across the top, a message that happens to be a credit message. And the way it's phrased in English is pretty arbitrary: you could compose it any number of ways, and in fact banks do compose it in different ways. So we create some artificial variation that's syntactically different but not meaningfully different, with a script that cycles through messages like this, randomizing and varying the phrasing. From one template you get many: instead of "has been deposited to", it can say "credited to", or "deposited into", or "deposited in", or "credited in". A minimal version looks something like the sketch below.
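Here's a minimal sketch of that kind of simulator. The template fragments and value ranges are invented, but the structure mirrors what was just described: hold the message structure fixed, randomize the syntactically-different-but-equivalent phrasing and the variable fields.

```python
import random

# Invented phrasings that all mean the same thing to a human reader.
CREDIT_VERBS = [
    "has been deposited to", "credited to", "deposited into",
    "deposited in", "credited in",
]

def fake_credit_sms():
    # Randomize both the variable fields and the arbitrary phrasing.
    amount = f"{random.uniform(100, 50000):,.2f}"
    account = f"XX{random.randint(1000, 9999)}"
    verb = random.choice(CREDIT_VERBS)
    return f"INR {amount} {verb} your A/c {account}."

for _ in range(3):
    print(fake_credit_sms())
```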
As humans, we don't care about that variation, but it's exactly the variation our machines have to learn to see through. And we do need the variation, because the variation ends up being the problem. If we only had five templates for a bank, or even 35 templates covering all its different message types, you could just manually look at them, write some regex, and brute-force it. But that's not the case we're in. In India, just doing some basic research, you'll see we have hundreds of banks of different types, plus all the service providers that also record and SMS you about your transactions and other things that might be applicable to the use case you have as a business or an application. Because they're all creating their own templates, their own message structures and entity types, that's where the variation comes from, and that's why brute force, just manually labeling these messages, isn't going to work very well.

Okay, so now we've set up that context, and we can think about how to get to the information we ultimately care about. From here on out, let's set aside the other types of SMSs and focus on banks, because those are really interesting: that's financial data.

Actually, let me pause for a second before we get into the statistical side, because there's another concern this makes you aware of: there's a lot of personally identifiable information in here. SMSs can contain all kinds of private, sensitive information, from your financial transactions, if you're sensitive about those, to matrimonial sites like Shaadi.com and other services that are very personal. One of the things you should also think about when designing these systems, and we had a panel yesterday about ethics and accountability, which for me is very important, is how to design them so you minimize the information you don't want to capture. A starting point, the baseline: every sender of bulk SMSs is required to be registered with the government and gets a six-character sender ID, which is different from the phone number you see when SMSing personally among friends and family. So when you set up your SDK to capture this data, you can exclude at least the personal SMSs and only take messages coming from a six-character sender ID, which marks a transactional SMS. A rough version of that filter is sketched below.

Okay, so we have those SMSs, and let's say we're taking the lending case. We're really interested in these particular pieces of information: 300 rupees was debited, and the balance is 49.46. That's our end goal, but it's just the highest level. A debit is a basic transaction. A credit is a basic transaction.
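As referenced above, here's a rough sketch of that sender filter. The exact header format varies by carrier (headers often arrive with an operator/circle prefix like "VM-HDFCBK"), so treat the patterns here as assumptions, not a spec:

```python
import re

# Heuristic: registered bulk senders use a six-character alphanumeric header,
# possibly with a two-letter carrier prefix; person-to-person SMSs come from
# phone numbers. This is an assumption about the format, not a specification.
SENDER_ID = re.compile(r"^(?:[A-Z]{2}-)?[A-Z0-9]{6}$")

def is_transactional(sender: str) -> bool:
    return bool(SENDER_ID.match(sender.upper()))

print(is_transactional("VM-HDFCBK"))      # True  -> keep
print(is_transactional("HDFCBK"))         # True  -> keep
print(is_transactional("+919812345678"))  # False -> personal, exclude
```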
As you go in, you start to realize there are a lot of specifics in your ontology, where you really have to think about your domain. If you're lending money, you care, for instance, about whether a person makes their payments on time. So there might be a due notice, but you don't want to classify a due notice as a debit, right? It hasn't happened yet. You have late fees, bounced cheques, double-bounced cheques. They all mean something different, and you can't just capture a list of amounts; you need all of the structure and context around each amount for it to be meaningful. That's what we want.

So how do we go from raw text down to that level? What's our handle? Do we just build one model that classifies a message as containing a debit, with 300 as the debit amount? This is where the design comes in: you start to think through what you actually want out of it, what layers there will be, and how to structure the models.

A first step: we said we're interested in banks, but the machine doesn't know which messages are from banks. So as a start, let's classify the messages that come from banks. You'd think that's pretty straightforward, because there's that latent structure in bank messages and a strong commonality of topics. Bank messages tend to have financial terminology, account strings, and usually a date string, because a transaction has a timestamp associated with it. Those things could help us classify whether a message is from a bank. But if we do that, we're leaving a lot of information on the table. We don't actually need to look at a single message and decide whether it's from a bank. Why? Because the sender is the bank, and the sender sends a lot of other messages. If we jump right in and classify at the level of the message, we're ignoring all the other messages that sender has sent. So instead, we can aggregate all the messages coming from a particular sender and classify the sender. Now we know that HDFCBK means bank.

Then we say: okay, now we have our bank messages. Before we can really get to the 300 rupees, we need to know its context. At a high level, we have debit messages and credit messages, and then, say, overdue messages. And again, we now have a more limited set of text variability, because we know these are bank messages and each tends to be one of these types, so we're hoping there's structure we can identify to classify this message as a debit and that one as a credit. That's really important, because if you get it wrong, you have money coming in when it should be going out, or vice versa, and that's going to mess up your risk models.
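Before we go further, here's a toy sketch of that "classify the sender, not the message" step from a moment ago. The senders, messages, and labels are invented, and the model choices (TF-IDF plus logistic regression) are my assumptions for illustration, not necessarily what a production system would use:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data: every message a sender has sent, pooled per sender.
train = {
    "HDFCBK": ["A/c XX1234 debited for INR 300.00", "Avl bal is INR 49.46"],
    "UBERIN": ["Your trip receipt is ready", "Your driver is arriving now"],
}
labels = {"HDFCBK": "bank", "UBERIN": "other"}

senders = list(train)
docs = [" ".join(train[s]) for s in senders]   # one document per sender

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, [labels[s] for s in senders])

# A new, unknown sender: pool whatever it has sent and classify the sender
# once; every past and future message from it then inherits the label.
new_doc = " ".join(["A/c XX9999 credited with INR 5,000.00"])
print(clf.predict(vec.transform([new_doc])))   # e.g. ['bank'] (toy data!)
```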
Okay, now say we're interested in knowing and tracking a person's bank balance over time. We could model where the bank balance sits in a message, but it doesn't always show up. Here's Citibank, and here are two types of messages from Citibank, both debit messages: one contains balance information and one doesn't. We don't want to find ourselves looking for the bank balance in a message that doesn't have one; the model might end up confusing the account number, the balance, and the debit amount. So we also need to classify which entity types we're looking for in a message.

Let's think about how we'd do that, because now we're not just classifying a whole unit. For "bank" it was a full set of messages; for "debit message" you're classifying the full message, and you don't need to know anything about positionality within it. Now we know this message has the balance amount in it, but how do we know where it is? One way: we know a balance amount will be a currency amount. So we can write rules that isolate numbers of the right shape, say, numbers containing a decimal point that aren't attached to alphabetic characters or dashes or anything like that. We can do that with a simple parsing step. Once we've done that, we still need to know which of those numbers represents which entity. One way is a sub-segment approach: for each number, at token position n, you take a window from n minus three to n plus three, and you classify that sub-segment. If you look at these two sub-segments, you can see that a model will be able to learn that the 12,040 is the debit and the 1,30,000 is the balance.

So each one of these is a model, and we're layering them: we don't try to solve the final objective until we've solved the earlier, easier problems. And we have our pipeline: we receive SMSs, send them into each classifier, figure out when to store, when to send onward, and when to classify, and ultimately our structured financial data goes into our database.

The system might look like this at one step. I'll read it, because unfortunately the resolution isn't good enough for you to read it yourself; in fact, I have more granular versions of this coming, so I'll wait for those. But what I want to point out is that in this system, even after you've broken the problem down, it's important to understand the human-machine trade-off. Humans will always be able to do highly sensitive analysis; we recognize variation and difference very easily, and machines don't do it as well. So have the machine solve the problem, but try not to rely on it completely; figure out how to combine the two. We receive a message, and, say we're classifying at the bank level, we compare the sender to a list of known senders and determine its identity. If it's known, we assign the category; if it's unknown, we send it to the model. Great.
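Here's a sketch of the number-isolation and sub-segment windowing described a moment ago. The message and the currency regex are invented for illustration:

```python
import re

# (1) isolate currency-like numbers, (2) cut a +/- 3-token window around
# each one; a downstream model would classify each window as debit,
# balance, and so on.
msg = ("INR 12,040.00 debited from A/c XX1234 on 05-03-18. "
       "Available balance is INR 1,30,000.50")

tokens = msg.split()
CURRENCY = re.compile(r"^[\d,]+\.\d{2}$")   # digits and commas, ending .NN

for i, tok in enumerate(tokens):
    if CURRENCY.match(tok):
        window = tokens[max(0, i - 3): i + 4]   # n-3 .. n+3
        print(tok, "->", " ".join(window))

# 12,040.00 -> INR 12,040.00 debited from A/c
# 1,30,000.50 -> balance is INR 1,30,000.50
```

Note that the date "05-03-18." and the account string "XX1234" fail the currency pattern, which is exactly the point of the shape-based rule.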
Now, say we receive a message from an unknown sender. Then we'll run some transformations on the data itself: we'll anonymize strings containing numeric values, we'll hash some n-grams (more on both of those in a bit), and we'll predict the probability of each sender category. Then we determine whether the predicted category is clear or ambiguous. If it's clear, we go ahead and assign it; if it's ambiguous, we send it to the user. What I mean by "user" in this case: if you're building this system as a lender, you really need to understand the data you're doing underwriting on top of, so it matters that you get it right. So what we did was build a visual interface for the five or ten percent of cases where we're unclear. It's very easy to tell the difference between a bank message and, say, an Uber or Ola type message, but there are a lot of NBFCs out there that look very similar, and that's where we get our false positives. The interface let our customer service team and other teams sit there as messages came through, and when we weren't getting the classification right, or weren't very confident about it, the message would go to them; they could say "we missed this one, it's wrong," relabel it, and that correction would be transformed through regex and go back in to correct the model.

That loop was actually very important for us, and it's where a lot of the change over time happened: we realized we had to model things we didn't think we'd have to model. Our risk team would say: wait, this is a late notice, but it's a third late notice. It's not the same as the first late notice: it's a late notice plus the amount has increased, so you need to classify it separately; you can't just count that someone received three late notices. And that matters, because some banks will send you late notices over and over and over again (they're just excited to get their money back) and some banks only do it occasionally. That frequency shouldn't be related to a person's propensity to repay, so you don't want that noise confusing your signal.

Then there are a lot of details of the system as you're building it out. A lot of the time, in the earlier stages, what we're classifying is the template, not the entity. And when you're doing that, remember you have the template and the variables, and you're really looking to exclude the variables. What we're doing is basically vectorizing text, and if the variables stay in your data, you'll have a much wider dataset with a long tail: lots of messages differing only in a debit amount that can be any number. The difference between 300 and 100 rupees is meaningless for your model, so you don't want it in there. There are different ways to do that, depending on your throughput and where you want to do your computation.
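Whichever stage you do it at, the masking itself can be as simple as a few regex substitutions. The patterns and mask tokens below are assumptions for illustration, not the exact ones we used:

```python
import re

# Strip out the variable parts so the model sees only the template.
AMOUNT = re.compile(r"(?:INR|Rs\.?)\s*[\d,]+(?:\.\d+)?", re.IGNORECASE)
DATE = re.compile(r"\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b")
ACCOUNT = re.compile(r"\b[Xx*]+\d+\b")

def mask(sms: str) -> str:
    sms = AMOUNT.sub("<AMT>", sms)    # 300 vs 100 rupees: meaningless variation
    sms = DATE.sub("<DATE>", sms)     # March vs April: also meaningless
    sms = ACCOUNT.sub("<ACCT>", sms)  # masked account numbers
    return sms

print(mask("INR 300.00 debited from A/c XX1234 on 05-03-18"))
# <AMT> debited from A/c <ACCT> on <DATE>
```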
One way, for instance: say your messages are being written into Postgres or some other relational database. When you actually pull the data (say you've received enough messages from a new sender to run it through the model), you can clean it as you query it. That works okay if you're just taking out numbers. Then you might get more complex: you want more flexibility in how you clean and transform the text before sending it to the model, so you do it in a script. For instance, we might want to clean dates. We write some regex and think through some logic to identify date strings, because dates, like amounts, shouldn't matter: it makes no difference for identifying a debit message whether it's March or April.

Then we have other problems, like size. If you're lending, and people are receiving 20, 30, 40, 100 messages a day or every couple of days, and you've got tens of thousands of users, you're quickly accumulating a lot of data. There are different ways to handle that.

Actually, before I talk about size, let me talk about pipelines. We've been talking about pipelines in an abstract sense over the course of this talk, but there are also pipelines in a more particular sense: scikit-learn has a concept called Pipelines. What that does is let you package all the transformations, all the cleaning you do before you send the message through the model, in a very clean and packaged way. Basically, you specify your vectorizer with its settings, maybe some missing-data imputation or other transformations, and you pass all of that to a Pipeline. Subsequently you can just call that one classifier object, with everything nicely packaged. That ends up saving you a lot of effort when you've got different text transformation methods being applied at different stages of the model: they're all cleanly in one place for your engineers and data scientists. When you're looking back at code, you may not have written it, and you certainly won't understand it even if you wrote it a couple of months ago. So when it's clean like this, I think it's really a good way to do things. And I want to give a talk plug: there was a talk at PyData Chicago last year where I talked about this, and I think it's a pretty good concept, especially when you're layering a lot of models on top of each other.

So then we can talk about size as well. Say we have a couple of million messages, which is a little more than we want to deal with while testing our model. You can read it in chunks, put it into an iterator, and use scikit-learn's HashingVectorizer. That lets you do basically all the same things you would with an ordinary vectorizer, but with a partial fit: you cycle through your iterator, apply your transformations, and call the classifier's partial fit, updating the model on each segment of the data without needing the entire vocabulary.
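To make those two ideas concrete: first, the Pipeline packaging described above. This is a toy sketch with invented, already-masked training data; the transformer and classifier choices are assumptions, not a prescription:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented training set of already-masked messages.
train_messages = [
    "<AMT> debited from A/c <ACCT> on <DATE>",
    "<AMT> credited to A/c <ACCT> on <DATE>",
]
train_labels = ["debit", "credit"]

# Vectorizer and classifier packaged as one object: every transformation
# lives in one place, and callers just fit / predict on raw text.
pipe = Pipeline([
    ("vectorize", TfidfVectorizer(ngram_range=(1, 2))),
    ("classify", LogisticRegression()),
])
pipe.fit(train_messages, train_labels)
print(pipe.predict(["<AMT> debited from A/c <ACCT>"]))   # ['debit'] on toy data
```

And second, the chunked, out-of-core case. HashingVectorizer is stateless (there's no vocabulary to fit), so each chunk can be transformed independently and fed to an incremental learner. The `read_in_chunks` helper here is a hypothetical stand-in for a real chunked reader:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(ngram_range=(1, 2), n_features=2**18)
clf = SGDClassifier(loss="log_loss")           # gives class probabilities
classes = ["debit", "credit", "due_notice"]    # assumed label set

def read_in_chunks():
    # Stand-in for a real chunked reader (e.g. pandas.read_sql with a
    # chunksize); yields (texts, labels) batches.
    yield (["<AMT> debited from A/c <ACCT>"], ["debit"])
    yield (["<AMT> credited to A/c <ACCT>"], ["credit"])

for chunk_texts, chunk_labels in read_in_chunks():
    X = vec.transform(chunk_texts)             # stateless: no fit, no vocabulary
    clf.partial_fit(X, chunk_labels, classes=classes)
```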
And you have to make a lot of choices when you're doing this kind of thing. The hashing vectorizer, for instance, is great because it's low-memory and scales to large datasets; like I said, it doesn't need to store the whole vocabulary dictionary in memory, it's fast, and so on. But where before I was using TF-IDF, a particular kind of vectorizing, you can't do that here, because it doesn't store a vocabulary. What that often means is that if you're trying to understand what the model is doing, which features mattered, you won't be able to with the hashing vectorizer. Fortunately, in this case we don't care: we don't need to account for why a message was classified a particular way. As long as the model is accurate, we don't really need to know which features caused this to be a due notice and that to be a late notice.

So there are a lot of these micro choices, and over time you get quite a large, complex system. If you've taken care at the level of each piece, then when you go back to change something, it's much easier to go in, make your change to that piece, update the way it works, and expand the amount of information you can capture, without taking down the entire system. And so here's the whole thing: we receive a message, we classify the sender, we send it to the model, we do the vectorizing, we repeat for messages at each level of the model, and we route ambiguous cases to the user so they can clarify them. That's the kind of system we get.

I wanted to end a little bit early, and I think I've succeeded, so we have room for questions. That's it.