So, yeah, as the introduction said, this is my first opportunity to share my Python story with you, so I'm very happy to be here. I lead a small team of data scientists at Credit Suisse, four, soon to be five people, located in Zurich, and although we're in Switzerland, we have the mandate to deliver advanced analytics solutions globally to all of private banking and wealth management. When I say advanced analytics in banking, I'm talking about use cases such as customer analytics and product recommendation. What you see here in blue is what we typically work on; what you see in gray, so investment and trading, customer support, chatbots, fraud detection and so on, is typically done by other departments in the bank. We're more focused on the clients and the products. Those two combined use cases, customer analytics and product recommendation, we call client book planning. And when I talk about clients, that could mean either private clients, so people like you and me, or our corporate clients, small to medium enterprises or large institutions. And there you can see the range of products that we would recommend to clients.

So I joined Credit Suisse in 2015, during a long, protracted period of cost cutting. There was also a lot of increased regulation on banks after the financial crisis, and it was important for the business to rationalize in this new era. What this meant was that relationship managers, the client advisors who look after clients, had more work to do because the coverage ratios had decreased, but at the same time margins were also decreasing, so the business was under pressure to generate more sales. And that meant that relationship managers who were already struggling under the additional workload were being bombarded with sales leads from multiple sources trying to drive revenues in this low-margin era.

So that's the situation. Where were all these sales leads coming from? Well, back then we had what were called expert selections. These are people responsible for products who say, for this new investment product we're only going to focus on clients who have a million or more. So unfortunately the poor guy with only 999,999 francs, poor in a relative sense, will not be offered this product. And that seems somewhat arbitrary, right? It's better to learn from the data what's actually driving people to buy certain products. There were some initiatives in different teams to try to use the data to make these selections better, but they were very fragmented: we had pure SQL, we even had SPSS Modeler, if anyone remembers that, and Perl. So a lot of these campaigns and sales initiatives were not well targeted, and the methods to deliver them were not well standardized.

But that wasn't the only problem. The biggest problem was how all of this information was delivered to the front. Yes, Excel spreadsheets everywhere. I'm sure a lot of you are familiar with this problem, and the problem with building data-driven client relationship management in banking on Excel is that Excel is a very poor delivery channel. It's kind of a one-way street: you send the spreadsheets out, they often disappear, people can modify them, and you have no control over what actually gets to the client in the end. And you have no tracking or feedback for the process, so you have no idea what's actually happening out there; you can't even measure the quality.
So in 2015 when I joined, I thought: I think we can do better than this. I started out by architecting a process to do this data-driven relationship management, called client planning. This is the high-level architecture. I should also add that we're not in IT, so we have no IT budget; we have the tools that are available to us and what we can onboard ourselves. And that's where Python comes in, because Python is open source and it's great. The architecture that I designed starts with the data; it's a data-driven process. On the left, some nice data extracts that you need for training models. You produce some models, and then you use those models to try to predict which customers might have a need for a certain product. Once you have this information, you can select leads, which are delivered downstream to sales channels. And if you want to be totally data-driven, you then have to measure what's happening in those channels and feed that back into the process. So I'm going to break this overall process down into its components.

I'll start with the data universe. You've probably heard that about 80% of a data scientist's time is spent processing and understanding data, so a lot of this talk is going to focus on that end of the pipeline. When I came in to Credit Suisse, the team was just me, so I had to set about understanding this huge field of data. There were hundreds of databases with hundreds of tables, fed from many different applications and systems all around the bank, and there was little to no documentation available, no user manual that would help me quickly get up to speed on all this data. There was some knowledge in the department and so on, but really, to be efficient at this, I would need some kind of specialized tool. And that tool would be Django.

Now, Django is a web framework; how is that going to help you understand your data universe? Well, it provides a nice front end, and if you scrape the metadata from your data warehouse, you can construct a nice data model that you can query and interact with. So this is a small application that I built called the Data Warehouse Browser, and I'll take you through a small example. Say I want to build a model to understand why clients buy an investment mandate, which is a product where clients delegate control of their investments to the bank, and I need to find some information on that topic. I do a search and find it's in this ACT database. I click that and get a list of all the tables; this is all coming from the metadata scraped from the data warehouse. I can even search amongst all those tables: I start typing "man" and see that the first table tells me something about mandates. I click that; it looks like the table I need. Then I can click each of these columns and get a description. I'll click just the status indicator, because I can't show you any sensitive data, and there you see a description of what this data is about. It's also a very common column: if I click on synonyms, Django queries all of the tables that have this column and returns a list showing that it appears in a lot of databases. You can use this information to see which other tables have these columns, which gives you a clue about how they're related and how you can join them, and it also helps you build your queries on these data sets.
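To give you an idea of what's underneath, a metadata browser like this could be modeled along roughly these lines in Django. I should stress this is only a minimal sketch: the model names, fields, and the synonyms helper are invented for illustration and are not the actual Data Warehouse Browser schema.

```python
# Illustrative Django models for a scraped data-warehouse catalogue.
# All names here are hypothetical, not the real application's schema.
from django.db import models

class Database(models.Model):
    name = models.CharField(max_length=64, unique=True)

class Table(models.Model):
    database = models.ForeignKey(Database, on_delete=models.CASCADE,
                                 related_name="tables")
    name = models.CharField(max_length=128)
    description = models.TextField(blank=True)

class Column(models.Model):
    table = models.ForeignKey(Table, on_delete=models.CASCADE,
                              related_name="columns")
    name = models.CharField(max_length=128)
    description = models.TextField(blank=True)

def synonyms(column_name):
    """All tables that also carry a column of this name: a hint at
    possible join keys across databases."""
    return (Table.objects
            .filter(columns__name=column_name)
            .select_related("database")
            .distinct())
```

A scraper would populate these models from the warehouse's metadata tables, and the search and detail pages on top are ordinary Django list and detail views.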
So now that I understand my data universe and have some tools to help me get started, I want to build some useful feature extracts; that's the feature engineering process. How does Python help me there? Well, there I can use SQLAlchemy, which, as most of you probably know, is an open-source SQL toolkit and object-relational mapper that lets you work with SQL data through a Python interface. If you're a Python developer and you're given the job of working with a lot of SQL data, you probably don't want to start writing huge SQL queries; it's a bit tedious, which I'll show you in a minute. Instead, you can just connect to your data warehouse like so; it's very simple, a few lines, and there's a rough sketch of this pattern at the end of this part. You create your engine pointing at the database, you say which schema you're interested in, and you create a metadata object. Then it's just a matter of creating a table instance, here called depot mandates, querying it for the records you need, and plugging the result into a pandas DataFrame constructor. Now you have your data.

Like I said, it's very tedious to write SQL queries by hand. Here's an example of a SQL query you would have to write to get data from our MIS database, the management information system. It's heavily coded: you have to remember all these individual codes, which only the most experienced people around can remember. If you want to get data for a project, say around FX turnover or equities turnover and so on, you would either have to write a whole range of these queries, or you can use SQLAlchemy to abstract away all this complexity, parameterize the queries, and programmatically build up the logic that makes up each query. So on the right you have the nicer situation where, for your whole project, you set up some common parameters; in this case, a particular area of the business I'm interested in, over a certain time period, with a lot of these features aggregated with a rolling sum. Then it's just a matter of substituting in the data type you're looking for, in this case FX turnover, which is encoded in a nicely named variable in our library that everyone can remember, instead of all those archaic codes.

So that's the structured data; we can now build some nice features from our data warehouse. We have other types of data that are very interesting for modeling customer analytics and product recommendation. One is the spending and income profiles of our clients; this is often called personal financial management. Credit Suisse didn't have one of these products available to its clients when I joined, and it still doesn't, so at the time I decided to build my own light version internally. To do that I used Luigi, also sketched below, which helped me process up to a million transactions per day: extracting them from the data warehouse, doing the classification in a distributed way, and aggregating the data again at the end to create features that could then go into a model.

Once we had all of the structured data and transactions wrapped up, we moved on to unstructured data, such as log files and text. Here is an example of how we extract useful features from text, in this case a corpus of client notes. Client notes are records of interactions between the client and the bank; an interaction could be initiated by the client or by the bank.
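As a rough illustration of the SQLAlchemy reflection pattern I described: a minimal sketch where the connection string, schema, table, and column names are placeholders rather than our real ones, and the parameterized helper at the end is hypothetical.

```python
import pandas as pd
import sqlalchemy as sa

# Point the engine at the warehouse and pick the schema of interest
# (placeholder connection string and names).
engine = sa.create_engine("oracle+oracledb://user:pass@dwh")
metadata = sa.MetaData(schema="ACT")

# Reflect the table definition straight from the warehouse metadata.
depot_mandates = sa.Table("DEPOT_MANDATES", metadata, autoload_with=engine)

# Query for the records you need and plug the result into pandas.
query = sa.select(depot_mandates).where(
    depot_mandates.c.STATUS_INDICATOR == "ACTIVE")
with engine.connect() as conn:
    df = pd.read_sql(query, conn)

# The same idea lets you parameterize queries instead of memorizing
# archaic codes; the type code would be a named constant in a library,
# e.g. TURNOVER_FX.
def turnover(table, type_code, start, end):
    return (sa.select(table)
            .where(table.c.TYPE_CODE == type_code)
            .where(table.c.BOOKING_DATE.between(start, end)))
```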
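And for the Luigi part, the extract-classify-aggregate chain could look something like this. Again a sketch only: the task names, file paths, and the extract_from_warehouse, classify, and aggregate helpers are all hypothetical, purely to show the dependency structure.

```python
import luigi

class ExtractTransactions(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw/{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write(extract_from_warehouse(self.date))  # hypothetical helper

class ClassifyTransactions(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractTransactions(self.date)

    def output(self):
        return luigi.LocalTarget(f"data/classified/{self.date}.csv")

    def run(self):
        # Classify each transaction, e.g. into a spending category.
        with self.input().open() as src, self.output().open("w") as dst:
            for line in src:
                dst.write(classify(line))  # hypothetical helper

class AggregateFeatures(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ClassifyTransactions(self.date)

    def output(self):
        return luigi.LocalTarget(f"data/features/{self.date}.csv")

    def run(self):
        # Roll the classified transactions up into per-client features.
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(aggregate(src.read()))  # hypothetical helper
```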
Within that client-notes data set, there's a lot of information describing which products the clients are interested in, whether they have complaints, their level of satisfaction, et cetera. But this data set is labeled with only seven possible topics, and the world is a bit more complex than that: there are more than seven types of interactions between clients and the bank. And of course, since they're hand-labeled, the labels can be unreliable. One of those labels is even called "general", which is not very helpful. So you can use techniques such as topic modeling, which is built into scikit-learn, to extract topics from within this corpus; I'll show a rough sketch of this in a moment. One interesting thing we do with our German-language client notes is substitute non-informative words, such as a client's name or an account number, with sentinel values that still carry some meaning; in some cases that can be helpful for prediction. One other challenge is choosing N, the number of components, when you eventually factor topic vectors out of the count matrix. N is the number of topics you want in the end, but of course you don't know a priori what that should be. So you can make use of what is labeled, with all its limitations, to estimate how pure the resulting clusters are. Here's an example of a test on a subset of 100,000 notes, where we can see that within this data set we have about 9,000 contacts related to online banking support in its various forms. It's good to be able to recognize these contacts and label them as such, because perhaps they're not so informative for an investment product, but they may be informative for churn, or digital affinity, or the use of new digital tools, et cetera.

So now we understand our data universe, we have some nice features, and we have some targets which tell us what we want to learn; that could be which products people are buying, whether a client is going to leave the bank, and so on. We can start training models, because we have the data to do so. And we can do this in one of two ways, what we call dynamic on the one hand and static on the other. Dynamic is a setup where you want to understand what the client was like when they acquired the product, so you go back into the history and look at their profile before they made the purchasing decision. Static is an alternative approach where you're just interested in how the client looks today; you use it for products that don't materially affect the client profile and that usually involve decisions clients can make day to day, rather than taking a while to think about.

With those two approaches, we use these libraries. It's probably no surprise that this is the main stack we use, and I don't need to go into too much detail because you're probably all using them. I just want to make a special mention of this book, Python for Data Analysis. It's the book I learned Python with, seven years ago or so, and it's my favorite book. I'd say to anyone who wants to get into data analysis or data science with Python that there are far worse places to start.

A lot of the models we train are trained on very imbalanced data, because we have products in different categories: a premium package for wealthier clients, and more entry-level products such as a basic banking package.
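Going back to the client notes for a moment, here's roughly what that topic-extraction step could look like. This is a minimal sketch that assumes NMF over a count matrix (the factorization method is my assumption); the sentinel regex, the stop-word list, and the client_notes input are all illustrative.

```python
import re
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

def add_sentinels(note):
    # Replace non-informative specifics with sentinel tokens that still
    # carry meaning; this account-number pattern is hypothetical.
    return re.sub(r"\b\d{2}-\d{6}-\d\b", "ACCOUNT_NUMBER", note)

# client_notes: an assumed list of German-language note strings;
# german_stop_words: an assumed German stop-word list.
notes = [add_sentinels(n) for n in client_notes]

vectorizer = CountVectorizer(stop_words=german_stop_words, min_df=5)
counts = vectorizer.fit_transform(notes)

# N, the number of topics, is unknown a priori; in practice you sweep it
# and use the (imperfect) hand labels to gauge cluster purity.
nmf = NMF(n_components=20, random_state=0)
topic_vectors = nmf.fit_transform(counts)       # one topic mix per note
dominant_topic = topic_vectors.argmax(axis=1)   # hard assignment per note
```

One crude way to pick N is to compute, for each candidate value, a purity measure of the resulting clusters against the seven hand labels, for example sklearn.metrics.homogeneity_score, and look for where it stops improving.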
With these banking packages, at the entry level you get a basic bundle, which is a card and an account; at the premium end you could have insurance benefits, a platinum card, and so on. So if you're trying to understand what makes people buy these products, you have very imbalanced data sets to work with, and you have to use balancing methods where appropriate. I'd recommend looking at the imbalanced-learn library, which is a great complement to scikit-learn, and it has its own pipeline function.

Pipelines are great. In scikit-learn, they have the same API as transformers and classifiers: fit, transform, predict. Not sure why they chose fit instead of train, but fit, transform, predict. You can package all the steps of your machine learning pipeline into one object and then treat that like a classifier, which allows you to do things like grid searches for hyperparameters just as you would with a single classifier. In this case we have a function from imbalanced-learn that looks exactly the same as make_pipeline from scikit-learn, except it lets you put sampling methods, like this random under-sampler, in the middle. That's a very simple way of balancing your data set before you do some learning. There are more advanced sampling methods such as SMOTE and so on, and this library gives you a lot of possibilities for balancing.

One of my favorite aspects of pipelines is their modular design. We put pipelines in our pipelines, and we like to reuse existing pipelines, because a lot of the same data shows up in the different machine learning problems we work on. We use column transformers to split out the numeric and categorical data, and we have a lot of categories that we want to treat in a certain way. For example, take the domicile of the client: there are certain products that can only be offered to clients in a given domicile, say only clients from Switzerland, or not clients from the US. So depending on the product, you can keep the data as it is, or use a custom transformer to convert it into a kind of reduced, prototype set of categories, and just plug that into your new pipeline as needed. I'll show a rough sketch of this whole pattern at the end of this part.

We also use one other module quite heavily, but it's our own in-house library, and it has been very helpful for standardizing the machine learning process. Like I said, machine learning in our department was very fragmented before, with different technologies. Here we build reusable components and abstract away a lot of complexity, such as creating training data sets from our data universe, doing it consistently in the right way, and avoiding having to write the SQLAlchemy or SQL code yourself. Its main function is bridging the gap between the nice Python universe of modules and our upstream data sources, as well as our downstream sales channels. It helps us do our work more efficiently.

So we have now got to the point where we have some models, trained properly, hopefully, and we want to start making recommendations to our client advisors so that they can offer the products the client may need. To do that, you take the features you used to train the model, of course, and you need their current values for all your clients. And you use the targets to exclude those clients who already have the product.
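Going back to the pipeline discussion: put together, the pattern looks roughly like this. A minimal sketch under stated assumptions: the feature columns, the domicile reduction, and the random forest at the end are illustrative choices, not our actual setup.

```python
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import (FunctionTransformer, OneHotEncoder,
                                   StandardScaler)

def reduce_domicile(X):
    # Custom step: collapse raw domiciles into a reduced prototype set
    # (purely illustrative grouping).
    return X.where(X.isin(["CH", "EU"]), "OTHER")

# Split out numeric and categorical columns (hypothetical feature names).
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["aum", "equity_share", "trades_12m"]),
    ("dom", make_pipeline(FunctionTransformer(reduce_domicile),
                          OneHotEncoder(handle_unknown="ignore")),
     ["domicile"]),
])

# imblearn's make_pipeline accepts samplers as steps, so the balancing
# happens inside the pipeline, before the classifier is fitted.
model = make_pipeline(preprocess,
                      RandomUnderSampler(random_state=0),
                      RandomForestClassifier())

# The whole pipeline behaves like a single classifier, so a grid search
# over its hyperparameters works directly (X_train, y_train assumed).
search = GridSearchCV(model,
                      {"randomforestclassifier__n_estimators": [100, 300]},
                      scoring="f1")
search.fit(X_train, y_train)
```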
On top of excluding existing holders, you also want to incorporate other exclusions: for example, clients who cannot be offered this product because they refused it in the past, or they're too young or too old, or they're in the wrong domicile, et cetera. You put all that together and make the predictions. With most of these models, if it's a classification model, you get a class plus a score from zero to one, which you can interpret as how likely this client is to buy the product. You can use that to rank clients and focus on those most likely to buy, instead of simply trying to contact everyone and wasting a lot of time.

But providing just the score and the class to the client advisor is probably not enough, because the client advisor needs some information to sell the product. They need to understand: why should I contact my client? Why should I offer them this investment advisory product? Here's an example from the front tool, which is one delivery channel; this is what the client advisor works with on a daily basis. You see rows of leads that we provide for different products in their client book. They can click a lead and see a breakdown of the different rationales for why this client needs this product. In this case it's an investment advisory product, and you can see that the client doesn't have an investment consultant, but they do have a lot of equity, a lot of assets, they trade a lot, and they have a lot of contacts related to investment. So that's a good lead for this product.

So how do we do that? The models are just providing scores and classes, so how do we provide this tailored list of rationales? We use explainability methods. One explainability method I want to tell you about is treeinterpreter, which helps us see the forest for the trees. It's a very simple package to use if you have, for example, a random forest classifier. Here's a snippet, with a rough version of it below: you train the classifier, and instead of using the predict method of that classifier, you use ti.predict, the treeinterpreter predict. That returns not only the prediction, but also a breakdown of the prediction in terms of a bias and contributions. The contribution is the contribution of each feature in your model to that score; it can be positive or negative, and it's computed by aggregating over all the trees in your random forest. Here's an example of the contributions for a churn model. You can see they're all negative, so these are all features that reduce the churn risk for this client: they have some volume; they have mortgages with us, and those tend to be clients who stay; they have inflows from other banks, so they're obviously happy with our service and bring in assets; and their private account volume is growing. We can take these contributions and combine them in a sensible way to produce a list of rationales, the reasons we give our relationship managers and advisors for why this client needs the product.

The last step in the process is monitoring the sales channels, and this is where you get the feedback that enables improvement of the process. The leads go to the sales channels, and then we have a monitoring process: we monitor the activity on these leads, seeing which ones are closed, which are won and lost, accepted, rejected. And then we compute metrics such as impact on revenues, precision and recall, et cetera.
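Going back to the explainability step, in code the treeinterpreter part looks roughly like this: a minimal sketch assuming a binary churn classifier, where X_train, y_train, X_client, and feature_names are placeholders for the actual data.

```python
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

# Train as usual (X_train, y_train assumed to exist).
rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

# ti.predict returns the usual prediction plus a decomposition of it:
# a bias term and one contribution per feature, aggregated over trees.
prediction, bias, contributions = ti.predict(rf, X_client)

# For a classifier, contributions has shape (n_samples, n_features,
# n_classes); take the churn class for the first client and rank.
contrib = contributions[0, :, 1]
rationales = sorted(zip(feature_names, contrib),
                    key=lambda pair: abs(pair[1]), reverse=True)
for name, value in rationales[:5]:
    print(f"{name}: {value:+.4f}")  # signed: negative reduces churn risk
```

From there it's a matter of mapping the top contributions to human-readable rationale texts for the advisor.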
We then feed all that monitoring information back into the model training process on a regular basis to update the models. We also get information about which clients have rejected these products, so we don't have to include them in future campaigns and can exclude them from future deliveries.

I just want to make a quick note here. This is not a Python part, but like I said, it's important to be able to track what happens to your leads, and you can't do that with Excel. You can do it by building a front tool with an interface and a process whereby the leads go into the client channel; the advisor can see and assess them, make a contact, meet the client, and potentially close the deal. All of that is recorded in the system and comes back to us in an evaluate step, and we can use that information to improve the process. What I showed you there was the top left, client book planning, which is the client advisor workbench part. We have other channels: in blue, we also have a data-driven process for mailings, as well as in the contact center, and we're looking to build these processes out in other channels too.

So that brings me to the end of the talk. I want to say thank you first to the audience, but also thank you to the Python community, especially the developers of the PyData stack, all of these great libraries that we use on a daily basis. It can be a thankless task sometimes, I can imagine, but it's great that people like me can use these tools and solve real-world problems in a very easy way. So thank you, and thank you, the community.

If anyone has questions, there are microphones in the aisles. We have two or three minutes.

Hi, thank you for the presentation. I wanted to ask: if I understood correctly, there's a bit of a feedback loop, in that the leads that are generated and then result in sales go back into the training as correct data. But what about clients who were not included in this first batch? Will leads be generated for them somehow, just randomly? Because there's a risk that we won't spot a new group of customers if we're constantly focusing on this loop. Have you anticipated that in any way?

So if I understand your question, it's about missing out on potentials by only delivering certain leads. We tend to deliver only the leads with the highest ranking according to the model, and those without a sufficient level of potential are not sent to the client advisor. But we do use some of those clients as a control group when evaluating models or evaluating new channels, because sometimes a new channel might look ineffective simply because the evaluation isn't properly constructed; you have to mix the potentials in with the no-potentials to be able to do a proper statistical analysis. Otherwise, we just focus on those the model assigns high potential, because we see in the evaluation of the models that there's good accuracy.

Okay, thank you. Is there someone else? Questions?

One question from me about the Excel usage. Why are you banning it entirely? Did you try to parse the files and so on, so that your users could keep the tool they're used to while you still extract and analyze the data? Why this red flag for Excel files, basically?

Yeah, so we still use Excel files sometimes, for evaluations of new models or for a quick turnaround with new stakeholders and so on.
But the problem with Excel is that you have no control over it once it goes out. There's nothing stopping the person you deliver it to from messing up the data or getting the indexing wrong, and then it comes back to you and you have no information you can use. I don't want to hate on Excel; it's a great tool, one of the 50 things that made the modern economy, as they say. But as a channel for a data-driven process, it has a lot of limitations. The user can do whatever they want with the data; that's the problem. You can't lock it down.

And are your users willing to give up Excel for the sake of the data analysis? I mean, do they understand the advantage of leaving Excel for a tool that has more limitations for them, basically, but that eases your analysis, which is not their job?

Yeah, people do love their Excel, right? And they like to keep their own numbers. So you have to make sure that the tool you want them to use instead is easier to use and gives them some benefits. That's why we were closely involved in designing this client advisor workbench, so that it gives them the information they need without them having to keep their own Excel sheets. But no, I think we need a rigorous process that gives high-quality data feeding back into the process, so not Excel.

Hi, thank you. I was wondering how you serve your models, or their results. From the process, it seemed like you put out a data set with the predictions and then feed that into the front end. Is that the way, or are there APIs involved, or something completely different?

Sorry, I didn't catch the first part. How do we what?

How you serve the results of your models. How you serve your models.

Serve, yes. You mean make them available to the front end, basically? So we have an interface between our application and the application where the leads are displayed. It's not a very sophisticated process; it's a flat file, and a lot of things still work by flat files. It goes from one database to another, gets updated in the tool, and then comes back to us via a data feed in the warehouse, sure.

Paul Hughes, everyone.