Thank you so much. The topic is the intersection I'm really interested in working in: the intersection of data, visuals, and stories. To set the context, let me start with a quote: "We don't see things as they are; we see them as we are." How many of you have heard the story of the blind men and the elephant? A few — not too many. It's a very famous story, told in many cultures: told by the poet Rumi, told in Japanese culture, told in India. The story goes that six blind men were brought to see an elephant. "To see" is really a misnomer, because they can't see, so they do what they can: they start to feel the elephant and see what they can understand about it. The first one goes to the side of the elephant, feels its sturdy, broad flank, and says it's really like a wall. The second is at the front and feels a tusk — something pointy and sharp — and says this definitely has to be a spear. The third grabs the trunk, feels it wriggling around, and says it's definitely a snake. The fourth grabs a knee, feels its rigidity and solidity and the ridges around it, like bark, and says it's definitely a tree. The fifth is up on top; he grabs an ear, feels it flowing from one side to the other, and says this is definitely a fan. And the sixth, by the time he's able to come around, the elephant has turned, and he only reaches the tail — so he says, definitely, this is a rope.
So, as the poem goes, these men "disputed loud and long, each in his own opinion exceeding stiff and strong; though each was partly in the right, and all were in the wrong." I don't know whether that sets the context for what you do in machine learning or data analytics, but what we mostly try to do in this world of data science is use data as a clue to the final truth. And I use the word "clue" deliberately — much like the blind men trying to figure out what an elephant looks like, data is a clue to the underlying truth, and that's why we use it as a lens through which to understand the world. And as with any lens, it has its own flaws and its own distortions. The way I think about data science — about applying this lens — is through layers of abstraction. If you were to try and understand a real-life phenomenon, which is what we're doing in data science, whether it's somebody buying a product or any other aspect of real life, we can look at it in three different ways. We can look at the data that comes when we measure real life — in this case, say, a pendulum swinging. We can think of a visual abstraction that we then represent. Or we can think of a symbolic abstraction — a model — that we derive. Each of these abstractions stands in for the phenomenon that's happening: if they're close to the real-life phenomenon they describe it well, but data is a clue, and they do not fully capture it, right?
The way to think about this is that these are three layers of abstraction — data abstraction, visual abstraction, and model abstraction — and we're trying to understand a real-life phenomenon through them. Those are the three fundamental representations we work with in data analytics and data science, along with the story abstraction, which is how we work in real life to a large extent. There's a saying that geometry without algebra is dumb, and algebra without geometry is blind. In my experience over the last few years of teaching this topic — talking about visualization, talking about data science — I find a lot of people have a bias towards one side of the abstraction: either they prefer the visual side, or they lean too heavily on the symbolic side and just look at numerical summaries and outputs. But the two go hand in hand, and that's really the theme of this talk: how do we bring them together so we can enhance the power of what we learn from the models we build?

To build some context for what model visualization means, let's take a very simple example — and since we're in Singapore, let's make it a financial-services example. Let's play a trading game. The game is very simple: a trader has to decide when to buy or sell a particular stock, so he has only three options — buy, hold, or sell. Let's give him some starting constraints: he has a portfolio he's trying to manage, starting at about a thousand shares; the price of the share at this moment is a hundred Singapore dollars; and he has about 500 minutes of trading available in the day — he's basically a day trader. So we have these three constraints, and we start the clock.

First, we visualize the data, so we have a representation of what it looks like: this is the price of the stock over the 500 minutes of the day, and that, to some extent, is the data on which we trade. So far with me? As good practitioners of machine learning or data science, we start with a very simple model, and the simplest model here is what I call the one-minute strategy: every minute I look at the price; if it is more than 5% up, I sell one share; if it is more than 5% down, I buy one share; and in between, I hold. So I have my three actions, and every minute I take that decision — as a well-trained day trader I should probably be watching it every minute anyway.

Now I can take the data I had and put the model on top of it. The first step is to visualize the model within the data space — in this case, price against minutes. This is how my one-minute strategy would play out: I'd start at about 1,000 shares and by the end close somewhere around 775. That's the first step, and the one we don't often do: visualize the model in the data space. Once you've done that, the second and third steps are varying the model parameters — the tuning parameters — so that you understand not only how the model works but also the process of fitting it. To do that we vary the model within some bounds: we had a 5% threshold up and down, and we now sweep a range from 1% to 10% — 1%, 2%, 3%, 4%, all the way to 10% — so we have ten parameterizations of the model, and we see how each performs in the same space. The magenta line is the 5% one, and the other lines cover the range from 1% to 10%, so we now have ten lines. We're now visualizing the model itself, and this really helps build an intuition for how the model works and what the fit looks like. That's steps two and three.

But we don't stop here: we're interested not just in one model but in the whole space of models we could work with. In real life you might build a hundred models; since we don't have time for that, we'll add just one more. Alongside the one-minute strategy, we add what I call the two-minute strategy: only if the price stays beyond the threshold for at least two consecutive minutes do I execute the trade; otherwise I don't. So that's another model, and as you would in real life, we add it to the model space and see what it looks like: we still have the magenta line, the original, and now another ten lines from the new strategy. We're now exploring the process of model fitting, and with this we start to get an idea of how at least these two models perform across the model space.
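The whole game so far — the one-minute and two-minute strategies plus the 1–10% threshold sweep — can be sketched in a few lines. This is a minimal reconstruction, not the speaker's actual code: the talk doesn't pin down the baseline for the ±5% rule, so comparison against the starting price is assumed here, and the price path is just a generated random walk standing in for the real intraday series.

```python
import random

def run_strategy(prices, start_price, threshold, shares=1000, patience=1):
    """Replay a threshold strategy over a sequence of minute prices.

    Each minute: if the price has been more than `threshold` above the
    starting price for `patience` consecutive minutes, sell one share;
    if more than `threshold` below for that long, buy one share;
    otherwise hold. patience=1 is the one-minute strategy, patience=2
    the two-minute strategy.
    """
    holdings = [shares]
    above = below = 0
    for p in prices:
        change = (p - start_price) / start_price
        above = above + 1 if change > threshold else 0
        below = below + 1 if change < -threshold else 0
        if above >= patience:
            shares -= 1          # sell one share
        elif below >= patience:
            shares += 1          # buy one share
        holdings.append(shares)  # otherwise hold
    return holdings

# A generated 500-minute price path around S$100 (simple random walk).
rng = random.Random(0)
prices, p = [], 100.0
for _ in range(500):
    p += rng.gauss(0, 0.5)
    prices.append(p)

# Steps two to four: sweep the threshold from 1% to 10% for both
# strategies, giving 20 holdings curves to draw over the same data space.
curves = {(t, pat): run_strategy(prices, 100.0, t / 100, patience=pat)
          for t in range(1, 11) for pat in (1, 2)}
```

Plotting each curve in `curves` (magenta for the 5% one-minute line) reproduces the overlaid picture described above.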
But we don't stop even there; we do one more thing: we visualize the model with different inputs. So far this is its behaviour on one particular input, but the data itself will change, so we want to see the model on many inputs. We take more samples of data — in this case generated ones, so we have many more generated price paths — and look at how the model performs on them. By now an envelope of performance clearly starts to emerge, based on the two strategies, the range of model parameters, and the range of inputs you generate. Those are the five steps we've gone through to build an understanding of how the model will perform. Does that make sense? Anyone with me? We can take questions in the middle as well as at the end.

So this, in principle, is the model-visualization approach, and it involves five things: visualize the model within the data space; vary the model parameters; understand the process of model fitting; look at the entire model space; and look at the model across varying inputs. If we cover these five aspects, we have a fair amount of visual understanding, and then we can exercise judgment. Now, model visualization is more of an art than a science. It was fairly straightforward in this toy problem, but the moment you go to a real machine-learning problem, the number of dimensions, variables, and input parameters you need to play with really explodes. Visualization is very good at tackling what I call the second level of ignorance — "I don't know what I don't know" — but it's not really good at scaling with data. It's an art also because it isn't well developed yet; there are still plenty of open challenges.

The idea of model visualization is to aid translation: to take the implicit knowledge I have about the data — what's in my head — and capture it explicitly in the model. That's the process it's trying to aid: ensuring that the mental model I have, from looking at the data or from past experience, is captured in the model in some explicit form. This matters because the machine-learning process goes through basically seven steps. You start by framing the problem — getting a good idea of the scope of what you're trying to solve. You then acquire the data, which is really the hard part — often 80% of the effort. You refine the data, because data is inherently messy. You transform it, because it may not be on a scale you can process — that's one important part of the process. You explore it, because you want to understand the features before you play with them — again, I don't know what I don't know. Then you come to modelling, the sixth step, where you figure out a model that actually represents the phenomenon. And once you've done all that, you come to the insight part, where you explain what's happening. Now, if you map this back to our data abstraction, symbolic abstraction, and visual abstraction, you basically loop through transforming, exploring, and modelling in a very cyclical fashion
or rather a continuous fashion: you transform the data and explore it; you transform again and model again. That loop is very strong when you do traditional statistical analysis, but when you come to ML approaches, the link between transformation and visualization — and especially between modelling and visualization — is really weak. The ML approach tends to focus just on predictive ability: if the model is predicting well, I'm happy with it. I may be careful about how I assess it, but as long as the predictions are improving, I'm biased towards taking the better-scoring model. In the process, what we end up generating are more and more black boxes, where the model does a really good job but I don't understand why something is happening. The intuitive mental model I have about why it works or doesn't work is lost; I have no way of capturing my real-world knowledge in the model, because as I move to more complex black-box approaches, my ability to understand them goes away. The other problem is robustness: I don't really know whether the model is sustainable — if something fundamental changes in the market I'm trying to model, will this model still stand? And given that much of my work is to explain rather than only predict — or at least to do both, explain as well as predict — it's really hard to take machine-learning and deep-learning models just like that. What you want is to strengthen this link between modelling and exploration so that explanation is even possible. Does that make sense?

So how does this link back to machine learning — the five approaches I talked about at the start? Visualizing the model within the data space links quite naturally to dimensionality reduction: reducing dimensions helps us figure out what we can do within the data space. Varying the model parameters is effectively what we do in feature selection. Understanding the process of model training is what cross-validation does. Playing with the entire model space — figuring out which combinations work, and in some cases bringing models together because they combine to do better — is what ensembles do. And playing with different input data sets — because in real life we can't generate data the way I did in my toy example — is what bootstrapping gives us. So the model-visualization approach maps very nicely onto these five aspects of machine learning. However, the tools to do this are really lacking, in Python and elsewhere; it is not easy. One of the things we're trying to do is build something that allows it — we're at a very early stage, but the idea is a model-visualization library that effectively supports all five aspects.

To share what we've done so far, and the ideas we have, let's take another toy example, this time a classification problem, much closer to the machine-learning side. This is an example of predicting the quality of wine. It's a dataset from the UCI repository, about red wines: roughly 1,600 observations on twelve variables. One of those is the target feature, based on sensory data — somebody has evaluated the wine, and a median evaluation has been recorded as a grade, in this case on a scale of 3, 4, 5, 6, 7
and 8. So the target is a subjective, judgment-based feature, and the bulk of the data sits in the middle, at grades 5 and 6. Then there are eleven features based on chemical tests. I've ordered them in a way that makes some sense for talking about the data. Alcohol is one of the key features — it's wine, so alcohol content is obviously important. Density and pH form another group. Residual sugar is another. Then the acidity components — fixed acidity, volatile acidity, citric acid — which drive those sour aspects. And the inorganics — chlorides (the salts), free sulfur dioxide, total sulfur dioxide, and sulphates — make up the rest. So there are basically five categories of variables, with different measurements within each. If anybody here has ever tasted wine, we know that the taste is really much more than these chemical components — it's a balance between them rather than any single variable on its own, every time you smell or taste it. So how do you combine that into a machine-learning model?
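The shape of the dataset can be illustrated as below. This is a sketch: the real UCI `winequality-red.csv` has about 1,599 rows and eleven physicochemical columns plus the quality grade, so a tiny hand-made stand-in with the same column layout is used here to keep the snippet self-contained.

```python
import pandas as pd

# In practice you would load the UCI red-wine CSV, e.g.
#   df = pd.read_csv("winequality-red.csv", sep=";")   # ~1,599 rows
# The values below are illustrative stand-ins, not real observations.
df = pd.DataFrame({
    "fixed acidity":    [7.4, 7.8, 11.2, 7.9, 6.7, 7.5],
    "volatile acidity": [0.70, 0.88, 0.28, 0.60, 0.58, 0.50],
    "citric acid":      [0.00, 0.00, 0.56, 0.06, 0.08, 0.36],
    "residual sugar":   [1.9, 2.6, 1.9, 1.6, 1.8, 6.1],
    "alcohol":          [9.4, 9.8, 9.8, 9.4, 9.2, 10.5],
    "pH":               [3.51, 3.20, 3.16, 3.30, 3.28, 3.35],
    "quality":          [5, 5, 6, 5, 7, 3],
})

# First look: how are the quality grades (3-8 in the full data) distributed?
grade_counts = df["quality"].value_counts().sort_index()
```

On the full dataset this count is exactly the picture described above: a large bulk at grades 5 and 6, thin tails at 3, 4, 7 and 8.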
Most of the approaches I've seen so far do a very dry ML pass — just throw all the features in together, with very limited intuition. Coming from the visual side, you'd probably start with some exploration. You'd begin with single-variable explorations — here, for instance, how alcohol and quality are linked: one variable is alcohol, the other is quality, and you can clearly see some trend you're trying to assess. As a good student of ML you'd run a whole set of these single-variable explorations across a few variables, and then move on to two- and three-variable explorations, to see whether there are interactions and whether a pattern comes out. And here, with alcohol and volatile acidity, you clearly start to see patterns: the worse, simpler wines — grades three and four — are clearly in the top-left corner; the better ones are clearly bottom-right. More alcohol really does seem to be good for wine. You can continue this on a two-variable basis, exploring both numerically and visually, and you'd probably end up doing some level of multidimensional exploration. One of the easiest ways to do a ten-dimensional exploration is to put everything on a parallel-coordinates plot and look for interactions. Here quality is the axis on the left — you can see 3, 4, 5, 6, 7, 8 as the six nodes everything flows from — and if you link the axes and select some grades, say three against seven and eight, you can start to see interactions. So this is still not really
model visualization — it's still traditional multidimensional visualization — but the idea is to build an intuition, so that when we get to the modelling we're more focused about what to do. A traditional ML approach here could be linear regression: throw all the variables in together — quality against alcohol, pH, and the rest — and you get a modest model score. You can obviously go on to fancier techniques, but the question here is: once we have a model — a linear model — around this, what can we start to see, so that we can build more intuition about whether the model is working or not?

The first technique — visualizing the model within the data space — is to add the predicted data: use the predictions generated by the model as just another variable in your data, so that you can play with it, manipulate it, and find ways to visualize it. The challenge is really the visualization: you've added the variable, and now you have to find a way to show it effectively. One way: you're still using quality and alcohol as dimensions, but the black dots now show the model's predictions as a progression through that space. Once you put that extra variable in, you can start the visual exploration again and build more intuition about how the model behaves, in many dimensions. The moment you add the predicted data back into, say, a data frame, it's just another variable, and you can explore how it correlates with everything else — is the prediction really explaining things, and how? Everything you could do with multidimensional visualization starts to apply. That's one.
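The move just described — fit a model, then put its predictions back into the frame as an ordinary column — can be sketched as below. A plain least-squares fit on a hand-made stand-in for the wine data is used (column names follow the UCI set, the numbers are illustrative), so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the wine data; in practice, the full UCI frame.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 11.2, 10.5, 12.0, 9.0, 10.0, 11.5],
    "pH":      [3.51, 3.20, 3.16, 3.30, 3.25, 3.40, 3.22, 3.18],
    "quality": [5, 5, 6, 6, 7, 4, 5, 7],
})

# Ordinary least squares: quality ~ 1 + alcohol + pH.
X = np.column_stack([np.ones(len(df)), df["alcohol"], df["pH"]])
coef, *_ = np.linalg.lstsq(X, df["quality"].to_numpy(), rcond=None)

# The key move: the model's predictions become just another variable,
# so every multidimensional exploration technique now applies to them.
df["predicted"] = X @ coef
```

From here, `df.plot.scatter(x="alcohol", y="predicted")` or a parallel-coordinates plot including `predicted` treats the model output exactly like any measured column.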
The second technique is to visualize the model error, and the easiest way — one that applies across all models — is to play with residuals. Every model, irrespective of its complexity, has residuals as long as there's a target variable, and with residuals you can see how much of the pattern is being explained. When you plot residuals, you're subtracting the pattern from the data and seeing how much of it has gone into the model; intuitively, as you keep reducing the residuals, more of the pattern is moving from the data into the model. You don't want everything to go — if everything is gone, it's probably an overfitted model; you want a simple model that explains as much of the pattern as possible. So instead of plotting quality on the axis, you now plot the residuals — residuals against alcohol, in this case — and see whether there's any structure left. Again, the residual is just another variable: you can see which other variables it moves with and think about how to explain that. Plot residuals against alcohol and acidity together, and you already start to see some interesting patterns you can probably explain — wine, as I said, is a balance of many of these components; sweetness, acidity, and alcohol all come together. What we've done here is a simple linear regression on all eleven features — are there interaction effects that, if you combined variables, would start to explain some of these residuals? So you can take the intuitive understanding you have about the data, bring it to the model, visualize it, and add all this to your toolkit.
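The residual step can be sketched the same way — again a least-squares fit on illustrative stand-in data, with the residuals kept as one more column to explore.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 11.2, 10.5, 12.0, 9.0, 10.0, 11.5],
    "quality": [5, 5, 6, 6, 7, 4, 5, 7],
})

# Fit quality ~ alcohol and keep the residuals as another variable.
X = np.column_stack([np.ones(len(df)), df["alcohol"]])
coef, *_ = np.linalg.lstsq(X, df["quality"].to_numpy(), rcond=None)
df["residual"] = df["quality"] - X @ coef

# Subtracting the pattern from the data: any structure left in a
# residual-vs-feature plot is pattern the model has not captured, e.g.
#   df.plot.scatter(x="alcohol", y="residual")
unexplained = (df["residual"] ** 2).sum()
total = ((df["quality"] - df["quality"].mean()) ** 2).sum()
r_squared = 1 - unexplained / total   # share of pattern inside the model
```

As the talk notes, driving `r_squared` all the way to 1 is not the goal — that is the overfitting end of the trade-off.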
Adding the predictions and adding the residuals are two ways to do it; let me share one or two more. This is really a classification problem that we've been treating as regression, so we can also classify it. For simplicity, start with binary classification: put grades 3, 4, and 5 in one class — class 0, the lower class — and let class 1 be the better, let's say premium, quality wines. Since the bulk sits at 5 and 6, that gives a reasonably balanced two-class problem to play with. One thing you'd do is take a pair of variables — alcohol and volatile acidity, say — and plot them for the two-class problem; you can already see that on these two variables alone you have something that starts to segment the classes. The technique you want here is to play with projections of the data: I'm only looking at a two-dimensional view, but it's possible to look at this in 3, 4, 5, up to all 11 dimensions through multiple projections. How would the data look if you rotated it into another projection — projection one, projection two? This obviously brings in all the dimensionality-reduction techniques, but what you really need — and what Python doesn't really have — is guided tours: tools that rotate the data through many views so you can examine it and find interesting patterns, or steer it manually. Either projection pursuit or guided tours would let you do that, and doing those projections would really enhance your intuitive understanding of the model.
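The projection idea can be sketched with plain linear algebra. This is an assumption-laden stand-in, not the tooling the talk asks for: random data replaces the eleven wine features, PCA gives the "best" static 2-D view, and a handful of random orthonormal planes stand in for frames of a guided tour.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for 150 wines described by 11 standardised features.
X = rng.normal(size=(150, 11))

def project_2d(X, seed=None):
    """Project n-dimensional points onto a 2-D plane.

    With no seed this is PCA (the two directions of largest variance);
    with a seed it is a random orthonormal plane — one frame of a crude
    tour through projection space.
    """
    Xc = X - X.mean(axis=0)
    if seed is None:
        _, _, vt = np.linalg.svd(Xc, full_matrices=False)
        basis = vt[:2].T                       # top two principal axes
    else:
        r = np.random.default_rng(seed)
        basis, _ = np.linalg.qr(r.normal(size=(X.shape[1], 2)))
    return Xc @ basis        # 2-D coordinates to scatter, coloured by class

pca_view = project_2d(X)                             # one static projection
tour_frames = [project_2d(X, seed=s) for s in range(5)]  # rotated views
```

A real guided tour would interpolate smoothly between such planes and let you steer; this sketch only shows the projection step each frame performs.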
The final approach I'll just touch on is, once you have a classifier, understanding its boundaries. Say you start with a logistic regression, and we plot what the classification looks like once it's fitted — there's effectively a line passing through the data (we're doing it on just two variables here, not the whole dataset). What we can do now is evaluate the boundaries, and there are a couple of ways to do it. One is to use the model's predicted probabilities: look at the same plot again, but this time coloured by the probability of belonging to each class; you can see where the 0.5 level sits, and the boundary is going to be there. So you can use that information to find the boundary, play around with it, plot it, and look at it in interesting ways — easy enough in a two-dimensional space. The other way, at least for two dimensions, which is what I'll show, is to create a mesh over the entire range of the data. The moment you mesh the whole range, the boundaries become easy to see. Here's what the mesh looks like for the logistic regression: you take a range from 0 to 1 in scaled alcohol, and likewise on the other axis, and you see the linear boundary. If you do the same for, say, QDA — quadratic discriminant analysis — you get a quadratic curve separating the classes; in two dimensions it's easy to build this intuition. And if you want an example of overfitting, run a random forest: it gives you fantastic accuracy on these two variables, and from the mesh you can clearly see how the random forest overfits the data. So creating a mesh to explore the data and the boundaries is another very effective technique.
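The mesh idea can be sketched end to end: fit a classifier on two features, then classify every point of a grid spanning their range. A hand-rolled logistic regression on synthetic two-class data is used so the snippet is self-contained — in practice you would use scikit-learn and the real (scaled) wine features.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two clouds standing in for the two wine classes on the
# (alcohol, volatile acidity) plane, both scaled to roughly [0, 1].
X0 = rng.normal([0.35, 0.65], 0.1, size=(60, 2))   # class 0: lower quality
X1 = rng.normal([0.65, 0.35], 0.1, size=(60, 2))   # class 1: premium
X = np.vstack([X0, X1])
y = np.r_[np.zeros(60), np.ones(60)]

# A minimal logistic regression fitted by gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

# Mesh the whole range of the two features and score every grid point:
# the 0.5 contour of the predicted probability is the decision boundary.
gx, gy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
grid = np.column_stack([gx.ravel(), gy.ravel()])
prob = 1 / (1 + np.exp(-(grid @ w + b)))
boundary = prob.reshape(gx.shape)   # e.g. plt.contourf(gx, gy, boundary)
```

Swapping the classifier for QDA or a random forest and re-scoring the same `grid` reproduces the curved and overfitted boundary pictures described above — which is exactly why the mesh is such an effective comparison device in two dimensions.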
Now, obviously, a mesh doesn't scale very well: in two dimensions the number of grid points is still manageable, but the moment you have to build an n-dimensional mesh, the computation explodes, so you need smarter techniques to create those meshes — those model views — effectively. This is all still at an early stage; we wanted to share some ideas about how to do it. It's not easy — a lot of these approaches are not articulated very well — so we need to work out where these techniques apply and build a corpus of knowledge around doing model visualization in a way that actually aids the ML part, aids the assessment. And by no means am I saying we shouldn't use computation to take care of a lot of this; I'm saying we need to bring the visual part and the modelling part together in a way that gives us better understanding, because there's a lot of intuitive understanding that simply isn't captured when we go down a purely black-box route. There is always the curse of dimensionality here: visualization doesn't scale, so we need to think about how to handle that, which means tools like projection pursuit and tours. And much of this only really works when it's interactive. The parallel-coordinates plot I showed, for example, was the only plot not done in Python — you can draw parallel coordinates in Python, but the interactive part was done in a much older tool that handles it very nicely, and Python doesn't seem to have that at the moment; there's been very limited package development on this topic in Python. So what we are
trying to do is create two or three modules around this: one around projections — both tours and ways to build these boundary views — and then modules that bring in bootstrapping and cross-validation, so we can plot all the possible models and all the possible input data and see what we can learn from that in Python. This is the GitHub repo; right now it's very much just Jupyter notebooks building up these examples, and the idea is to prototype a little more and then take at least one of them — the projection one — to the stage where it can be used as a more standard package. Eventually, I work at the intersection of data, visuals, and stories, and those are my contact details. Thank you.

Question: Do you see this staying a human-driven process, or how does it enhance the machine learning process?
Yes — at least when I was thinking about this talk, this was still a human-in-the-middle kind of idea, in terms of how you look at the visualizations. But I think there's also an opportunity to take the predictions and do some kind of automatic analysis that surfaces interesting views back to the human. I'm a visual person, to be honest — really a small-data person, as I call myself — and I like to understand what's happening. Even though I'm happy to use all the automated tools that get built, I like to be able to explain rather than just predict. And in a lot of the work I do with corporates, there are far more explanation problems in the business domain than the few prediction problems that end up taking all the focus. Those explanation problems could probably benefit much more from this — maybe not so much image recognition or the TensorFlow talk we saw earlier.

Question: You were talking about the curse of dimensionality — I wonder if you have any methods for reducing dimensionality. The reason I ask is that in your wine example you had pH and three different types of acidity, which will be correlated with each other, so is there a method for striking some of those out?

There are dimensionality-reduction techniques for exactly that, and in this case I think there are a couple of variables correlated at 0.6 or even more; you can probably remove them from the dataset, try it, and test it statistically. The other way, if you want to do it manually, is to go up the dimensions — from single variables to pairs to projections and parallel coordinates — and look at the correlation structure visually. I prefer that up to a scale of maybe ten variables; beyond that it stops being beneficial, and beyond twenty or so it's not humanly possible to use visual techniques, so you go back to statistical dimensionality reduction. That's visualization: great, but it doesn't scale beyond a point. That's also why we need better tools and better ways to do projections, especially interactive ones — if interactive visualization let us create these views very quickly, to explore the data rather than to present it, we wouldn't have to agonize over whether a view is worth creating; right now it's so expensive in time that we often don't go down that path. I did a talk on visualizing multidimensional data — it's on my website — going up from one dimension all the way to ten and how you can do that. Thank you so much. Thank you.