So, this is the link you can open; basically, this is the repository. It has the slides as well as the code. I'd encourage you to check out the notebooks as we go along if you get time, or you can refer to them later, because this is a 45-minute presentation. We may not get time to cover everything line by line, but the intent is to understand what's going on: how do you visualize data, what are the main techniques and strategies we use, and then we have some nice bonus content as well. So, how many of you have seen the agenda on the conference site? Which part do you find the most interesting, or which part are you here for? Bonus, okay. We'll try to cover the initial parts quickly, so I can even walk through some code on interesting deep learning use cases I've been working on for one of my books. I expected the bonus would be the draw; that's why I added it. Good to see a full house. Honestly, there are three other great talks going on, so thank you for coming to this one; I myself wanted to attend another one. I was expecting maybe 10 or 15 people, so I underestimated this. We have four minutes, so: what are you expecting from this talk? Anything in particular you want to see, or why did you come here in general? Yeah, go ahead. Okay, sure. Yes, that was going to be one of my questions. So, how many of you here use Python or R in general for data visualization? Okay, wow. And commercial tools like Tableau or anything else? Okay, good mix. Today's code and examples are mostly in Python, because that's what we use internally for our use cases, and what I use in general. But you can do a lot of this in R too, and there are plenty of other open source and commercial tools. We won't showcase examples across all of them, obviously, due to lack of time, but we'll cover some interesting problems in Python, starting with structured data as you've seen in the agenda, and definitely cover the bonus content with some nice examples. In case of any issues accessing the link, just let me know and I'll reach out to you. So we'll get started, since there's a lot of content, but I want to cover the interesting parts towards the end too. Today's talk is on data visualization, as you've all seen in the agenda, which I've also put on the website. A basic introduction: what is visualization, why do we need it, and why do we need effective data visualization? Then moving on to the interesting part: how do you structure a visualization using the Grammar of Graphics, which is a very interesting framework, instead of playing around all over the place. Then a quick look at tools and frameworks; mostly I'll just name what's out there, because there isn't enough time to cover each in detail. Then some more examples of visualizing structured data using the Grammar of Graphics, where we go from one dimension up to six dimensions. And then the interesting part, because of which all of you are here: visualizing unstructured data. What do we really do with text data, image data, and also audio data? How do you really visualize something you hear, right?
Something you can't see. And then the final words. Slides and code are here, as I showed before, so feel free to access them; you can even open the notebooks and check them out now, later, and so on. A bit about me: I'm a data scientist at Intel, and I do a couple of other things like writing, doing some research, publishing articles and papers, and also mentoring and training people around data science. So, understanding the what and why of data visualization. The idea is storytelling with visualization: you build a nice story around the data you have using the right visuals, and with the combination of the two you build that story. How do you do that? A combination of art and science. As I said, data science typically is not just science; it's art too, and you need a combination of both for visualization. So what is data visualization? Basically, it's a technique to build a visual from data to convey the right information; that's the key definition. And why do we need it? Data aggregation, summarization, and visualization are typically the core pillars of any data science or analytics pipeline, from the traditional BI days, which a lot of you have been through, to now, as we move into the age of AI. It's adopted widely by organizations, and even if you build a model, in the end you have to visualize the results so people can see them. So why is data visualization important? We as humans are wired, through our eyes and brain, to look for patterns in everything we see and interact with daily. If I'm traveling from my home to work, I'll try to find a pattern: which is the shortest route, or which is an interesting route that might take less time, things like that. So we find patterns and build stories and narratives from them on a day-to-day basis. And in a business context, you shouldn't be building a model directly from the data. You should try to understand what your data is, what your features are, what's important and what may not be, so that you don't blindly build a model that can lead to cascading adverse effects. We'll show an example of this later. The idea is to enable better business insights through the right visuals. Some general statistics: about 90% of the information transmitted to the human brain is visual, and we process visuals 60,000 times faster than text. So obviously, if you have a visual, it's very easy to take in; we can process, observe, and comprehend a visual within a fraction of a second, which is much better than reading a bunch of text. And around 65% of people are visual learners; we love to see things. Now, an interesting thing: what do you think is common in all these images? Yeah, they have points, definitely. Exactly. The summary statistics of all of these are, interestingly, the same, but as you can see, the data is wildly different. That's the reason we need data visualization. This is an interesting adaptation of the classic visualization a lot of you might know: Anscombe's quartet.
As you can see, four very different datasets, but the summary statistics are pretty much the same. This basically tells us: don't model your data blindly; definitely visualize some aspects of it before building a model. So why effective data visualization? Why do we need to build effective visualizations? The answer is pretty simple: we need to focus on abstracting out unnecessary data, noise, and clutter, and show the information the right way with simple visualizations instead of building fancy visuals. And the way to do this is to leverage the concepts of the grammar of graphics to depict the right information using clean, concise visuals. There are some famous sayings here: a picture is worth a thousand words, and John Tukey, the inventor of the box plot, which we use day in and day out, tells us that the greatest value of a picture is when it forces us to notice what we never expected to see. So not only seeing the expected, but also the unexpected. Quick question: who is John Snow? Game of Thrones? Okay, we'll be talking about the other John Snow today: a famous physician who built a remarkable visualization back in 1854. Yes, exactly, a lot of you might know it; it's very famous: visualizing the Broad Street cholera outbreak. Obviously, there were no fancy tools then, no frameworks, no dashboards or anything. If you check out this visualization, these bar-like marks are basically a dot plot: the number of deaths due to cholera across the Broad Street area in the City of Westminster, London. Over 600 deaths happened in 1854, and no one knew why so many people were dying; healthcare wasn't good back then. The physician John Snow identified the source of the outbreak as this water pump, as you can see. How did he find this out? He narrowed it down because, as you can see, the frequency of deaths increases closer to the pump. And what he found was that the water the pump was dispensing was contaminated: the pump's supply was connected near sewer lines, and due to sewage leaking into the drinking water, people were getting cholera. So, an interesting visualization from 1854, and a source of motivation for us to do even better with all the nice tools we have now. Another interesting visualization from around the same time: Florence Nightingale, who you folks must know, popularly called the mother of modern nursing. She had a deep-seated interest in statistics as well as nursing, obviously, and she developed the polar area diagram. The visualization she built showed the causes of death for military personnel in the hospital where she was working. It's not a very simple visualization, but she built it in the 1850s. Nowadays a lot of you might see visualizations like this built with frameworks like D3.js, but nothing like that existed back then; she still did it by hand. So that's another source of motivation. The blue wedges are deaths from preventable diseases, deaths that could actually have been avoided; the red portions show the soldiers who died of wounds; and the black ones are the proportion who died of other causes.
So, a pretty interesting visualization, even back then. Let's talk about effective multidimensional data visualization: how do we visualize data in a structured way using the grammar of graphics? As you know, grammar is a set of structural rules that gives syntax and semantics to a language, like English. The grammar of graphics is a framework that follows a layered, structured approach: using specific layers and semantics, you build up a visualization instead of doing random trial and error. So what is this layered approach? This is exactly what it is. The lowest layer is the data, which is basically our dataset. The next layer is the aesthetics: what goes on the x-axis, what goes on the y-axis, whether we show some variable with colors, sizes, proportions, things like that. Then scale: if you need to scale some values, or represent multiple values using some scale. Geometric objects: what kind of plot it will be; geometric points like the scatter plot we saw earlier, or a line chart, or a bar chart. Statistics: this is optional, if you want to show some statistical measures in the plot, like a confidence interval or quantiles, and so on. Facets are very important, perhaps one of the most interesting pieces, because they let us go up to a reasonably high number of dimensions by creating subplots based on multiple dimensions, where each dimension is again a feature. And finally the coordinate system: typically whether you want a Cartesian system or a polar system, and so on. So, a quick example of the grammar of graphics. I'm using the plotnine library. ggplot2, as you folks might know, is a very popular library in R. There was an analogous Python library called ggplot, but it's not actively maintained anymore, and due to pandas making a lot of changes recently, if you try to do specific things like plotting statistics, it throws an error. plotnine is actively updated, so feel free to check it out. This is an interesting dataset a lot of you may have seen: the classic cars dataset, mtcars. You have the actual cars, the miles per gallon, the number of cylinders, and some other interesting things, like whether it has a V-shaped or a straight engine, whether the transmission is automatic or manual, the number of gears, and so on. For a simple plot, you start building with data, aesthetics, scale, and geoms. Hopefully you can see the code; if not, feel free to check out the notebook. You mention the dataset, then the aesthetics (basically my x and y axes); geom_point tells us we're doing a scatter plot, and the last piece is a theme. So how do we go higher? We leverage color and size here for visualizing up to four dimensions. This is typically a three-dimensional plot where we use color as the third dimension, and we can see what's happening; you can build a nice inference here with regard to the number of gears, the weight, and the miles per gallon. Now, if you want to introduce the number of cylinders, you bring it in with the size aesthetic, size=cyl, as you can see, and we can visualize both the number of gears (as a factor) and the cylinders.
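As a rough sketch, that four-dimensional plot looks something like this with plotnine (the mtcars data ships with the library; treat this as an illustration under those assumptions, not the exact notebook code):

```python
# A minimal plotnine sketch of the layered grammar: data -> aesthetics -> geoms.
# 'factor(gear)' maps gears to discrete colors (3rd dimension),
# and size=cyl adds a fourth dimension via point size.
from plotnine import ggplot, aes, geom_point, labs, theme_bw
from plotnine.data import mtcars  # sample dataset bundled with plotnine

p = (ggplot(mtcars, aes(x='wt', y='mpg',
                        color='factor(gear)',   # 3rd dimension: color
                        size='cyl'))            # 4th dimension: size
     + geom_point()
     + labs(x='Weight', y='Miles per Gallon')
     + theme_bw())
print(p)  # renders the plot
```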
So as you can see here, where the number of cylinders is higher and the number of gears is lower, the weight is typically higher. You can build out insights like this. And if you want to visualize data, aesthetics, and statistics together, you can also specify what kind of statistical measure you want; in this case, I'm fitting a linear model over weight versus miles per gallon. With facets, you can visualize up to five or even six dimensions. In this case, we're visualizing four dimensions; the difference is that instead of using size, we put a facet around it. Basically subplots, that's the gist of it. So you have your two axes, your third dimension is the color, and the fourth dimension, the number of cylinders, is used as a facet. Now, if you want to move to 5D, what do you do? Bring the size aesthetic back in, because it's difficult to fake a notion of depth, which we'll come back to later with another dataset. So you have your facets here, where 0 or 1 is whether the transmission is automatic or manual; miles per gallon and weight are the axes; the size is based on the number of cylinders; and the factor, the number of gears, is the color. And you can see that with automatic transmission you typically have cars with a higher number of gears. So how do you go up to 6D? Visualize with multiple facets. On the x facet we put the transmission, and on the y facet we put the number of carburetors, I think. And you get some interesting insights, like a higher number of gears falling under automatic transmission, and the point sizes being large there because the number of cylinders is higher, while over here the number of cylinders is typically lower. So that was a quick introduction to the grammar of graphics. I'm not insisting you use this particular library; most of you probably use Matplotlib or Seaborn with Python, so I'll show an example around those shortly. A quick glance at the popular tools and frameworks: general purpose, Python, and R. These are the popular data visualization tools and frameworks, a combination of open source and commercial. You have D3.js, Tableau; again, it's very difficult to put everything on one sheet, so I picked some of the most popular ones, including some we use. Excel, definitely one of the oldest ones out there. You have some interesting JavaScript-based libraries, including FusionCharts, Highcharts, and Leaflet. There's Datawrapper, and Plotly, a pretty generic framework you can use almost anywhere, even with Python and R. And Kibana, another interesting one. How many of you have used the ELK stack: Elasticsearch, Logstash, Kibana? Excellent. We use Elasticsearch, Python, and Kibana; typically we use Logstash if it's very straightforward, and for more complex data enrichment we use Python. For the visualizations, as you all must know, it's often quite easy to build something in Kibana, but a lot of work still has to be done; it's pretty limiting in some areas, though they're making interesting advancements. How many of you have used Timelion in Kibana? Okay, nice. More features keep coming out, so that's another interesting tool. Python data visualization frameworks: pretty much Matplotlib, pandas, Seaborn, Bokeh, Pygal, Plotly. Any other interesting frameworks I may have missed?
Altair, yeah, I thought about adding it; I didn't get a nice logo for it, so good one. How many of you have used the Tidyverse in R? It's really amazing, right? An excellent collection of libraries. ggplot2 is definitely one of the core ones showcasing the grammar of graphics, by Hadley Wickham, so definitely a good one. There's Lattice, there's base R graphics. ggiraph is a nice one: it makes your ggplot2 plots interactive, so definitely check it out if you haven't. Then there are TauCharts and Plotly. So, visualizing structured data: another example, this time using the libraries we actually use day in and day out, Matplotlib and Seaborn. Start with the data, as always. What's this data? You can check it out in the repository too. It's the wine quality dataset, a pretty common dataset from the UCI machine learning repository. You have different physicochemical properties of wines, things like acidity and residual sugar; the type of wine, basically white and red wine samples; and the quality of the wine, where the higher the number, the better the quality. These are some examples; the complete description is in the notebook, so don't worry, you can check it out anytime. So what can you start by doing? You can always do some basic descriptive statistics. Hopefully the code is visible: I created two data frames, and I'm looking at some basic statistics. And you can already see interesting things, like residual sugar typically being lower in red wine compared to white wine. Hold on to that thought; we'll come back to it in some of the visualizations. But basically we start like this, right? Some basic descriptive stats, and then we move on to the visualization process. One thing you can do with a pandas DataFrame is just call .hist() on it and look at the numeric variables, which are shown as histograms here, and you get a quick univariate view of the data. The ways to visualize one-dimensional data are pretty standard techniques. If you have a continuous numeric attribute, you can do a histogram. Those are basic annotations, in case anyone is wondering; I'll mention this once and we'll go over it quickly for the next one. This sets the size of the figure in matplotlib; this is a suptitle, as it's called, so if you have multiple subplots you can have one overall title; these are spacings between the title and the plots; here I say I just want one plot, and then the x-label and y-label. I'm also showing the mean, by computing it and annotating it in the plot. And this is the gist of it: I'm drawing a histogram on this axis, on the sulphates, showing the distribution of sulphates. So a histogram is one thing you can do; you can also visualize a kernel density plot using Seaborn. There's kdeplot, which you can use to visualize the distribution. What do you do for categorical data? Bar charts, typically the most standard technique; use them to check the frequencies. Pie charts, maybe, if you have three or four categories; I wouldn't encourage them for more, because you can see how difficult they become to interpret. So, not that effective.
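To make these one-dimensional recipes concrete, here's a minimal sketch. It assumes a pandas DataFrame named `wines` with a numeric `sulphates` column and a categorical `wine_type` column; the names are assumptions based on the dataset description, not the notebook's exact code:

```python
# 1-D visualization sketches: histogram with an annotated mean,
# a kernel density estimate, and a bar chart for a categorical column.
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(14, 4))

# Histogram of a continuous attribute, with the mean marked
ax1.hist(wines['sulphates'], bins=30, color='steelblue', edgecolor='black')
ax1.axvline(wines['sulphates'].mean(), color='red', linestyle='--')
ax1.set(xlabel='Sulphates', ylabel='Frequency', title='Histogram')

# Kernel density estimate of the same attribute
sns.kdeplot(wines['sulphates'], ax=ax2)
ax2.set(title='Density plot')

# Bar chart (frequency counts) for a categorical attribute
sns.countplot(x='wine_type', data=wines, ax=ax3)
ax3.set(title='Bar chart')

plt.tight_layout()
plt.show()
```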
Going to 2D, the obvious place to start is visualizing feature correlations. Do a wines.corr(), get the correlation matrix, put it into a Seaborn heatmap, and you get this nice-looking visualization. Can anyone tell me why density and alcohol are negatively correlated? Yes: density is measured relative to water, and alcohol is less dense than water (it floats), so the higher the alcohol content, the lower the density. These are interesting things you can always check. The next part of visualizing 2D data is pretty standard techniques. If you have two numeric attributes, you can visualize them using a scatter plot; with Seaborn, you can do a jointplot. I've shown both kinds of examples, so you can see what you can do with matplotlib as well as Seaborn, and you can check for correlations directly if there are any; if there aren't, you'll see that too. And you can always do a pairwise plot. Not sure if the code is visible, but it's Seaborn's pairplot, where you specify the column names you want; if you don't specify any, it does it for every column. You can visualize correlations, check out what's happening, and so on. Visualizing two categorical attributes: in this case, the quality of the wines and the type of the wine, red or white. Here we've used a faceting technique with two subplots, and as you can see, it's a bit of a pain with matplotlib, not because of a huge number of lines of code, but because we have to extract and specify the x and y values ourselves. I'm sure a lot of you have faced this with bar charts in matplotlib. So is there a better alternative? Yes: one line of code, you get everything, and you can compare too. Use a countplot, specify quality on the x-axis, and use the hue; we're using the color aesthetic here, remember, from the grammar of graphics; the data is my wines DataFrame. And you can specify a palette, because I wanted to customize the colors; if you don't give a palette, it assigns defaults. Just one line of code, and you're done. Now, a mix of categorical and numeric attributes: if I have a continuous numeric attribute and a categorical attribute, like the type of wine, red or white, what can you do? Basically, histogram distributions, using a faceting technique. You can also do this with Seaborn, which I'll come to on the next slide. But as you can see here, this is the core code: we set up two subplots, where the first argument means one row and the two means two columns. If I had four plots, I'd do two by two, and then one, two, three, four; that's how you build subplots. And here is the main part: I'm plotting the sulphates for the red wine, and here I'm plotting the sulphates again for the white wine. So you have to separate out your data and do this twice. Is there a better way? Yes, there is: use a FacetGrid combined with a distribution plot. The FacetGrid lets us specify the hue, again from the grammar of graphics, saying the color should be based on the wine type. Then use distplot; it's basically a histogram, and I'm plotting it on sulphates. kde=False means I'm not doing kernel density estimation: I'm showing the histogram directly, not a density plot.
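A minimal sketch of that two-line version, under the same `wines` DataFrame assumption (the hue level names and palette colors here are illustrative guesses; note that `distplot` is deprecated in newer Seaborn, where `histplot` is the modern equivalent):

```python
# FacetGrid handles the hue (wine type); distplot draws the histograms.
import matplotlib.pyplot as plt
import seaborn as sns

g = sns.FacetGrid(wines, hue='wine_type',
                  palette={'red': '#ae1c27', 'white': '#fbc324'})
g.map(sns.distplot, 'sulphates', kde=False, bins=15)  # kde=False -> plain histogram
g.add_legend()
plt.show()
```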
So using just two lines of code, besides the annotations and the other stuff, you can do this too. What about a larger number of categories? Here we have just red and white, but if you have a lot of categories, you can visualize using a box plot or a violin plot, where your x-axis is the categories and your y-axis is the numeric variable. With a violin plot you also see the distribution; with a box plot you see the quartiles, the outliers, and so on. Now, visualizing 3D with three numeric attributes: is this effective? Not really, because, as I said, here we're faking the notion of depth. We're introducing a third dimension, alcohol, on the z-axis, but it's still rendered on a 2D screen, so it becomes difficult to interpret what's really happening. What you can do instead is bin one of the numeric attributes into a categorical attribute, because at the end of the day you're still stuck with a 2D screen, and we humans can't visualize that many dimensions together. In this case, that attribute is residual sugar. So what do these numbers mean? They're quantiles: the 0th, 25th, 50th, and 75th percentiles. Those are the ranges it's plotting. The colored one gives a bit more indication, as you can see, of where the residual sugar typically sits; higher residual sugar is pretty much here. But it's still not that interpretable, especially the left-hand plot. So you can introduce faceting here. Now, the problem is that if you use a FacetGrid with matplotlib's scatter directly, your kernel might crash, because matplotlib is not that intelligent about it. Interestingly, in the previous plot, Seaborn did this automatically: we just passed in the continuous numeric variable, residual sugar, and it found the quantiles on its own. That's where this handy pandas function, qcut, comes in; a lot of you might know it. You pass in the continuous variable, specify the list of quantile bins you want, and it bins the data for you, converting it into categorical bins (there's a quick sketch of this right after this section). Then you can feed that in, and as you can see, it gives us an interesting view. I actually used quantiles on both the alcohol levels and the residual sugar, just to show it. You can see here that where the residual sugar is higher, the alcohol levels, and the fixed acidity in general, are pretty low, and for the lower sugar bins it's more spread out. You can get insights like this, and even play around with other variables, plug them in, and see what's going on. Visualizing, I think, three categorical attributes: what can you do in this case? You can use a combination here, with the wine sample type, red or white; the quality of the wine (the higher, the better, or the pricier); and the alcohol levels. And as you can see, the higher the alcohol level, in general, the higher the wine quality.
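Here is that binning step as a quick sketch with pandas qcut (the `residual sugar` column name is assumed from the dataset description, and the bin labels are illustrative):

```python
# Turn the continuous 'residual sugar' column into quartile-based
# categorical bins, which can then be used as a facet or hue.
import pandas as pd

wines['sugar_quartile'] = pd.qcut(
    wines['residual sugar'],
    q=[0, 0.25, 0.5, 0.75, 1.0],          # quartile boundaries
    labels=['Q1', 'Q2', 'Q3', 'Q4'])       # readable bin labels

print(wines['sugar_quartile'].value_counts())  # how many rows landed in each bin
```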
Coming back to that plot: lower-quality wines have a much higher frequency at the lower alcohol levels, compared to the ones where the alcohol level is higher. So the gist is: the higher the alcohol level, the more expensive the wine. Visualizing mixed attributes, a combination of numeric and categorical data: if you look here, you pretty much have a scatter plot, and in the other case a density plot, and you can show one of the categorical attributes using the hue, as before, per the grammar of graphics. Some other examples here: using a box plot for more categories, with the alcohol level on the x-axis, you can see what we saw earlier, that the alcohol level consistently increases with wine quality, and the same with the wine quality class; and between the red and white wines it doesn't really differ that much. So, yeah. Visualizing four dimensions: some parts are interpretable, like the red wine's residual sugar, which we talked about earlier, being lower than the white wine's, but beyond that it becomes difficult. So use a combination of the different aesthetics from the grammar of graphics again. You have the residual sugar levels, and you can now clearly see that at higher residual sugar there are far fewer red wine samples than white; and for fixed acidity, the lower the residual sugar, the higher the fixed acidity, and the more red wines there are. So red wines have a higher fixed acidity in general. Some other examples: alcohol, volatile acidity, wine quality class, and also the type of the wine. As you can see from the concentration at the top, higher-quality wines have a higher alcohol level. So, yeah. Visualizing 5D: in this case, again, the notion of depth, color, and now size. It becomes more complex, as you can see, as we try to go higher and higher; here we used matplotlib. One way is obviously to start with faceting and sizes. The hue is already determined by the wine quality class; the size you see here is the total sulphur dioxide, so the higher the sulphur dioxide, the bigger the bubble. But it becomes pretty hard to interpret as the number of dimensions increases. Here we're using a FacetGrid with the same things again: the color is the quality label, the wine quality class, and the facets are the type of the wine, red and white. But the size becomes pretty hard to interpret. So what you can do is multi-faceting, instead of relying on the notion of size. What we do here, for the sulphur dioxide, is again create a binning for it, because that's what Seaborn does internally, and then use it as another facet. So in this FacetGrid, not sure if you can see this, but these panels are all for white wine, this row is for the red wine, and the sulphur dioxide levels increase across the quartiles. And as you can see, at the highest sulphur dioxide level, the red wine counts pretty much drop away. So red wines typically have a lower sulphur dioxide level, and for white wines it doesn't really vary that much.
But as you can see, the alcohol levels, as we saw, are consistently higher in that case. So you can get insights like this by using multi-faceting. Size becomes very difficult, and in the case of 6D, as you can see here, we pretty much introduce a notion of shape, which makes things even worse; it's not really interpretable at all. So what you can do is use color, size, and multi-facets, because that's what you're limited to in the end: you have an x and a y axis, and that's as far as you can break things down. So, mixed attributes with color, size, and multi-facets: our size is what we did with the total sulphur dioxide again; our other facets are the wine type and the quality of the wine, red wine and white wine; the color is another categorical attribute; and our two numeric attributes are residual sugar and, in this case, alcohol level. Again, as you see, the notion of size becomes more and more difficult to interpret, and that's always a challenge in general. Now for the fun stuff: visualizing unstructured data. I want to show that notebook towards the end, so let's see if we get enough time. How much time do we have? Okay, cool. So, visualizing unstructured data: what do we do with it? It becomes more difficult, right? Text data, image data, and audio data. Starting off with some basic exploratory analysis: this is the Bible corpus. You can get it from NLTK, from the corpora there, and these are the different lines of the corpus; some of the basic stuff we do. Check the lengths of the sentences: as you can see, the bulk of the distribution of sentence lengths sits between 60 and 70 characters. You can also split each sentence and look at the typical word count per sentence; that distribution shows sentences of around 13 to 15 words occur the most. Next, we can visualize language structure, which we can use later for other things like POS tagging, building custom POS taggers, or feature extraction; that's why we do this. The shallow parsing code is already in the notebooks, so don't worry about how to do it; all the code is there, you can check it out. Shallow parsing, also known as chunking, is where we extract the different phrases of a sentence. As you can see here, the typical form is NP, VP, NP, which means a noun phrase followed by a verb phrase and then another noun phrase; and inside each noun phrase you typically have things like a proper noun, and the rest of what are known as parts-of-speech tags. If you just search for POS tags, you'll get the full list and what each means. Constituency parsing is another kind of parsing, where you break a sentence into substructures, a hierarchical structure, as you can see in this case, and you can visualize the dependencies and what's happening within the sentence.
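Backing up a step, shallow parsing itself is only a few lines with NLTK. This is a toy sketch with a simple illustrative noun-phrase grammar, not the notebook's actual grammar:

```python
# Minimal shallow-parsing (chunking) sketch with NLTK.
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

sentence = "The brown fox is quick and he is jumping over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # POS-tag the tokens

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # optional determiner, adjectives, nouns
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)
print(tree)          # noun phrases show up as NP subtrees
# tree.draw()        # pops up a tree visualization if a display is available
```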
The next form of parsing, which we can use to visualize what's actually happening in a sentence, and which is very useful for things like finding dependencies and doing named entity recognition, is dependency parsing, where you find the main dependencies between the different entities in a sentence, or multiple sentences, and so on. Now, how do I visualize text data? There is so much unstructured data; if you follow traditional models like bag-of-words or TF-IDF, each feature is basically a word, right? So what do you do there? You can visualize things like the frequency of n-grams and build a word cloud or a bar chart or whatever, but word clouds are cool to look at and pretty much useless. So when you're building things with text data, what you can use is embeddings; as this morning's talk covered, they're so important. For visualizing dependencies and semantics, you can use embeddings. The code for this is also in the notebook. If you're interested in building your own word embeddings, you keep hearing about the skip-gram method; maybe you've used gensim's word2vec, but how does it really work? Unfortunately, we won't be able to cover that here, but check out the code: it's built in Keras, so you can use it and understand what's going on behind the scenes. Briefly: the continuous bag-of-words model takes all the context, the surrounding words, and has to guess the target word, the center word. Skip-gram is the opposite: you have a target word and you have to guess the context words. How it's implemented is slightly different, so do check it out. And what do we get from this? You feed in a corpus of data, and, as the name word2vec suggests, literally each word gets converted into a fixed-length vector, an embedding, and then you can do a lot of fun stuff, like the king, queen, man, woman arithmetic we heard about today. You can start visualizing entities that are semantically similar to each other. This example is from the Bible corpus, so you see some interesting things: famine, pestilence, and diseases coming close together; this is Noah, and these clustered around him are his sons; Jesus clustered together there, as you can see, and also things like God, Christ, gospel. So you can start finding and seeing which entities occur close to each other, and check out words that are semantically (not just syntactically) similar. But why do we need all this? From a machine learning perspective, one of the great things is that you can visualize documents based on similarity once you have a document vector; just imagine it as a clustering problem, without actually doing clustering. The full code is again in the notebook; we're just showing specific samples so you understand why we're doing this. You have some simple unstructured text documents, each with a basic category describing what it's about: some talk about food, some about animals, and so on. We put these through a word2vec model; in this case I used just 10 features for the length of the vector, so each word gets converted into a fixed 10-dimensional vector.
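A rough sketch of getting those 10-dimensional word vectors and projecting them to 2D for plotting. Note the talk's notebook builds skip-gram in Keras; gensim is used here only to keep the example short, and `tokenized_docs` is a toy stand-in for the real corpus:

```python
# Train tiny word vectors, then project them with t-SNE for a 2-D scatter plot.
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tokenized_docs = [['sky', 'is', 'blue'], ['dogs', 'love', 'food']]  # toy corpus
model = Word2Vec(tokenized_docs, vector_size=10, window=3,
                 min_count=1, sg=1)   # sg=1 selects the skip-gram method
                                      # (vector_size is `size` in gensim < 4)

words = list(model.wv.index_to_key)
vectors = model.wv[words]
coords = TSNE(n_components=2, perplexity=2,
              random_state=42).fit_transform(vectors)  # 10-D -> 2-D

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))    # label each point with its word
plt.show()
```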
So what can you do now? Convert the document into a vector, right? There is doc2vec, but what you can do in this case is take the word vectors and just average them; that's one strategy. So for each document you now have a nice vector. Then just take a PCA of these and visualize them. As you can see here, the documents related to animals sit close to each other, and the ones related to weather sit close to each other. It's exactly like actually doing clustering. So you can start visualizing relationships and connections with word embeddings; definitely very useful, and a lot of you must have used them, but these are ways to actually see what's happening with word embeddings. Image data: those of you who work with computer vision use cases must be using this. Visualizing image data is definitely a bit more difficult, because while a grayscale image is fine with two dimensions, otherwise it becomes an n-dimensional tensor. A color image has three channels, so an image of 200 × 200 pixels becomes 3 × 200 × 200: three channels, red, green, and blue, and each channel is basically just a 2D matrix at the end of the day. So you have some image data, you can just load it up to see what's there, and then you can separate out the different channels, with a technique like this, to see how much the red, the blue, and the green are contributing. And what are some techniques for visualizing image data from which we can extract features? We can start with traditional image processing techniques; those who work in computer vision will know these maybe even better than I do. You can start by checking the image intensity distribution: just plot a histogram of the overall pixel intensities. Then, edge detection: find the edges in the images and extract those as features, which can go further into your model. Again, we're not visualizing just for the sake of it; we want to actually use these features in the end. A more interesting one is HOG, the Histogram of Oriented Gradients. scikit-image is a nice open source library you can use; it has a HOG implementation. And what does this give us? The image is cool to look at, but what is it? It counts the occurrences of gradient orientations in specific localized portions of the image, and you get this nice feature descriptor: a flattened feature vector for each image, which you can feed into a classifier or a model, and so on. But things have definitely changed, because you can now visualize image data using convolutional neural networks; there was a talk this morning about the importance of deep learning. If you're building a facial recognition system, you have a layered hierarchy like this. Ignoring the dense, fully connected layer here: in the lower layers the network learns localized representations, like an eyebrow, an edge, or an eye, and as you go higher up, the feature representations become more and more complex, until, as you can rightly see, it pretty much recognizes the full face.
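Before moving on, here's what that HOG extraction looks like with scikit-image, as a sketch on a built-in sample image rather than the notebook's data:

```python
# HOG descriptor with scikit-image: visualize=True also returns a renderable
# HOG image alongside the flattened feature vector that would feed a model.
import matplotlib.pyplot as plt
from skimage import data
from skimage.feature import hog

image = data.camera()                 # built-in grayscale sample image
features, hog_image = hog(image,
                          orientations=9,
                          pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2),
                          visualize=True)

print(features.shape)                 # the flattened feature vector
plt.imshow(hog_image, cmap='gray')    # the visual HOG representation
plt.show()
```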
So feature representations like these are great because, as you saw even in the Spark talk, transfer learning matters: you take a pre-trained model trained on the ImageNet corpus, with its thousand categories, feed a new image to it, and you can even check what's happening at each level of the network. Visualize it, because if there are misclassifications, you always need to go back and check what's happening in the network; that's the most important thing. For the next part, I have an interesting thing we did recently. Can you see what you hear? That's the real question, right? How do you visualize audio data? You hear music and all kinds of sounds day in and day out. This is the urban sound dataset, as you can see, and we visualized the data. How do you visualize audio data? The first thing you can do is build a waveform or amplitude plot, where you can see the amplitude of the different sounds. The UrbanSound8K dataset is roughly 8,000 labeled samples across 10 classes, like street music, an engine idling, a gunshot, a siren, and so on, and the problem is: can you build an audio classifier at the end of the day? As you can see, for the gunshot there's a huge spike right at the beginning and then it goes quiet, so from the shape of the signal you can understand what's really happening. So how do you visualize audio data? The waveform is one way, but the most effective thing is to use a spectrogram; if you check the definition, a spectrogram is literally a visual representation of audio. So using spectrograms, you pretty much have an image now, right? Think of the possibilities: you can use any pre-trained image model and start classifying sound, and that's exactly what we did. We used spectrograms based on the harmonic and the percussive components, as you can see here with this code. The library is librosa, if anyone is interested, an open source Python library, and this code is extensible; you can use it on any audio samples. And what was the use case? Audio classification. You have this audio, shown here like a player; it plays the clip, it's just a representation. You take a sub-sample of the data, get a spectrogram, and do some feature engineering to build something like a three-dimensional image. Using this, you get something you can think of like an RGB image, obtained from actual audio data, and now you can just leverage a pre-trained model: use a VGG, an Inception, a ResNet, and train your model on this data. These are the results, if you see, for the 10 classes, trained with VGG-16: overall accuracy close to 90% is what we got. I trained it for around 50 epochs, mostly. Let's quickly check it out, and then I'll go to the final words. So this is the sound data, as you can see; I'll zoom in for you. Is it visible? We get the sound data, we use librosa to sample it, and then this is exactly what I showed you, the spectrograms and so on, and then we build this feature engineering model.
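A minimal sketch of that spectrogram step with librosa; the file name is a placeholder and the parameters are illustrative rather than the notebook's exact settings:

```python
# Load an audio clip, split harmonic/percussive components, and render
# a log-scaled (dB) spectrogram.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load('audio_sample.wav')            # load and resample the clip
y_harmonic, y_percussive = librosa.effects.hpss(y)  # harmonic/percussive split

S = np.abs(librosa.stft(y_harmonic))                # magnitude spectrogram
log_S = librosa.amplitude_to_db(S, ref=np.max)      # convert to dB scale

librosa.display.specshow(log_S, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.show()
```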
That feature engineering step implements exactly the diagram I showed you, building that RGB-like image (you can think of it as RGB; it's not literally an RGB image), and we get these nice 64 × 64 × 3 feature sets. Once we have those, we leverage VGG-16 as a feature extractor; we're not doing any fine-tuning here, so maybe that could improve the performance even more, I didn't test it. We extract the features, just the bottleneck features: you push the image through, take out the features, and that's your input data; fit a dense model on top at the end of the day, and then we predict. So that's the training; let me go there quickly. These are the bottleneck features I stored, which come out of the VGG model. What do I do next? I put four dense layers on top, as you can see; that's the architecture. The features from VGG go into this dense network, and we train the model for 50 epochs. The code is there, you can check it out and even reuse it. It will be coming out in my new book around transfer learning, so you can check it out; the code will be on GitHub. I haven't put it up yet, so you folks are the first to see it. As you can see here, it runs for 50 epochs, exactly as described. And how does it do on the test dataset? That's the main thing, right? That earlier number was just validation. On the test dataset we get similar scores, as you can see: 89% overall, considering precision, recall, and the F1 score. And this is the confusion matrix; the diagonal is what you want, and you can see that street music gets confused with children playing, because both are outdoors. You can understand the confusion, right? So you can start understanding what's going on behind the scenes. This is an interesting case study where you converted audio data into images, which is exactly what the first talk today was about: transferring knowledge from one domain to another. These are the ways you don't just build the model blindly: you visualize the data, check the amplitudes and the spectrograms, and then build the model. So, quickly, the final words; I'll try to wrap up in one or at most two minutes, I know I'm running out of time. What if you have a lot of features? Use domain expertise and modeling to get the most important features: feature selection and dimensionality reduction techniques. We talked about PCA, SVD, and t-SNE (t-distributed stochastic neighbor embedding); that's what we used to visualize the word embeddings, remember the Bible ones. Always remember to scale the features and remove outliers as needed, because that's very important. And if people keep pestering you about how you're going to visualize so many features, just show them this slide and that should take care of everything: if you want to visualize 17 dimensions, visualize three and say 17 over and over again; that's how everyone does it. Geoffrey Hinton said that. Some promotions towards the end: in our work, we use the Intel Distribution for Python. Many of you might already be using parts of it, since the Intel Math Kernel Library comes by default with Anaconda now. It's pretty much free to use; just run conda config --add channels intel. Anyway, the details are in the slides. It typically runs faster on Intel architecture systems, if that's what you're using. And last but not least, some references you can check out.
These are some of my articles where similar code is available, which I've reused today. The paper listed there is on the grammar of graphics, and the code is already there, which you can get from here. And if you're interested in collaborating, research, or writing, reach out to me. If you're interested in writing for Towards Data Science, or even collaborating with me, my LinkedIn is here, and my GitHub is here. So that concludes the talk; open for questions. Oh, we have time? Okay, you can reach out to me, I'm here. So, yeah, it's okay, I'll be around, so feel free to reach out.