Hello everyone, I am Akshay Burkett, a software engineer at Red Hat working on Red Hat Insights. This is my first talk at this conference, and the first time I am presenting in front of such a large audience, so I will start with what I have and try to explain it as simply as I can.

This talk is about exploratory data analysis and the techniques we usually follow in the data analysis part of a project. Today's world is all about AI and machine learning; everyone wants the kind of magic we see in generative language models like ChatGPT and Bard, which are hugely popular. But everything starts with the data: data collection, then processing, then training the models. That is the whole journey, and I am going to explain the part of the journey related to data, because if you want the output, you need data as the input. We will look at how that data can be processed and managed so that models can adopt it and quickly perform the operations we want them to do.

Here is the agenda for today's session. We will first introduce EDA, then look at what differentiates EDA from data analysis. Then we will go through techniques such as data cleaning and visualization, along with some libraries. I will mostly relate this to Python, which is the most popular language in data science. After that we will touch on decision making and business strategies, and then open up for Q&A. If there are any questions along the way, please ask at any time.

This is the simple path: raw data, exploratory data analysis in between, and the output. Before the exploratory data analysis, let's understand the raw data. Raw data comes in many forms: text, images, anything like that. Large language models, for example, consume data in the form of books, journals, and articles. We mostly try to gather data into CSV files or text files, but whenever data is generated it is not in the clean form that models can adopt. Huge amounts of data are being generated nowadays, because we all use the internet and every click generates thousands of lines of data; there is no shortage of data in today's world. Fifteen or twenty years back, collecting data was the big challenge, but not today. Now we have huge amounts of data, but it is unstructured, and we need it in a proper format so that we can understand patterns from it, run the analysis, and improve our decision making. That is where data analysis comes in. Exploratory data analysis is the critical process of performing an initial investigation of the data to discover patterns, spot anomalies, form hypotheses, and check assumptions.
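To make that first step concrete, here is a minimal sketch in Python with pandas; the file name and the columns it would contain are purely hypothetical, just to show what the initial look at raw CSV data might be like.

```python
import pandas as pd

# Load raw data collected as a CSV file (the file name and its columns are hypothetical)
df = pd.read_csv("survey_raw.csv")

# Take a first look at the data before any cleaning
df.info()             # column types and non-null counts
print(df.head())      # first few rows
print(df.describe())  # summary statistics for the numeric columns
```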
It is good practice to understand the data first and to gather as many insights from it as possible. EDA is an approach to extracting information from unstructured data in order to summarize the main characteristics of that data. Many people underestimate the importance of data preparation and data exploration, but it matters: if we have well-defined, structured data, the projects that use that data to train their models will get their output easily and with very little time spent. The output of exploratory data analysis is the analysis itself: we extract patterns in different ways and provide clean, enriched data, meaning it is fully cleaned and statistically imputed.

The next slide compares EDA and data analysis, because the difference between the two can be confusing, and I think this table will help. The goal of data analysis is to answer questions, make predictions, and gain a better understanding. The goal of EDA is simply to gain insights into the meaning of the data and to collect patterns from it; there is nothing about model training or machine learning. In terms of timing, EDA happens early in the data science process, as the initial step I mentioned, and data analysis comes after it. You could say EDA is a part of data analysis: data analysis is the big tree and EDA is one branch of it. In terms of formality, EDA is less formal and data analysis is more formal. The methods are largely the same; all the methods used in EDA already exist in data analysis. But in EDA the focus is very much on visualization. Many times there is no need to go ahead with machine learning at all: we can get the answer just by analyzing the data, and the visualization solves our problem. Visualization is important, and later on I will explain the types of visualization as well. As everyone knows, the human mind captures a visualization before reading or hearing something, so this step alone solves most of the problem. Data analysis then adds statistics, modeling, and machine learning: training, testing, checking the output against our data, and putting in the effort to calculate how accurately the model performed. That is everything that differentiates data analysis and EDA. Are there any questions so far?

Going ahead, we will first look at data cleaning techniques. There are several techniques, but today I want to highlight a few that are missed many times
but are quite important at that stage. The first one is handling missing data. To give an example, say we have a bunch of data collected in a spreadsheet or CSV file, and some values are missing in a particular column and row. When we analyze that data, those gaps will stop us from getting the expected accuracy. So how do we handle this? The first step is to identify the missing values. In Python there are functions like isnull() or isna() to identify the missing values in the data; we use the pandas library, which is very popular in this area. We create a DataFrame and run these functions against it to check whether any values are missing in a particular column or row. Once we have found missing data, we can either delete the affected rows or fill in the values. Deleting is the simplest method: the dropna() function will simply drop every row or column containing missing data, and it is that simple in Python. The next option is imputation, which involves replacing the missing data. As I said, a few columns and rows have missing values, so we can fill them with the mean, sometimes the median, or another statistic; it all depends on the data we are working with. Modeling can also help here: there are models that handle missing data very well, so we do not always need to deal with it in such a raw way.

The next technique is dealing with outliers. The first step is to identify the outliers, and it is a very important step. Suppose we want to process data for people in the 10-to-20 age group, but the data also contains ages of 60 and 70. Those are outliers: they are not correlated with what we expect, and that matters, because if we take the mean of the 10-to-20 age group, a single 60 or 70 will break the accuracy very quickly. To identify outliers we can use methods like the z-score, or visualize the data using box plots or scatter plots; the scatter plot is very popular for this, because we can see straight away which points lie outside our predicted curve. Then comes evaluating the outliers: once we have identified them, we determine whether they are valid data points. Sometimes errors happen while generating or collecting the data, so we either remove the outlier or figure out why the value is wrong. Imputation applies to outliers as well: we can remove them from the dataset, or leave a gap in the data that can then be filled like a missing value.
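Putting those steps together, here is a small pandas sketch, assuming a hypothetical age column like the 10-to-20 example above; the values and the z-score cut-off are illustrative only, not from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-to-20 age-group data with missing values and one outlier
df = pd.DataFrame({"age": [12, 15, None, 18, 14, 65, None, 16]})

# Step 1: identify missing values with isnull() / isna()
print(df["age"].isnull().sum())          # -> 2 missing entries

# Step 2a: simplest option, drop the rows that contain missing data
dropped = df.dropna()

# Step 2b: or impute, e.g. fill the gaps with the column mean (median works the same way)
df["age"] = df["age"].fillna(df["age"].mean())

# Step 3: flag outliers with a z-score; 3 is a common threshold, but on a
# tiny sample like this a lower cut-off such as 2 makes the idea visible
z_scores = (df["age"] - df["age"].mean()) / df["age"].std()
print(df[np.abs(z_scores) > 2])          # the 65-year-old row stands out
```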
One more way to deal with outliers is to transform the data: we can replace the outlying value with the average, or with whatever correlated value we expect in that situation. That covers dealing with outliers.

Moving ahead, we have a few more data cleaning techniques. The next one is data standardization and normalization. Standardization rescales the data into a standard form, typically expressed in terms of the mean and standard deviation, because we need the data within a particular range. Even something simple like a true/false field should be in a standard format: either true or false, and nothing else. Tools like scikit-learn provide a StandardScaler that simplifies this process; scikit-learn is a very popular machine learning library, and it helps with the data cleaning and preprocessing part as well. The next one is normalization. This technique works with a min-max approach to bring values into a specific range. Suppose there is data ranging from zero up to millions; it is not practical to work with numbers far outside the range we can easily read, so we normalize to a 0-to-1 scale, or to whatever specific range we choose. The data becomes easily readable, and it changes nothing about how the model processes it; we are only changing the range of the values. There is a MinMaxScaler in scikit-learn that helps normalize the data. If the data is on a very large scale, or contains decimal points and fractions that are hard to work with, normalizing to a common scale is the answer; the 0-to-1 scale is very popular in data analysis, and there are mathematical results showing that 0-to-1 normalization works well with the kind of analysis we are doing.

The next one is handling duplicate data. So far we have seen handling missing data, handling outliers, and standardization and normalization, but there is one more thing: handling duplicates. While collecting data, multiple entries can end up duplicated. If we have checks in place during collection that is fine, but duplicate data will not help us improve accuracy or get useful output. So how do we handle it? We check for all duplicate records with the duplicated() function against the pandas DataFrame, and all the duplicated rows will be identified. The standard process is then simply to remove them; there is nothing like standardization or normalization to apply here, because duplicate data just needs to be removed to get things done. Removing the duplicates is also simple, with the drop_duplicates() function of the pandas library applied to the DataFrame.
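As a rough sketch of these techniques with scikit-learn and pandas, assuming a hypothetical age and income dataset with one duplicated row:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: a small-range column, a huge-range column, one duplicated row
df = pd.DataFrame({
    "age":    [12, 15, 18, 14, 16, 16],
    "income": [1_000, 50_000, 2_000_000, 75_000, 30_000, 30_000],
})

# Standardization: rescale each column to zero mean and unit standard deviation
standardized = StandardScaler().fit_transform(df[["age", "income"]])
df["age_std"], df["income_std"] = standardized[:, 0], standardized[:, 1]

# Normalization: min-max scaling of each column into the popular 0-to-1 range
normalized = MinMaxScaler().fit_transform(df[["age", "income"]])
df["age_norm"], df["income_norm"] = normalized[:, 0], normalized[:, 1]

# Handling duplicate data: flag the duplicated row, then drop it
print(df.duplicated().sum())   # -> 1 duplicate row
df = df.drop_duplicates()
```

Note that the two scalers only change the scale of the values, not the relationships between them, which is exactly the point above: the model's processing is unaffected, only the range changes.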
It really is that simple. I am specifically talking about Python here; there are lots of other tools and plenty of software that help with data cleaning, and there are models that help in this area too. But if we understand how it basically works from scratch, the next steps will definitely be easier to understand. These four techniques are the basic ones, but they are the important ones in the data cleaning process. There are a few more data cleaning techniques, such as maintaining data integrity, encoding categorical variables, and data integration. Documenting the data cleaning steps is one more important thing I would call out: in general we should try to document what we are doing, so that whoever comes next can adopt those things quickly. Whenever we perform steps on a dataset, documenting them is a good practice. And then there is iterating the data cleaning: we may need to repeat these steps multiple times, because it is not guaranteed that the data comes out clean in one pass, so we apply the various techniques with a full understanding of the data.

Moving ahead, I would like to show some data visualization techniques; I will explain a few of them that are important at this stage and very easy to understand as well. The first one, which I already mentioned, is the scatter plot. Scatter plots are mainly used to detect outliers. In this example the graph shows cost against weight: the points look fairly concentrated, and if we draw a line through them they sit close to it, so there do not appear to be any outliers. We can easily conclude that the data is good to go, and that linear models will work well for this kind of data; if we are using regression techniques, linear regression would be the one to pick. The scatter plot is widely used to show the relationship between two continuous variables. If some point sits far away from the rest, it is an outlier, and we need to apply the outlier techniques to it.

The next one is the bar chart. Everyone has used bar charts many times, in Excel or in a spreadsheet. I created a small children dataset here, representing preschool, primary school, and secondary school, where the colours show how many children fall into each category. It is simple enough that I do not need to explain much more; that is the magic of visualization.

Moving on to the histogram: it is a graphical representation of the distribution of a continuous variable. The data is distributed into bins, and the frequency or count of values falling within each bin is displayed. This helps show the shape, central tendency, and spread of the data, and it is useful for identifying skewness as well as outliers. The histogram is one of the most popular visualization techniques, and it is very helpful here.
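Here is a minimal matplotlib sketch of those three plot types; the cost/weight values and the schooling counts are made up for illustration, not taken from the slides.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical cost/weight data with one point far off the linear trend
weight = rng.uniform(1, 10, 50)
cost = 3 * weight + rng.normal(0, 2, 50)
weight = np.append(weight, 9.0)
cost = np.append(cost, 90.0)                     # the outlier

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Scatter plot: relationship between two continuous variables; outliers stand out
axes[0].scatter(weight, cost)
axes[0].set(title="Cost vs weight", xlabel="weight", ylabel="cost")

# Bar chart: count per category (hypothetical schooling data)
axes[1].bar(["Preschool", "Primary", "Secondary"], [35, 50, 40])
axes[1].set(title="Children per school level", ylabel="children")

# Histogram: distribution of a continuous variable across bins
axes[2].hist(cost, bins=10)
axes[2].set(title="Distribution of cost", xlabel="cost", ylabel="frequency")

plt.tight_layout()
plt.show()
```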
Going further, I am only explaining a few of the techniques here; there are lots of visualization techniques nowadays, and with tools like Tableau and Power BI it is very easy to produce them, but we still need to know which techniques exist and which one suits which data. The first of these is the line plot, which helps visualize trends over time or over a continuous sequence; time-sensitive data is what we plot with line plots. The next one is the box plot. I have not put the data here, so I will just explain it: it provides a summary of the distribution of the data, displaying the minimum, the first quartile, the median, the third quartile, the maximum, and any outliers. The box plot helps show the spread, skewness, and central tendency of the data, along with its outliers.

Then there are heat maps, which are also popular. As an example, weather data can be visualized using a heat map with different colour shades. A heat map is a colour-coded visualization of the density or magnitude of a variable across multiple categories and dimensions. Heat maps are effective for showing correlations, patterns, and clusters in large datasets, and they are commonly used for analyzing data matrices and multivariate data. There are different kinds of data: univariate, bivariate, and multivariate. Univariate is simple single-variable data, while multivariate data contains multiple variables, and these kinds of data are easily visualized using heat maps.

Going ahead with pie charts: everyone knows about pie charts as well. They show the proportion or percentage of the different categories within a dataset; if we want to visualize the composition of the whole and the relative sizes of the different categories, pie charts do that, but they are limited to those numeric proportions. Then there are geographical maps: as the name says, we can visualize geographical data with them. They show distributions across regions and locations, city-wise or country-wise, and we can highlight patterns specific to whatever data we are working with, such as population. Finally there are interactive visualizations. As I said, there are tools like Power BI, and there are libraries: matplotlib is the simple one, while Plotly provides interactive visualizations, the next level beyond all of these techniques, giving a much more detailed representation of the data. All of these visualization techniques facilitate a better understanding of the dataset, and as I said, visualization is the most important thing; it shows the data in many colours.
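Before moving on to the libraries, here is a small sketch of the heat map idea with seaborn; the cities, months, and temperature values are entirely hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical multivariate weather data: average monthly temperature per city
rng = np.random.default_rng(7)
temps = pd.DataFrame(
    rng.uniform(5, 40, size=(4, 12)).round(1),
    index=["City A", "City B", "City C", "City D"],
    columns=["Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
)

# Heat map: colour shade encodes the magnitude of the variable across
# two categorical dimensions (city and month)
sns.heatmap(temps, annot=True, fmt=".0f", cmap="coolwarm")
plt.title("Average monthly temperature (hypothetical data)")
plt.tight_layout()
plt.show()
```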
Going ahead with the libraries mostly used for EDA: pandas is the first and foremost, the most popular one. We collect data from multiple sources into pandas, and pandas stores it in a DataFrame. It is easy to process the data with the rich function set of the pandas library, which helps with the cleaning and much more, because it has DataFrames and Series; it is used mainly in the preprocessing part and the initial data work in the EDA process. NumPy is the mathematical computation library in Python; it provides data structures, mathematical functions, and operations that we can perform on arrays, matrices, and pandas DataFrames. Matplotlib is the visualization library in Python, one of the most popular ones, and it has a variety of basic plots like line plots, scatter plots, bar plots, histograms, and many more, so visualizing data distributions, relationships, and patterns during EDA is easy with the matplotlib library. Then there is seaborn, a high-level data visualization library, Plotly is the interactive one, and scikit-learn is the machine learning library that also helps preprocess the data.

Finally, here are a few of the decision-making and business-strategy outcomes, like identifying data-driven opportunities. Since we still have time, we will just move to the Q&A session, so let me quickly read through the decision-making and strategy benefits we can gather from EDA: validating assumptions, mitigating risk, optimizing resource allocation, improving customer understanding, enhancing product development, optimizing operations and processes, and monitoring key performance. We get all of these things collectively from the whole EDA process. So, moving on to the Q&A session: any questions from anyone? No questions? No questions online as well? All right, thank you so much.