If you know me, you know that I'm a junkie when it comes to data visualization. It's really lovely for me just to see those plots; it brings out the knowledge that's hidden away in the data. Now, in Julia we're spoiled, because we have many packages when it comes to data visualization. Gadfly comes to mind, which produces wonderful plots, and Plots itself has many backends that we can use. But today I want to introduce you to VegaLite, certainly one of my favorite data visualization packages. I'm going to open a notebook in nteract and show you just how easy it is to create beautiful plots in VegaLite.

Here we are in our nteract notebook, Learn data visualization using VegaLite. Plotting is one of the most useful ways that I find of summarizing data. I just love that visualization; it really brings out that first bit of information from your data, and as I've written here, the adage that a picture is worth a thousand words really comes to mind.

So Vega-Lite is a graphics grammar based on JavaScript Object Notation, JSON, and many of you will be familiar with JSON. This is a programming-language-agnostic format for interchanging data. As such, Vega-Lite specifications are written in JSON, which is a subset of JavaScript syntax, and Vega-Lite allows for rich interactive plotting. Vega-Lite itself, of course, is built on top of Vega, and Vega is a declarative language that generates visualizations for the web; you see there it can use canvas or SVG, the scalable vector graphics format. The VegaLite.jl package, then, is the Julia representation of the Vega-Lite grammar.

It really is ideal for visualizing tabular data, so we've got to have this data in tidy format, or long form. That means all your variables are in the columns of your flat file and each row is a subject in your data. As long as your data is in tidy form or long form, it really is going to work well. So the VegaLite.jl package can work with many data structures.
So we're going to start off by just using Julia arrays, a type of Julia collection, but it works really well with DataFrame objects, and we're going to import the DataFrames package after we've generated some data. You can also use it with the CSV package: remember, if we import a spreadsheet file, a CSV file, using the CSV.jl package, we can read it into a DataFrame object anyway. There's also the VegaDatasets.jl package, a Julia representation of the Vega datasets. Vega-Lite itself comes with datasets, so you can play around with some of those that you just import with the VegaDatasets.jl package, but we're going to simulate our own data, because this is Julia.

So what I'm going to do here is just import the packages that I'm going to use. You see I'm going to use VegaLite. I am going to import VegaDatasets, because I'm going to show you the first plot just to give you an idea of what VegaLite plots look like, but thereafter we'll use our own data. So let's run that cell. With nteract it's the same as with the Jupyter Notebook: I'm just going to hold down Shift and hit Return here on my Mac, and that's going to execute the code. Of course it's got to precompile now, and it'll take a second or two to run.

Now this file of mine lives inside of its own Julia environment. I do have a video on creating Julia environments, and it's very important to create different environments for all your projects. Don't just install packages in your base or generic Julia environment.

Now we're also going to import Random, Distributions, and DataFrames. Random because we're going to generate some random values, Distributions because I'm going to take random values from a normal distribution, and then DataFrames, which, as I said, is one of the best data structures to use with VegaLite.
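The two import cells described above might look like the following sketch. It assumes the packages are already installed in the active project environment; the commented-out Pkg lines are an assumption about one way to set that up and are not shown in the video.

```julia
# Optional, and an assumption: activate a project-specific environment first.
# using Pkg
# Pkg.activate(".")
# Pkg.add(["VegaLite", "VegaDatasets", "Distributions", "DataFrames"])

using VegaLite       # provides the @vlplot macro
using VegaDatasets   # example datasets such as "population"

using Random         # seeding the pseudo-random number generator
using Distributions  # Normal and other distributions
using DataFrames     # tabular data structure
```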
So let me just show you what a VegaLite plot looks like. There are a couple of lines of code, and a couple of things that you're going to note. The dataset is the population dataset that I'm using from VegaDatasets, and then you see this little operator there. On my keyboard, a standard 16-inch MacBook Pro keyboard I should say, that's Shift and backslash and then the greater-than symbol, and that's the pipe operator, |>, because I'm piping the data into the following thing. The following thing is the @vlplot macro. Remember, macros in Julia generate code (fantastic to use, I think), but see it at the moment just as a function: I'm piping the data as the first argument into this function. That's what the pipe operator is going to do for me.

Then we're going to see some arguments. The first argument I'm going to use is background equals light gray, and then I'm doing a mark, and you can see the set of curly braces. At first glance it's going to look like Julia dictionaries, but it is not Julia dictionaries. It's just part of the syntax of how the @vlplot macro converts the code that we're going to write here into normal JSON. In actual fact there's also a vl string macro with which you can write plain, garden-variety JSON to create these plots, but we're going to stick to the common use case here, which is the @vlplot macro.

So there's a mark, and it seems to be boxplot, but I'm using symbol notation, so it's :boxplot. There's an extent equals 1.5 and an opacity, there seems to be a title, and there's an x-axis and a y-axis; the y-axis seems to have a title, etc. We're going to go through all of these things. Of course the first time you run a plot in Julia it's going to take a second or two, it shouldn't take too long, and then we're going to see our first plot. Remember, I've specified a little light gray background here.
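A cell like the one just described might be sketched as follows. The column names age and people come from the population dataset; the opacity value and the plot title are not stated in the video and are placeholders.

```julia
using VegaLite, VegaDatasets

# Pipe the population dataset into @vlplot as its data.
# Assumptions: the opacity value and the title text are placeholders.
p = dataset("population") |>
    @vlplot(
        background = "lightgray",
        mark = {:boxplot, extent = 1.5, opacity = 0.8},
        title = "Population by age",
        x = "age:o",    # the :o suffix marks age as ordinal
        y = {:people, axis = {title = "Population count"}}
    )
```

In a notebook, leaving p as the last expression of the cell displays the plot.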
There we see it: a lovely box-and-whisker plot. You see Population count here as the title that we stipulated there for the y-axis. We see age here, and we see these values, 0, 5, 10; they're sort of on their side, at a minus 90 degree angle, and that is because age is seen here as an ordinal data type and not as continuous numerical. It's ordinal because of that colon o there after age. Anyway, we're going to learn all about that. You see these statistical outliers here for the population count at the older ages, and that is because of this extent equals 1.5. Great stuff. That is what these plots look like. They're very easy to create, it's quite a nice plotting package, and I'm using it more and more.

So let's start with the basics of a VegaLite plot. Other than the data that we're going to have to use, at a bare minimum Vega-Lite requires two things: an encoding and a mark. There's also transform, an optional third thing, and that's where we want to transform the data. That's just the statistical manipulation of data: for instance there's an aggregate transform, and one of the aggregates is a mean or median or standard deviation, so you manipulate the data first before plotting. Usually, though, when creating plots we're not interested in that transform, so what we really need is just the two things, a mark and an encoding.

So let's generate some of our own data and then I'll tell you all about marks and encodings. I'm using a pseudo-random number generator seed, so I'm using Random.seed!, that is, seed bang; remember, that's our function, and I'm just using 12. If you use 12 as well, of course, we're going to generate the same pseudo-random numbers. I'm creating two computer variables, independent and dependent. My independent computer variable holds a range object, so I'm saying from 0 to 9.9 in steps of 0.1.
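As a small sketch of the seeding step and the independent variable, using only the standard library: note that 0:0.1:9.9 is a stepped range (with a step other than 1), not a unit range, and it holds 100 values without materializing them.

```julia
using Random

Random.seed!(12)          # fix the seed so the pseudo-random numbers are reproducible

independent = 0:0.1:9.9   # from 0 to 9.9 in steps of 0.1

println(typeof(independent))   # a stepped range type, not a Vector
println(length(independent))   # number of values in the range
```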
For each of those, I'm just going to add a bit of random noise, and that comes from a standard normal distribution. So it's independent dot plus; remember, the dot plus gives us an element-wise operation. I'm creating as many values in my dependent as there are in my independent, because I'm just using the length of independent, so they're the same length.

So let's put this together. The mark is what we are trying to plot, the type of plot. So this is @vlplot, and what we want to plot is the mark, and that is point, :point, so symbol notation. That implies that we want a scatter plot, because we just have these points. Then the encoding is the actual data that we need to encode for the plotting: the encoding is what goes on the x-axis and what goes on the y-axis, and we're just stipulating our two list objects there, independent and dependent. If we plot that we see a beautiful scatter plot, just like that. The mark was point, which is scatter, and we see the x and y values, each pair of values, for each of those dots. So we have just a nice little scatter plot.

Now, we need not only encode the x- and the y-axis. We can actually split the data according to the sample space elements of some categorical variable. So let's create a categorical variable. I'm going to call it group, again just from the rand function: I pass a list to it with two strings, A and B, and I want a hundred values, please. Now I've got my third encoding, so I've got x, y, and now color, which has nothing to do with an actual color such as pink, orange, red, or blue; it's just splitting my encoding up according to the sample space elements in that categorical variable. If I do that, we still have our dependent and independent variables, but split up by whether it's A or B, and we see the legend on the right-hand side, A and B. So those are the three encodings that I can do: x, y, and color. Now, instead of the single plot, I can plot A and
B, those two sets of values, separately. Instead of using color, there's a sneaky little fourth type of encoding that we can use, and that's the column encoding. If we run this, I'm going to see two plots, one for A and one for B. So instead of color we can use column as well.

Now that's very nice if my third variable is categorical, but what if it's continuous numerical? Let's create another computer variable, and I'm going to call it scale. It just takes random values on the interval of 10 to 20, and I want 100 of those. Now my color is going to be scale, but scale is a numerical variable; it is no longer categorical. So what is going to happen here? Well, very nicely, because it's a numerical variable, we're going to get this scale here, so it just goes from light to dark, and that allows us to take a scatter plot and visualize three numerical variables in one go. The darker these little marks, the higher, of course, the value for that third numerical variable. So that's brilliant.

Now we're going to get into what is very common, and that is when we encode a categorical variable by numerical values. We might have a questionnaire where someone disagrees, neither agrees nor disagrees, or agrees, whatever the case might be, and we just encode that in the data capture as one, two, and three. That means these are categorical variables: even though they look like numbers, they are not numbers, and we've got to tell Vega-Lite that we want them interpreted, for instance, as ordinal data or nominal data. So let's create this new computer variable, grade, drawn from a uniform random distribution over these three values: just pick a hundred values from one, two, and three, and they each have an equal likelihood of being chosen. There we go.

Now, because we've got so many computer variables, let's just build them all into a data frame. I'm going to call my data frame df, and remember, that's the notation for
the data frame that we usually use. When we create the data frame, these are the column headers: Independent, Dependent, Group, Scale, and Grade, capitalized. I use uppercase just to distinguish that these are column headers in my data frame. Then we can assign to them these one, two, three, four, five lists that we created. Then I'm using the first function there, and I'm passing the data frame, comma, three, because I want the first three rows to be printed to the screen, just to see that everything came in properly.

Then we see the first three rows, and we see my column headers. Independent was correctly identified as a 64-bit float data type, and Dependent is a 64-bit float too. Group is a string; I might want those as categorical, and remember, the DataFrames package does have a function for converting to categorical. Scale is a 64-bit integer. But look at Grade: it was interpreted as a 64-bit integer, but that's not what we wanted. We wanted it to be ordinal categorical, in other words not numerical at all. So what can we do? Because if we plot that as is, we should see a color scale, because it is seen as a numerical value.

So here's my plot, but look at that: I'm piping the data frame in, so my data is now piped into the @vlplot macro. It's df, then my pipe operator, and then @vlplot, so I'm just piping that as the first argument into that macro. My mark is still point. My x-axis is Independent, but now I'm referring to the column header, not to the computer variable that I had before, so this is one of the columns inside of my new data frame. I'm using symbol notation, so :Independent. You can also reference the column headers inside of a data frame as a string, so I'm just showing you both there: symbol notation for x and string notation for y. Then the color is going to be Grade. Now I am using string notation, Grade and then colon o for ordinal, so it says to the @vlplot macro: please see this Grade not as
a 64-bit integer but as an ordinal categorical variable. If we do that and display it, we see Grade is now one, two, and three. It is seen as a categorical variable, so it's not a color scale as we would otherwise have expected. So there's a bit for you to learn as far as VegaLite plots are concerned: remember, you can use symbol notation and strings, and then we can have these interpreted as the data type that we require.

I'm going to be a bit more verbose now, just to show you how these things come together, because what we've been using up until now is just some syntactic sugar; in other words, there's some shorthand notation there. But we can be much more verbose and start to express things closer and closer to what proper JSON would be, and there are just a couple of differences between JSON and the Julia implementation in VegaLite.jl. So let's go for it.

I'm piping df into my @vlplot macro, and now I'm saying mark equals point, so I'm actually specifying that this is my mark. Remember, we have marks and encodings, and we might also have transforms, but we're sticking with marks and encodings. So I'm actually saying mark equals, instead of just using :point. Then I'm saying encoding, and because I want to pass different things about my encoding, I put them in curly braces. Once again, this is not a dictionary; this is just part of the syntax of this macro. So I have x, y, and color, my three encodings. Because I want to specify different things about x, I put it inside of curly braces, and when you go to the VegaLite.jl website you are going to see that these things form a hierarchy. Under encoding you have what you want to encode, say on the x-axis, and then what are all the things that can go into the x-axis? There are all these properties that you can set, and it all forms part of this hierarchy. So for the x-axis I can say something like a field, so the field is going to be the Independent column in my data
frame, and the type, although it would be correctly interpreted anyway, but I'm forcing the issue: I'm saying type is quantitative, so not ordinal, not nominal, but quantitative. For y, the field equals the Dependent column and the type is also quantitative, and for color the field is Grade and the type is ordinal. So I'm going to get exactly the same plot as before, no difference. I just used a lot of syntactic sugar up front with this one (look how neat and short that is), but now you can start seeing where this all comes from.

So now that you have the basics of a mark and an encoding, and you know a little bit about how the @vlplot macro is manipulating, or at least using, that JSON, let's add some common plot elements. For me a plot must have a title, and it must have x- and y-axis titles, so let's go for that. I'm piping df into the @vlplot macro. I'm just saying :point, not mark equals point. For a title you say title equals Scatter plot. There are some other things you can do with a title, because that's just the text of it: you can set the font size, the font type, and so on. I'll show you in the next one; you can see it peeping at the bottom of the screen there, but don't look there, look here.

Now on the x-axis I'm just saying Independent instead of field equals :Independent, so you can leave things out and the macro will know what to do. I'm also passing a title, Independent variable and Dependent variable. If we look up here, it just said Independent and Dependent, taken from the column header names, but now I'm specifying something else: Independent variable, Dependent variable. For color, the Grade is ordinal, and then legend equals. There are several things you can set for the legend, for instance the color, etc., but all I want is the title. Because I'm specifying some of the parameters of this encoding further down the hierarchy, I put that in curly braces as well.
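The steps so far can be pulled together in one sketch, under a few assumptions: the exact calls used to draw scale and grade are inferred from the description (integers between 10 and 20, and the values 1 to 3), and the color, column, and scale variants are shown here from the data frame rather than from the raw vectors. Passing vectors directly to x and y follows what the video does.

```julia
using Random, Distributions, DataFrames, VegaLite

Random.seed!(12)
independent = 0:0.1:9.9
dependent = independent .+ rand(Normal(), length(independent))  # standard normal noise
group = rand(["A", "B"], 100)   # categorical
scale = rand(10:20, 100)        # numerical (assumed to be integers from 10 to 20)
grade = rand(1:3, 100)          # numbers that stand for categories

# Bare minimum: a mark plus x and y encodings, straight from the vectors
p_vectors = @vlplot(:point, x = independent, y = dependent)

df = DataFrame(Independent = independent, Dependent = dependent,
               Group = group, Scale = scale, Grade = grade)
println(first(df, 3))

# A third encoding: color by category, one column per category, or a numeric color scale
p_color  = df |> @vlplot(:point, x = :Independent, y = :Dependent, color = "Group:n")
p_column = df |> @vlplot(:point, x = :Independent, y = :Dependent, column = "Group:n")
p_scale  = df |> @vlplot(:point, x = :Independent, y = :Dependent, color = :Scale)

# Shorthand: symbol for x, string for y, and ":o" to force Grade to ordinal
p_short = df |> @vlplot(:point, x = :Independent, y = "Dependent", color = "Grade:o")

# The same plot written out verbosely
p_verbose = df |> @vlplot(
    mark = :point,
    encoding = {
        x = {field = :Independent, type = :quantitative},
        y = {field = :Dependent, type = :quantitative},
        color = {field = :Grade, type = :ordinal}
    }
)

# Adding a plot title, axis titles, and a legend title
p_titled = df |> @vlplot(
    :point,
    title = "Scatter plot",
    x = {:Independent, title = "Independent variable"},
    y = {:Dependent, title = "Dependent variable"},
    color = {"Grade:o", legend = {title = "Grade variable"}}
)
```

Each p_ variable holds a plot specification; evaluating one as the last expression of a notebook cell displays it.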
After a while you'll start to know where these curly braces are supposed to go. And there we go: I get Scatter plot as my title there, I get my axis titles, and I get Grade variable there as my legend title. Beautiful.

So let's be much more verbose, so you can see where all these things are coming from. I'm saying mark is :point. For my title, now I'm not just using the title text; I want to specify a lot of stuff, so that all goes inside of curly braces. If it's something even deeper, say one of these parameters here has a set of arguments of its own, that will all go inside of another set of curly braces, etc. So my text is Scatter plot, my align is left, my anchor is start, and my color is steelblue. You can just go to the Mozilla developer website and you'll see all these named colors; otherwise you can just use a hex code. My fontSize, see the uppercase S there, is 18, my subtitle is Data visualization, and my subtitle color is also steelblue.

My encoding is the following. x equals, and there are certain things I want to set, so it goes inside of curly braces: field equals Independent, type is quantitative, and see, I'm using it as a string as opposed to the colon notation, the symbol notation. Then axis equals, and down the hierarchy, because there are many things that you can set for the axis separately, they have to go in curly braces: the title is Independent variable, the title color is steelblue, and grid equals false, so along the x-axis I'm not going to see a grid. For y I've got a field, a type, again an axis, and a title color, but this time I've not said grid equals false. The default is true, by the way, so I'm going to see the y-axis grid. For color, you see what I've done there: the field is Grade, the type is nominal this time, and then legend, and because there are a couple of things that you can set with a legend, it goes inside curly braces. It just makes sense. It's just beautiful how you can string all these things together, and you can see what's happened there; you can see the colors.
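The fully verbose version described here might look like the sketch below. The data frame is rebuilt so that the block runs on its own, and the legend title is an assumption, since the video only says that legend properties go inside curly braces.

```julia
using Random, Distributions, DataFrames, VegaLite

Random.seed!(12)
independent = 0:0.1:9.9
df = DataFrame(
    Independent = independent,
    Dependent = independent .+ rand(Normal(), length(independent)),
    Grade = rand(1:3, 100)
)

p = df |> @vlplot(
    mark = :point,
    title = {
        text = "Scatter plot",
        align = "left",
        anchor = "start",
        color = "steelblue",
        fontSize = 18,
        subtitle = "Data visualization",
        subtitleColor = "steelblue"
    },
    encoding = {
        x = {
            field = "Independent", type = "quantitative",
            axis = {title = "Independent variable", titleColor = "steelblue", grid = false}
        },
        y = {
            field = "Dependent", type = "quantitative",   # grid left at its default (true)
            axis = {title = "Dependent variable", titleColor = "steelblue"}
        },
        color = {
            field = "Grade", type = "nominal",
            legend = {title = "Grade variable"}           # assumed legend title
        }
    }
)
```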
The title is anchored to the start and left-aligned; everything is quite beautiful there. So I can become very verbose in how I want to create these VegaLite plots. Very nice.

The only thing I've really added here, just to show you, is that we can specify the height and width as well. I'm going to say height is 200 and width is 600, so these are pixels, and everything else is the same, although you'll see I've used a lot more of the syntactic sugar, leaving out a lot of the verbosity that I had before. Then you can see a 600 by 200 plot, very nicely.

So let's look at some commonly used plots. A histogram is good just to show you the distribution: it's going to bin your continuous numerical variable and do a frequency count of how many values fall in those little intervals. I'm going to create a new column in my data frame, and I'm going to call it Height. I'm going to use the normal distribution, and I'm saying Distributions.Normal just so you see where that Normal comes from: it comes from the Distributions package. So it's rand of Distributions.Normal with a mean of 160 and a standard deviation of 10, and I want 100 values, please. You're going to get a bit of a warning; we haven't had a proper update for the DataFrames package yet, so there are some things there that have been deprecated. That's just something you see commonly in Julia as all the packages take their time to be upgraded.

So let's just plot this. Here comes our first transform, but it's a hidden transform, and it is going to be a count: the frequency count is the transform. I'm going to show you the shorthand here because, if you do histograms, you're going to do this commonly. It's perhaps proper to do it more verbosely so you can see where that transform comes from, but because this is the way we use it most of the time, I'll show you that first. So I'm piping df into the macro and setting a width and a title. It's a bar, so a histogram is also just a type of bar plot.
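As a sketch of the two small steps just mentioned, setting the plot size in pixels and adding the Height column: the data frame is rebuilt here so the block runs on its own, and the df.Height assignment is the current DataFrames syntax, which may differ from the (then deprecated) form used in the video.

```julia
using Random, Distributions, DataFrames, VegaLite

Random.seed!(12)
independent = 0:0.1:9.9
df = DataFrame(
    Independent = independent,
    Dependent = independent .+ rand(Normal(), length(independent)),
    Grade = rand(1:3, 100)
)

# Same scatter plot in shorthand, now 600 pixels wide and 200 pixels high
p = df |> @vlplot(
    :point,
    height = 200, width = 600,
    title = "Scatter plot",
    x = :Independent, y = :Dependent, color = "Grade:o"
)

# New column: 100 draws from a normal distribution with mean 160 and standard deviation 10
df.Height = rand(Distributions.Normal(160, 10), 100)
```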
We're just going to do a frequency count, so there's no :histogram; it's :bar. x is Height, I'm setting bin to true, and the title is Height, so it's going to bin, and it's going to decide on the bin size on its own. On the y-axis I'm using a transform, and you see that it's count, open and close parentheses; it's almost like a function that I'm using here. It says: for my y-axis, do a count of whatever is in the interval on the x-axis. I'm giving it a little title there. When that runs you can see what has happened: the bin size is 10 there, and all we're seeing are these frequency counts, how many are on the interval 150 to 160, 160 to 170, etc.

We can control the bin size, and the way that I usually control the bin size is just by setting maxbins. If I say maxbins equals five, it's still going to choose a proper interval so that it's nicely rounded, depending on the values of my variable, but I'm specifying bin as maxbins equals five. That doesn't mean there will be five, but there will be at most five, with a nice interval. You can see it used 20 as an interval now, and I have four bins there, and that does not exceed the maximum of five that I stipulated.

Density plots are also quite nice. The mark for that is the area mark, :area, as you can see there, and I'm setting just a bit of opacity so that we have a bit of see-through. And here comes my first explicit transform. Here you can see the encoding; remember, I could have put everything, the x and y, in encoding. This transform is the third thing other than the mark and the encoding: so there's a mark, an encoding, and a transform, and this transform lives on its own. That's typically not how I use it, but just notice that you can do all your transforms separately. My transform is a list, and inside of curly braces there's the density of Height, so I'm doing a kernel density estimate of the Height variable here, and I'm setting a bandwidth of three. Then my x
and y have to do with this transform of mine: on the x-axis the Height values, as value, typed as quantitative, and on the y-axis the density, whose type is also quantitative, and I've set a title there. This is how we would typically use it, but it might not make a lot of sense when you see it like this. And there we have our density plot of Height.

Let's just add a little bit to this. All I'm going to do is add a color, and I'm setting that to be nominal categorical, so you see color there. I'm still going to use my same transform, and then use value and density for what goes on the x- and y-axis, and those come from this density keyword in the transform. Overall I'm also setting opacity to 0.5, so that we can see the two distributions for A and B. I'm going to show you a little bit more about transforms, but this is how it will typically be used. It takes some getting used to, because it's not absolutely clear what is happening there, but I think you get the idea.

Now, box-and-whisker plots: fantastic for distributions. Of course I've shown you a nice one in the introduction, but let's create another one. I'm piping df into my @vlplot macro. My mark is a boxplot with extent equals 1.5. What that does is it says: look for statistical outliers by adding 1.5 times the interquartile range to the third quartile value and subtracting it from the first quartile value. Those will be the limits of my whiskers, and anything beyond that is a statistical outlier. I'm choosing a color for my mark, so all the box plots are going to be orange, with an opacity of 90%. I'm setting a width and height there. On my y-axis I'm putting Group as ordinal, and we know what the sample space elements there are, and on my x-axis I'm putting the numerical variable, so these are going to be box plots that are horizontal; see, I swapped x and y there. And this domain argument, set to
false: that's the zero lines on your plots. You can see, if we go back up here, on the left-hand side and at the bottom there are those thicker gray lines. You can set that to false and they won't show. Then there's the size of 50; that would be how many pixels wide my boxes are. So let's plot that, and we see indeed this horizontal box plot, Height on the x-axis and then the two sample space elements on the y-axis. You can see my suspected outlier there, because my whiskers only go out to this extent of 1.5 times the interquartile range. Beautiful stuff.

Sometimes, though, you don't want to plot the box-and-whisker plots; you only want to plot the actual data points. So instead of using boxplot, I'm using point as my mark type. The color is still orange, and everything else is basically the same. See, I've set the x- and y-axis up for vertical ones now, so instead of the two box-and-whisker plots I'm just getting an idea of all of the data point values as they are. Notice again, though, that the axis labels are lying on their side at minus 90 degrees. That is the default for categorical variables on the x-axis, and we can fix that very easily with the labelAngle argument. So I'm saying x is Group, and that's seen as ordinal categorical, the A and B, and then the axis is going to be labelAngle equals zero. If I do that, lo and behold, the A and B are upright. As simple as that.

So we've already seen the transforms, and one of the transforms is aggregate. Let me show you an easier way to use it, and that is right inside of the encoding: not as a transform on its own, but inside of the encoding. I'm piping df into @vlplot. My mark is a point, so I'm going to get a scatter plot. My color is orange and my width is a hundred. On my x-axis I'm going to have the Group as ordinal data with a label angle of zero, and then on the y-axis I want the Height, please. The type is quantitative, and now I'm bringing one of the transforms in right here.
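The distribution plots described above might be sketched as follows. A few details are assumptions rather than things stated in the video: the widths, heights, and titles not mentioned explicitly; placing domain = false inside the x axis properties; and the groupby entry in the second density transform, which is one way to get a separate curve per group.

```julia
using Random, Distributions, DataFrames, VegaLite

Random.seed!(12)
df = DataFrame(
    Group = rand(["A", "B"], 100),
    Height = rand(Normal(160, 10), 100)
)

# Histogram: a bar mark, binned x, and a count on y
hist = df |> @vlplot(
    :bar, width = 400, title = "Histogram of height",
    x = {:Height, bin = true, title = "Height"},
    y = {"count()", title = "Frequency"}
)

# Same histogram with at most five bins
hist5 = df |> @vlplot(
    :bar,
    x = {:Height, bin = {maxbins = 5}, title = "Height"},
    y = {"count()", title = "Frequency"}
)

# Density plot: an explicit transform producing "value" and "density" fields
dens = df |> @vlplot(
    mark = {:area, opacity = 0.5},
    transform = [{density = "Height", bandwidth = 3}],
    x = {"value:q", title = "Height"},
    y = {"density:q", title = "Density"}
)

# One density per group (groupby is an assumption, see above)
dens_by_group = df |> @vlplot(
    mark = {:area, opacity = 0.5},
    transform = [{density = "Height", bandwidth = 3, groupby = ["Group"]}],
    x = {"value:q", title = "Height"},
    y = {"density:q", title = "Density"},
    color = "Group:n"
)

# Horizontal box plots with whiskers at 1.5 times the interquartile range
box = df |> @vlplot(
    mark = {:boxplot, extent = 1.5, color = "orange", opacity = 0.9, size = 50},
    width = 400, height = 200,
    y = "Group:o",
    x = {"Height:q", axis = {domain = false}}
)

# Raw data points per group, with upright axis labels
strip = df |> @vlplot(
    mark = {:point, color = "orange"},
    x = {"Group:o", axis = {labelAngle = 0}},
    y = "Height:q"
)
```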
One of the most common ones is aggregate. If you look on the VegaLite.jl website you are going to see a couple of transforms, one of these is aggregate, and one of the aggregates is the mean. So instead of all the values that we had up here, we're now just plotting the two means, and you can see that both of them are just over the 150 mark. For me, that's an easier way to bring in the transform. You could have built the transform separately, given it a name, etc., but using it right inside of the encoding is actually much more useful for me.

So let's do bar charts. Just as with the histogram, we're using the mark bar. I'm setting an overall opacity for my mark, then my x-axis and my y-axis, and what we're doing now again is that common use case where we do this type of transform in the shorthand fashion. Instead of aggregate equals, we're just stating count, open and close parentheses. You'll see on the website that that really is the common way to do these transforms. So I just want a count of those sample space elements, and color is going to be by the Group. But now you see something new: scale. Remember, that has nothing to do with the Scale variable we created before and put into our data frame; this is one of the keyword arguments there, and I'm just specifying two different colors.

If we plot that, we see A and B, and they're colored differently, but because there's an overall opacity to my bars we're going to see this mix of colors where they overlap, and of course the orange and the blue are going to give me this bit of green. So with opacity in bar charts, remember, this can create a bit of a color issue. But you can see it's just the frequency count of the three sample space elements in the Grade variable, and for each of those we see groups A and B and how many there are. So instead of doing it that way, remember I said we can do the
transform as a keyword argument, and the type of transform is aggregate. I'm going to do exactly the same count there, and now you can see where that count really comes from. On my y-axis I'm going to set a field, and that's also Grade; it was Grade on the x-axis and it is Grade here. Then type is quantitative and aggregate is count, and I'm saying stack equals nothing; nothing is what becomes the null value as far as Vega-Lite is concerned. There we go, and we see exactly the same plot. So stipulating aggregate as count there, or using the shorthand as we did up there, is going to do exactly the same thing.

To get rid of this color issue, of course, we can just use stacking, and because stacking is going to be the default for us here, we can just say bar with the x, the y, and the color. I'm specifying color in this format now, and what I've done here is I've just swapped Group and Grade around so that we can nicely see the stacked bar chart. So the three grades, one, two, and three, for groups A and B are shown here as a stacked bar chart.

And that's it. That's the introduction to VegaLite. I hope you enjoyed it. VegaLite produces these wonderful plots. Try them inside of nteract; nteract is a great coding environment, and I use it with my individual Julia environments. I create environments; if you're used to Python, of course, with conda or virtualenv you can create different environments, so that when you do install packages it's only for that environment, and you don't just keep on adding packages, packages, packages into your base Julia installation, which is not advisable. Then nteract here is going to pick up what environment I am in at the moment, and it only looks for the packages that I've installed for that specific environment. When you have a look at the Mozilla developer website there, you'll see all the nice colors. And if you run this code in the REPL, if you installed everything as I have from GitHub and you
run this here, it's only going to give me the codes for the versions, but you can see in this environment that I have here there's my Project.toml file, and it lists, inside of this project of mine, the learn VegaLite project environment, the packages that live inside of that environment. That'll be very different from any of my other projects, for which I have an environment each. In my base Julia there's nothing; I don't install packages there.

So give us a thumbs up, follow, subscribe, and leave a comment if there's anything else about VegaLite you want to see. But go and try it out. Of course you can use it inside of Juno. In my instance here, running on macOS Catalina, I can't use the plot pane, so I've disabled the plot pane, and that forces the plotting into my browser, even though I'm running it in Atom, or Juno. If you're using Microsoft Visual Studio Code, it is just going to open a new tab and you're going to see the plots in that new tab. But using it here in nteract is very nice, and it really works beautifully.
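For reference, the aggregate and bar chart cells from the last part of the walkthrough might be sketched as follows. The opacity value is an assumption, and stack = nothing is assumed for the shorthand bar chart as well, since that is what makes the bars for A and B overlap.

```julia
using Random, Distributions, DataFrames, VegaLite

Random.seed!(12)
df = DataFrame(
    Group = rand(["A", "B"], 100),
    Grade = rand(1:3, 100),
    Height = rand(Normal(160, 10), 100)
)

# Aggregate inside the encoding: one point per group at the mean Height
means = df |> @vlplot(
    mark = {:point, color = "orange"},
    width = 100,
    x = {"Group:o", axis = {labelAngle = 0}},
    y = {field = :Height, type = :quantitative, aggregate = :mean}
)

# Bar chart of counts per Grade, colored by Group, bars overlapping (no stacking)
bars = df |> @vlplot(
    mark = {:bar, opacity = 0.7},
    x = "Grade:o",
    y = {"count()", stack = nothing},
    color = {"Group:n", scale = {range = ["orange", "blue"]}}
)

# The same chart with the count written as an explicit aggregate
bars_verbose = df |> @vlplot(
    mark = {:bar, opacity = 0.7},
    x = "Grade:o",
    y = {field = :Grade, type = :quantitative, aggregate = :count, stack = nothing},
    color = {"Group:n", scale = {range = ["orange", "blue"]}}
)

# Stacked bar chart (stacking is the default), with Group and Grade swapped
stacked = df |> @vlplot(:bar, x = "Group:n", y = "count()", color = "Grade:o")
```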