The next session, workshop five, is exploratory data analytics for higher education: tabular data sets and an open source application for institutional practitioners. And, you know, we'll have a look at what 4IR has for us and have a couple of discussions around that. Now, the facilitators are Herkulaas Combrink, and we have Prof Vukosi Marivate, and then we've got Prof Benjamin Rosman. So we really look forward to this session. Once again, if you do have any questions during the session, you may add them in the chat. We will have a Q&A session. I do realise that Karen also sent, you know, a feedback form for all the sessions that we've had. Please, at the end of all the sessions, may you go ahead and complete that as well. Thank you, and have a wonderful afternoon. Thank you very much for that, Bradley. I just want to know if I'm audible. Yes, you are. Thank you. Okay, fantastic. Thank you very much. And colleagues, thank you. I just think it's very important for all of us to be on the same page. The way this workshop is going to work is, in the first small bit, I'm just going to give an overview of some of the concepts. Some of you may be familiar with them; others might not be. But then we're going to shift gear into the actual doing part, and I'm going to share the open source software and some of the things we have developed and are working on within our research group. So I'm Herkulaas Combrink, and this work comes from two laboratories: the Data Science for Social Impact research group and the Robotics and Autonomous Learning lab. So we're going to talk about what open science is as an idea, what it means for institutional research, and what it means for the data that we use in institutional research. Then we're going to cover some terms and some concepts that you may or may not know about.
And then I'm going to share with you the open source tool that we developed as part of this work. So, as a caveat, this is the only slide that you're going to see with a lot of narrative on it. The rest are all picture based, and it's a little bit more interactive. But open science started from this idea of open source software, in a way, where you have open software that can be used by everyone. So you don't necessarily need to buy a license; if you want a premium service you pay for it, but the basic service is free. Now, open science represents a new approach to scientific progress. It's based on cooperative work and new ways of diffusing and working with knowledge using digital technologies, which have advanced and leveraged ICTs at a level that we could never have imagined 40, 50 years ago. So this is powerful, and what open science allows for is this idea of strengthening data reuse. Now, data reuse is a fascinating concept in that, if you have a researcher who conducts a survey or an inquiry and uses that survey in a very specific institutional research or niche research field, there might be other researchers who want to use this data not for what it was initially intended, but who can still gain value from it. And so hopefully, by the end of today, the big take-home you should have is that there are data we can use for data reuse, and this is important because it can strengthen the work. Now, open science addresses a lot of problems. One problem, and there's a lot of literature written about this, is the idea that knowledge should be shared more efficiently among scientists. Well, open science addresses that problem. Or that research depends on the availability of very specific tools, so if you don't have access to the tools you can't do the research; open science addresses that as well. Science needs to be accessible to the public; absolutely, open science addresses that too.
Scientific contributions have very unique impact measurements, and we're going to touch on one of these, but there are quite a few. In the previous presentation and workshop you would have seen the use of a Sankey plot, which is a very unique and novel way to illustrate quite complex information. And then, last, access to knowledge isn't actually universally distributed, meaning that you need to be in a higher education institution or have premium access to very specific knowledge in order to obtain it. One or two hundred years ago, only those who were learned scholars had access to knowledge, and those who did not went uninformed. So open science addresses more universal basic constructs that people need to have and share among scientific communities. And what makes open science quite attractive is the idea that, as scientists, as institutional researchers, as people who work in the IR space, there are certain baseline fundamentals that we should share with one another, so that every advancement we build on them is truly innovative and novel. And it's that innovation and novelty that is truly profound and beautiful about this idea of open science. Now, on the screen is data that some of you are very used to. These are just arbitrary, random mock graphs: we've got a bar graph, we've got a pie chart. And what is important about this is that this is how we distill and aggregate information and say, there's the final product. This is something our decision makers have got quite used to, which is great. But we argue that that is a fundamental baseline. In fact, we argue that for the future, this is even sub-fundamental. We want to get to a level where the type of knowledge and sophistication we use to convince decision makers involves not only more profound analytics, but also much clearer storytelling around the analytics. And it's a certain skill set that we need to illustrate this nuance.
So here's an example of what I mean. This is a published paper that used a data science pipeline for course data; it was a case study analysing heterogeneous student data in flipped classrooms. So what you're looking at on the screen is a heat map, specifically looking at the second cohort, or the second course, of these flipped classes. And it's a correlation heat map showing the correlation coefficients between all the variables in the study. On the right here you see a key. The greener it is, the more positive the correlation; the redder it is, the more negative the correlation; if it's yellow, then there's not really a strong relationship between that variable pair. And what this analysis can do, at a glance, is identify that there are pairs of variables that are more closely associated with one another. So things like financial status and financial security with student success, versus things like what a person had for breakfast and how well they performed in a particular test. The challenge comes in, not in doing this kind of analysis, but in trying to understand why there are certain associations between certain variable pairs, and why the context differs between universities and between departments; and that's where this work is truly exciting. You find that the more you automate, the more you can actually focus on the nuance. Now, some of you know these terms very well, but for those of you just touching base with what a correlation coefficient is: we're only going to touch on correlation coefficients and heat maps today, but the bottom line is there are hundreds of different statistical and data science techniques that you can use to visualise very complex data and do data exploration that is quite meaningful, at a fast pace. So, on the top left you see that there are data points on an x and y axis, and they're quite closely clumped together.
Now, if you measure the linear relationship between these, this is what a strong positive correlation looks like: from left to right, you can see that as the one variable increases, so does the other. A weak positive correlation would look the same, except the distribution of those data points would be a little bit further apart from one another. No correlation would be if these data points are just randomly distributed around the plot. A negative correlation, if it's strong, would be the inverse of the strong positive correlation, and a weak negative correlation would be the same as a weak positive correlation, just the inverse of that. Now, there's a lot more statistical depth to go into with regard to correlations, but the bottom line and intuition that you need to take from this is that a strong correlation indicates that changes in variable pairs mean something to one another. And if you can identify, out of 2,000 variable pairs, the top five, then that's a really good starting point to try and identify why these variable pairs are so strongly associated with one another, and whether it's worth investigating them further, rather than investigating everything at the same time and wasting a lot of time. So, to put this into practice, what you're looking at here is a correlation coefficient heat map. All of the ones here indicate 100% correlation, because each variable correlates 100% with itself. On the left-hand side you see the variables listed; at the bottom is the transpose of the variables listed on the left-hand side. So on the left it's CRIM, ZN, INDUS, etc., and at the bottom it's also CRIM, ZN, INDUS, etc. And so all of these are correlated with one another: CRIM is correlated with CRIM, CRIM is correlated with ZN, CRIM is correlated with INDUS, etc.
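To make that intuition concrete, here is a minimal sketch in Python (the language the tutorial itself uses) with made-up toy data; the column names are invented purely for illustration:

```python
import pandas as pd

# Toy data set: one strong positive relationship, one near-zero relationship.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 55, 61, 64, 70, 74, 79, 83],
    "shoe_size":     [7, 9, 6, 8, 7, 10, 6, 9],
})

# Pairwise Pearson correlation coefficients -- the matrix a heat map plots.
corr = df.corr()

strong = corr.loc["hours_studied", "exam_score"]  # close to +1
weak = corr.loc["hours_studied", "shoe_size"]     # close to 0
print(round(strong, 3), round(weak, 3))
```

The diagonal of `corr` is all ones, exactly as on the slide, because every variable correlates perfectly with itself.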
What these variable names mean is irrelevant, because that will differ with every data set you work with. The name of the variable isn't what's relevant; what is relevant is, can you determine variable associations? So, an example: if I look at this, there is a very strong relationship between the variable TAX and the variable RAD. I don't necessarily know what RAD or TAX mean, but I can intuitively see that there is a very strong relationship between TAX and RAD. And the same here: if I look at RM and MEDV, there is a very strong relationship between RM and MEDV. Now, if you look at the data science pipeline, and this is something a lot of people have mentioned, in short there are many ways in which we can unpack this, but it starts with some kind of question. You've got a lot of different data, you wrangle and work with the data, you clean the data, and then you start with your data exploration. And within that data exploration, you explore the data, pre-process the data, model the data, and then you repeat this process until you have saturated the data exploration. What we're talking about today is this process: the process of exploring, pre-processing and modelling the data. I'm going to share ideas around the data, but we're not going to touch on data cleaning or wrangling or validation or telling the story; those are conversations for another day. We're just going to illustrate the underlying information and open data sources, as well as how this modelling and exploration can be automated. And so, just to iterate again, this comes from the Data Science for Social Impact research group, as well as the RAIL laboratory. Now, I'm going to share this in the chat, and ask Elizabeth as well to share it in the chat, so let me just stop sharing so that all of us can get this link, because I'm going to show you the open source tool that we created.
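The explore, pre-process, model, repeat loop described above can be sketched roughly as follows; the function names and the toy table are hypothetical stand-ins, not the actual pipeline code:

```python
import pandas as pd

def explore(df):
    """Data exploration: shape, missing values, and pairwise correlations."""
    return {
        "shape": df.shape,
        "missing": df.isna().sum().to_dict(),
        "correlations": df.corr(numeric_only=True),
    }

def preprocess(df):
    """Minimal pre-processing for the sketch: drop rows with missing values."""
    return df.dropna()

def model(df):
    """Placeholder for the modelling step (e.g. fitting a classifier)."""
    return {"n_rows_modelled": len(df)}

# Explore -> pre-process -> model, repeated until exploration is saturated.
df = pd.DataFrame({"gpa": [3.1, None, 2.8, 3.6], "passed": [1, 0, 1, 1]})
summary = explore(df)
clean = preprocess(df)
result = model(clean)
print(summary["missing"], result)
```

In practice you would cycle through these steps several times, letting what you learn in exploration reshape the pre-processing and modelling choices.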
But I want to take you through it first, and then I want to show you what you can do with some open science techniques. I'm sharing it in the chat. And let me share my screen to take you through just the high level of it before we start playing around. So I'm going to ask: can you see my screen? Yes, we can. Thank you very much. Okay. So, to understand this a little bit: this is a GitHub page, a repository of a lot of the different things that we do in this space. There's a tutorial on here, because I am not going to assume that everyone remembered everything I said, and I'm also not going to assume that one intervention is enough to learn what this means. So by preaching open science, I'm actually demonstrating open science to you all. Some of the work that we did, and will continue to do, we're going to put on this open higher education exploratory data analysis platform. And we're going to share our code, the data that we used for our code, and how we came to these conclusions, so that you, within your institutions, can repeat and replicate these results. And if you struggle with it, you can email us and we will assist you. But the idea here is that you use some of this. So there are a few things going on: there are some folders on top, and I created a nice little README at the bottom. There's a tutorial, which is the first bullet point, presented as a notebook. So if you click on that, it takes you to this page where there's a notebook and a readme. If you click on the notebook, what it actually shows you is the Python code that we used to create this exploratory data analysis. And we also built this analysis into an app that you can use: if you've got a CSV file, you can actually play with it today. So I'm going to show this to you when we get there. But the bottom line with this tutorial is that it starts off with the dependencies that you need.
If you do any kind of R scripting or Python scripting, you need some kind of library dependencies, and I explain that a little bit in here so that you would know. Then this is where you add your data. And then you can start looking at your information: you can start looking at what the variable types are and whether you're missing data. Once you have determined that, you can do all the descriptive statistics that you need, looking at standard deviations and means, etc. Once that's done, there's always a variable of interest, like whether the student passed or failed. In this instance, I used a Nigerian open source data set from an engineering school, and I'm going to go over that data set in a moment. The majority of the students passed; there were a lot of students that passed with distinction, there were students who borderline passed, and then the students who failed. And if you go into the publications, which I will share with you just now, you will get the gist of what these data sets actually mean in terms of all the variables. If you look at them, you will find that there's an ID, there's gender, year of graduation, a grade point average for years one and two, and an overall grade point average, which is shared in this open data set. And then, once you plot all of this data in the heat map, this is what the heat map looks like. So you can see that there's a very strong relationship between certain variables and a very weak relationship between others, and you can explore and go into depth as to why this is the case. And once you've done all of this data exploration, you will eventually find out which variables are the most important in your data set to use. And then you can do some advanced multivariate statistical analysis, machine learning analysis, or use other techniques that data scientists love to use, to try to either predict at-risk students or determine what the plausible or possible outcome is for a current cohort.
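Those first steps of the tutorial, checking variable types, missing data, descriptive statistics and the distribution of the variable of interest, can be sketched in pandas like this; the column names are stand-ins for illustration, not the actual fields of the Nigerian data set:

```python
import pandas as pd

# Hypothetical stand-in for the kind of student records described above.
df = pd.DataFrame({
    "gender":       ["F", "M", "M", "F", "M", None],
    "cgpa":         [4.2, 3.1, 2.4, 4.6, 1.8, 3.3],
    "final_result": ["pass", "pass", "pass", "distinction", "fail", "pass"],
})

# Variable types and missing data.
print(df.dtypes)
print(df.isna().sum())

# Descriptive statistics: count, mean, standard deviation, quartiles.
print(df["cgpa"].describe())

# Distribution of the variable of interest.
print(df["final_result"].value_counts())
```

Each of these one-liners corresponds to a cell in the tutorial notebook, which is why the whole sequence is so easy to automate.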
So in this instance, I just used a classification algorithm known as K-nearest neighbours, and I wrote down the entire algorithm with its code for those who technically want to know how it works. And if you don't know how this programming works, I would suggest you go through the tutorial and try it yourself. And if you do know, then you will see some of the hyper-parameter tuning that I did. And then ultimately you'll end up with some model that looks like this and says, okay, well, we can predict student outcome with at least a 65% accuracy, given this data. And we can apply it to other data and get more or less 65% accuracy, which is great. It's not perfect; this model is by no means the best model to use for this kind of task. But it illustrates that to get to the point where you want to fine-tune hyper-parameters and use machine learning, you need to do some base exploration first. And so that's what this entire tutorial is about, and it's in there for you, free. You can go and access it. To navigate back to the home page, you just click on Higher Education EDA, either of these two buttons, and it will take you back to the base page. The second resource that is openly available, which we put in here because we found it incredibly valuable in the work we did and wanted to share with you in terms of exploring data and some of these concepts, is the open data folder. So again, you can navigate by clicking on the open data folder over there, or you can click on the data link down there. And in there, you will find that there are five data sets. The Colombian engineering data set is by far the most comprehensive and large; I'm not going to open this one. It's more for your data science enthusiasts, if you want to play around. But I will show you some of the data.
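As a hedged sketch of that kind of K-nearest-neighbours exercise, here is a scikit-learn version on synthetic data (not the actual tutorial data), including a simple loop over the `n_neighbors` hyper-parameter of the sort the tutorial tunes:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for student features (e.g. GPA-like scores) and outcomes.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(300, 3))    # three GPA-like features
y = (X.mean(axis=1) > 2.5).astype(int)  # pass (1) / fail (0) outcome

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# n_neighbors (k) is the hyper-parameter to tune; try a few values.
best_k, best_acc = None, 0.0
for k in (3, 5, 7, 9):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, round(best_acc, 2))
```

The point is not the model itself, but that the tuning loop only makes sense once the base exploration has told you which features are worth feeding in.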
So the first data set is from a Nigerian university, and it deals with engineering students specifically. To open the data, you simply click on the link that says engineering transform.csv, which is a CSV file of the data. It opens up the data, and you can download it and do experiments on it. So it shows the grade point average of the students, the year of graduation, the final result, and the identity. Now, at face value, this is a very good data set that you can use to determine continuity or progression of marks between different years, but it's not good enough for cohort-specific investigations. So that's where something like the UAE data set comes in. The United Arab Emirates released an open source data set giving the final results of students, plus some demographic information, and it tracks the entire school career from grade one to grade 12 in terms of all their marks in mathematics, English and science, per term, per quarter. It's openly and publicly available, and there are about 2,000 students' information in there. So it's a lot more comprehensive, and you can stress-test some of the ideas that you have, play around with the different kinds of visualisation techniques that you need, and also compare whether the context you sit within and the data you're collecting match up with the kind of data they are collecting in the United Arab Emirates. As I said, I'm not going to open the Colombian engineering data, because it's so comprehensive. What they did was they took survey data at four critical points over a period of 18 years, with students ranging from the point that they started school to the point that they graduated, and some going up to PhD. And they also tracked all of the academic performance indicators throughout, so it's a big data set, and I'm not going to open it now. The fourth is a well-known data set, which is based on two Portuguese schools.
The one is more STEM-focused, and the other one more social-science-focused. But in here there's a lot of information around the family size of the student, whether the father has a job, whether the mother has a job, whether the student has national funding. In the South African context we have the NSFAS programme, which funds a lot of undergraduate students, and in Portugal they've got similar mechanisms. And what is nice about this data set is that, in addition to all of this information about where the student comes from and what the home environment and situation look like, there's also a picture of the student's life. So does the student have access to the internet, does the student have a paid job, does the student have any extramural activities that they engage with, do they have a romantic relationship or not. There's additional information that they collected on these scholars, and then, in addition to this, also the kind of marks that the students had overall. And then the last one is the Open University's snapshot of the 2013–2014 cohort for STEM subjects. The Open University is a large distance-learning institution in the United Kingdom, and in there they've got data from 32,000 students, give or take, so it's quite a large data set. And it's by no means the only data set that's available; there's a much, much larger data set, and we're going to talk about that in a minute. But there's 32,000 students' information in there, and there are two kinds of information sheets. The one is all the data about the students in terms of what they do: their academic marks, class attendance and things like that; well, class attendance is virtual, not physical. So that's the one part, and there are about 100 variables collected just on information about the student. The other half of the data in here is click data on the virtual learning environment that the Open University uses.
So, what time of day did the student access particular content, what were the marks of the students, etc. There's a lot of complexity in this data set, and, like I said, the Open University's data set has 32,000 students' data. So it's millions of data points of click data surrounding everything about the students, given a year of study in STEM fields and social science. And a lot of things have been published about this, you know, STEM versus social science, and whether the students are the same or different, because their assessments look different: in the same subjects there are more assessments in the social sciences, and they are continuous and summative assessments. It's a very interesting way of comparing cohorts, because we have asynchronous and synchronous assessment opportunities and learning taking place in both instances. And so there are a lot of questions that are still unanswered given just this one data set. I'm just putting it out there, and this is just one snapshot of one part of the data set; I think it's 4,900 students' data that's being shared here. And as you can see, there's quite a bit of information: on the right you can see that there are many variable types, and it's quite comprehensive. Now, these are examples of five open source higher education data sets that are available to anyone with an internet connection. Obviously, GDPR and POPIA and the other compliance mechanisms have been applied to all of these data sets; they've been de-identified in the correct way, and there's no way to infer the individual students from any of this data. So the students' data that is here is safe to use. What is important is that the data here illustrates the kinds of possibilities, what one can do now that these data are available, and what you can do with them.
In this repository, you can go to the bottom and see the major publications for each of these data sets, and you can go ahead and find the sources of the information, read the articles, see what the people found initially, and then some of the challenges that they are still facing. Okay. Now, I'm sharing all of this with you so that you can understand what open science literally means. This is some of the hard work we have already done, on a silver platter, and we want you to engage with it. But we also know that a lot of people may or may not have explored data in very similar ways. And so what we're trying to do is understand where there are issues, where there are areas that require improvement. To engage with us, you can either email us, or you can simply click on the Issues tab, which is up there, and say New Issue. When you click New Issue, it will open up the title of your issue, so we'll say, well, "I am in the workshop", just as an example. Then I can leave a comment, say something for the workshop, and click Submit New Issue. And then what will happen is there will be a new issue: if I click on Issues, there will be a new issue that says "I am in the workshop". Now, anyone can post an issue. And then what we can do collectively, not just us, anyone, is respond to this issue, discuss this issue, and make adjustments given this platform. So, up until this point, we're all on the same page; we have all of this. So what does it mean for you? Do you necessarily need to be a coder to use this tool? How does it work?
And so what we decided to do is to package all of this in an automated way, using an exploratory data analysis app. The way you get to it is you go onto this page; at the bottom it says use the exploratory data analysis app by clicking here. When you click on it, it will take you to the app. Now, this app works specifically with CSV files. So if you use Microsoft Excel, as an example, you can Save As a CSV; that's a very easy way to get a CSV file. And the CSV file assumes that there's only one sheet, not multiple sheets. So we're looking at one spreadsheet that has a good structure. What I mean by that is that you don't have empty rows and columns; you've got your column headers in the top row, and below that are the rows with all the different entries, in this case different students and what they represent. And once you have your data set, whatever you want to use, and you can use the data sets that we made available on the website, I'm going to show you one, you can take the data set and load it into the app. Now, the app has a limit of 200 megabytes, so we don't want you to upload a CSV file of 10 GB or anything like that, because your computer might crash. But once you've imported this, and this is why automation is so neat, it does all of this for you on your own data set. It gives you your descriptive statistics. It creates a heat map. It lists all the variables that potentially have missing data; obviously, I know there's no missing data in this data set, but you potentially can have missing data in your own data set. It gives you all the correlation coefficients between the variables in a table. And it provides you a table telling you which variable pairs are the most closely associated with one another.
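One piece of what the app automates, ranking the most closely associated variable pairs, can be sketched in a few lines of pandas; the columns here are toy stand-ins, not the app's actual code:

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n variable pairs with the strongest absolute correlation."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once, with no
    # self-pairs (the diagonal of ones).
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().sort_values(ascending=False)
    return pairs.head(n)

df = pd.DataFrame({
    "gpa_year1": [2.0, 3.0, 3.5, 4.0, 2.5],
    "gpa_year2": [2.1, 3.2, 3.4, 4.1, 2.4],
    "shoe_size": [8, 7, 9, 7, 8],
})
print(top_correlated_pairs(df, n=1))
```

Running this on a real institutional CSV would surface the handful of pairs worth investigating further, which is exactly the table the app produces.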
And so when we got to this point, we realised, oh, this is really fascinating, because we had already found a way to automate this exploratory analysis, which is great, but there was still a lot more that we needed to add. And so we built on an already existing open science library known as Sweetviz. You can then click on the button that says generate a Sweetviz report. It will run; on the top right-hand side it will say "running". And once that's done, it generates the Sweetviz report. What the Sweetviz report does is give you your full descriptive analysis, with all the graphs for the data set, its distributions, and all of the information in there: which variables each one is closely related to, which ones it's not, etc. All of this is in the Sweetviz report. Now, some of you might wonder, how can I take this Sweetviz report and send it to my colleague or someone like that? The easy way to do this is to right-click and say Save As; it will be saved as a web page. I created a folder already, so I'm going to call it Sweetviz page and save it. And what it's going to do is download all of this aggregated information. Remember, this is aggregated information, so none of the raw data used in here will be stored at all; it's just the aggregated information. And once you're done with that, I'm just going to show you how to access the file. Let me share my whole screen. So I just saved that, and I said it's the Sweetviz page, so I would just go into what I saved, and it would be resource number two. And what's really nice now is that I can open resource number two at any given point in time, I can email resource number two, and it gives me the full descriptive analysis that I just performed. All this information is there. Finally, when you click on the Associations button, it gives you all the associations between the variables.
The squares indicate categorical variables, things like pass/fail, things that are categories, and the round ones are numerical variables, so grade point average, etc. Okay. So let's do another example. Let me share screen number two now. Can everyone see the screen? Nope. No. Yes. Okay, fantastic. So let me take a more complex data set, something like the Open University data set, and load it. What's really fascinating is that it uses high-performance computing, so I know that in some ways the internet connection is very important for this kind of work, and I am streaming now using Zoom, but once you're in, you just load in your CSV data. And you can see that with the Open University data set, which is a lot more complex, you now have a more comprehensive heat map, you have more comprehensive variable pair associations, and you can generate a more comprehensive Sweetviz report, because there are more variables included. And I think this is the main reason why this kind of exploratory data analysis, done in an open science way, is so attractive: as institutional researchers, you don't want to spend your time doing the technical analysis; you want to spend your time on the storytelling and the data science associated with it, because when you want to convince decision makers, you need to take evidence to them, and the evidence that you take to them needs to be comprehensive. For example, all of the data exploration here was just done by the app in an automated way; I didn't have to sit and go one variable at a time and work through it. We essentially created an open source tool that can be used by you for data exploration. So before I continue, let me stop sharing my screen and ask: up until this point, are there any questions? Either everyone is so excited about apps, or you're kind of shocked.
There are some questions in the chat, actually. One is from Satu: would it be possible to use the tool for looking into networks among students' engagement on an LMS? So this is a graph analysis that Satu is referring to. A graph analysis is a little bit different: what you do with a graph analysis is look at the associations between groups of students in how they engage, let's say on a learning management system. We're in the process of building that, because it requires a lot more sophistication than what I just showed you. The idea is, within the next few months, to have the graph analysis tool up and running so that you can use that as well. And it would be very interesting, actually, if you could share with me just the structure of the data. I don't need the variables or the detail of the variables, just the structure of the information, because click data and LMS data is very unique. So it would be great, and please, if you want to, reach out to any of us in our research groups so that we can absolutely collaborate. I think that's very good. Another one is from me, and I think I made a typing error there: can multiple people work on the system at the same time? It's an internet-based, individual-session system; what I mean by that is all of us can go into the system now and independently run our own analyses. What about the privacy of students' information when used here? Okay, that's a good question. There are two ways in which we mitigate privacy issues in this instance, and that's why the open data is shared on the platform. The first thing that you should always do when you use any analysis tool, and at the moment legislation isn't in that space yet, because the legislators don't necessarily understand the complexity involved, but hopefully this becomes part of normal legislation, is the following.
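Just to illustrate the idea behind such a graph analysis, here is a minimal sketch that builds a co-engagement network from hypothetical LMS click records, using only the Python standard library; this is a concept sketch, not the tool we are building:

```python
from collections import Counter
from itertools import combinations

# Hypothetical LMS records: which students engaged with which forum thread.
events = [
    ("alice", "thread1"), ("bob", "thread1"), ("carol", "thread1"),
    ("alice", "thread2"), ("bob", "thread2"),
    ("carol", "thread3"),
]

# Group students by the resource they interacted with.
by_resource = {}
for student, resource in events:
    by_resource.setdefault(resource, set()).add(student)

# An edge between two students for every resource they share, weighted by
# how often that happens -- the adjacency structure of the engagement graph.
edges = Counter()
for students in by_resource.values():
    for a, b in combinations(sorted(students), 2):
        edges[(a, b)] += 1

print(edges.most_common(1))
```

The resulting weighted edges are exactly what a graph tool would then visualise or cluster to show how groups of students engage together.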
— any data that you use and process with internet-capable software. Let's take Microsoft Excel as an example. You might not be aware of this, but if your institution uses Microsoft Excel, especially Microsoft 365, the security measures in ICT need to be in place to encrypt your data as it's used; otherwise you run the risk of leaking data. Anyway, having said that: whenever you work with any tool, software, or device that uses student information and has any link to the internet, the moment you use it outside of PeopleSoft or HEMIS, you need to make sure that the data you're working with has already gone through some form of de-identification. I've seen institutional researchers email student data to one another; I've seen researchers put student data on a flash drive and hand it to another researcher — or not just a researcher, a practitioner — to use. The time for doing analysis and work like that is drawing very close to an end, because the risk of leaking information that way is quite significant. So the point I'm trying to make is that when you use any software with student data, you should, by and large — it depends on the level of specificity you work with — work with de-identified data to begin with. That said, this tool does not store the data in any way, shape, or form on any site or back end; what you work with is only the aggregate of the information. So if you use this app, the aggregate of the information is what you're working with; you're not working with the raw information. The heavy lifting happens in the back end, and it stays there, because we don't want any data to leak. And this is why the tutorial is also important: if people feel scared, for whatever reason, to use this tool, that's fine.
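The de-identification practice described above can be sketched in a few lines: drop the direct identifier and replace it with a salted one-way hash, so records stay linkable across files without being re-identifiable from the data alone. The column names, salt, and records below are hypothetical, and a real pipeline would also consider quasi-identifiers (birth date, postal code, and so on), which this sketch does not cover.

```python
import hashlib

import pandas as pd

# Hypothetical extract containing a direct identifier.
df = pd.DataFrame({
    "student_no": ["20210001", "20210002", "20210003"],
    "module": ["STA101", "STA101", "MAT110"],
    "mark": [62, 48, 71],
})

SALT = "replace-with-a-secret-salt"  # keep the real salt out of version control

def pseudonymise(value: str) -> str:
    """Salted one-way hash: records stay linkable, but the hash alone
    cannot be reversed to recover the student number."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

df["student_id"] = df["student_no"].map(pseudonymise)
df = df.drop(columns=["student_no"])  # the direct identifier never leaves

print(df.columns.tolist())
```

The same pseudonym is produced for the same student number every time, which is what makes joins across de-identified extracts possible.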
We shared the code on GitHub as well, so anyone who uses Python — and we're probably going to write an R version of it too — anyone who uses Python, or Stata, or SPSS, can take this code and replicate it on their own sample. This is what we're working towards, and this is why open science is so important: we don't want the fear of the unknown around data leaks and things like that to be the hindering factor between yourself and exploring the data in new, innovative ways that can provide more meaning for all of us. I hope that answers your question. I think so, and if you want to add more you can, but it would have answered Polite's question as well. They asked: when I upload data, is it stored somewhere on the system? We work with sensitive data. No, there's no back-end data warehouse that keeps that information. All it does is take the initial information and process it, and you will see when you use the app that it times out after a minute or two, because it only briefly stores that information in what is known as the cache — for, I think, half a minute or a minute — and then it clears it, and everything is kept local on your machine, so you're not feeding some server out there that is scraping your information when you use it. But as a rule of thumb, not just for this app but for anything: when you analyze student information — let's say you get it from PeopleSoft — de-identify the information. Unless you're working with very specific individual student queries, you have no need to know the student numbers or the personal details of the students. On average, if you look at the annual performance plan and the indicators that are measured, there is no indicator in an institutional researcher's annual performance plan that specifically requires knowing the names and surnames of your students, as an example.
So unless you're doing an individual query, the moment you work with system-based queries, de-identify the information — and there are many approaches that a lot of our colleagues in the West and in the East use to de-identify information in a way that is safe, so that we can share it. I showed you how some institutional data from universities in the UK, in Spain, and in Africa, in Nigeria, is being shared in a way that is meaningful for other institutional researchers. But I think the basic principle here is that when you work together, you need to make sure that the data quality is really good, but also that it's de-identified in a meaningful way, as a standard practice. Are there any other questions from the audience? You can raise your hand, and then we will recognize you. Nicholas: is there room for customization of the code? So it means, if I've got a file with more than the 250 megabytes you mentioned, I need to run it myself — and at the moment it's only in Python — so it means I need to understand your Python code in order to learn what it does, so that I can run it on my machine. Tell me, is there training that you offer for people who might want to learn how to use the system and also adapt it for their environment? Because I think 250 megabytes is very restrictive in relation to the kind of data that universities work with. I don't want to compromise them by pointing them to open source tools and they start going in there and put their universities at risk. So do you offer training, and how do they arrange for it? So, thank you. Yes, we offer training. What you can simply do is either email us directly, or you can log an issue on the GitHub, which is also useful so that we know there's training that is wanted. So do we provide training? Absolutely.
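On the 250 MB point: when a file is too large for an upload limit (or for memory), the usual Python workaround is to stream it in chunks and aggregate as you go, which is one way the open code could be adapted locally. This is a generic sketch, not the tool's actual code; the in-memory CSV stands in for a large file on disk.

```python
import io

import pandas as pd

# Hypothetical oversized CSV, stood in here by an in-memory buffer.
big_csv = io.StringIO(
    "module,mark\n" + "\n".join(f"STA101,{50 + i % 40}" for i in range(1000))
)

# Stream the file 200 rows at a time and accumulate running totals,
# so the full data set never has to fit through an upload limit or in RAM.
totals = {"count": 0, "sum": 0}
for chunk in pd.read_csv(big_csv, chunksize=200):
    totals["count"] += len(chunk)
    totals["sum"] += chunk["mark"].sum()

mean_mark = totals["sum"] / totals["count"]
print(round(mean_mark, 2))
```

Any summary that can be built from running totals (counts, sums, cross-tabulations) works this way; only analyses that truly need all rows at once require the full file in memory.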
This is in line with open science. I work in the University of the Free State context in an academic capacity, but with an absolute heart and passion for institutional research. And I see someone sharing their screen — this is very exciting; I want to see who's sharing, because this is really cool. Oh, Elizabeth, I see you're sharing a screen. Yes, sorry, my bad — I'm sharing my screen because I want to answer Bongi's question from the chat: what is the best use case for the tool, what is it best suited for? So I guess they want to know if this tool can be used for anything else, or only for one kind of case. I'm just demonstrating: while you were doing your demonstration, I went and got a data set online on flights, and it's the same kind of information that you just showed, so you can use it for any use case. You see — and now you are an expert on the tool as well. Not necessarily, but I'm just demonstrating that the possibilities are very wide. Absolutely. We have a specific vested interest in tabular data — data that you can put in a spreadsheet — because a lot of the data science gets very interesting, but also very hard, when we work with visual data or audio data or some of the more complex things like that. And we're learning that most people, when they talk about data, talk about a spreadsheet, a table, an array of sorts. So with this tool, what we want to develop is a capacity for people in institutional research to explore data — to follow some kind of concise rationale in exploring their data — so that we know how we came up with the results in the institutional research space. This is very important. Now, to answer the question: is this a free tool, or do you need to subscribe? It is free. It's open. It's open science.
What do we gain from it? We advance scientific research. Why is it important? Well, there are a lot of questions we can answer. It feeds into this idea of what the purpose of our education is to begin with, and we're trying to address that by lifting the scientific rigor of what we do in the institutional research space, so that the work we do benchmarks not only the cutting edge but the bleeding edge of these kinds of inquiries. If we want to make institutional research the absolute pinnacle of science in higher education, then we need to make sure that all the scientists working in institutional research have a baseline that is above what we can imagine, so that we can explore data in a very meaningful way. And I think this is why the collaborations are so important. I couldn't agree with you more: no one should be left behind, and I think that is the theme going forward in higher education as well. We want to bring everyone on board when we do that, and thank you for championing that as well. Here's another question: do you also do tabular qualitative data? That's an excellent question. So now we're going into the bigger data science side of the work that we do. Professor Marivate is a champion of natural language processing and qualitative data — extracting topics, modeling qualitative information for insight across large corpora of data. He's published a lot on political sentiment, Twitter sentiment, and multilingual translation. Anyway, the answer is that qualitative data requires a very specific analysis. And if you are interested in such an analysis, and potentially in tools that can facilitate it — because we've built tools like that, so this isn't hypothetical; we have tools like that —
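To give a feel for why qualitative (text) data needs its own kind of analysis, here is a deliberately simple sketch: counting recurring terms across open-ended survey responses, which is the crudest ancestor of the topic-extraction work mentioned above. The responses and stop-word list are invented, and real NLP pipelines (the kind Prof Marivate's group builds) go far beyond word counts.

```python
import re
from collections import Counter

# Hypothetical open-ended survey responses.
responses = [
    "The wifi on campus is too slow for online tests",
    "Slow wifi made submitting my assignment stressful",
    "More tutoring support would help with statistics",
]

# A tiny stop-word list; real pipelines use curated, language-aware lists.
stopwords = {"the", "is", "too", "for", "my", "with", "would", "on", "made", "more"}

# Tokenise, lowercase, drop stop words, and count what remains.
words = Counter(
    w
    for text in responses
    for w in re.findall(r"[a-z]+", text.lower())
    if w not in stopwords
)

print(words.most_common(3))
```

Even this toy version shows the difference from tabular data: before anything can be counted or associated, the text has to be turned into structure, and every choice made along the way (tokenisation, stop words, language) shapes the result.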
— I would propose that you reach out, so that we can arrange a workshop of sorts to teach you how to use them, because they come with a lot more nuance, and qualitative information is a lot harder to deal with than tabular data. I'm not saying tabular data is easy; it's just conceptually easier to deal with than qualitative data. So yes, we have tools for qualitative data, but reach out to us so that we can see how we might organize something. Any other questions? There are very good comments in the chat, if you are able to go through them. Thanks, Elizabeth, and thanks for this informative presentation. Look, I'm just thinking here, because you've made that tool very user friendly, I see. I haven't had time, like Elizabeth just did, to put it through its paces and see what type of statistics it gives you. So I'm thinking: tools are all good for their easy usability, but when things get a little bit complex, it gets frustrating for the end user. But my comment is a question as well, on the fact that with information we refer to garbage in, garbage out on the information side: if you're feeding the tool the wrong information, obviously you're not going to get the relevant solutions to your questions. So I was asking in the chat, with the question that I posted, about the use cases: which ones is the tool best suited for? I was just collecting my thinking here; I was not even looking at the chat. Now, you said it covers almost anything that you give it, as long as it's in spreadsheet or template format. And I'm thinking, because I can get almost any information into that template space and put it there: what types of columns should I really be looking at for me to get the relevant information?
For example, say I'm looking at success rates of a particular cohort of students starting in a particular year, and trying to do the predictive analysis as to who maybe will make it at the end of the year. What type of information do I need to feed into the system for me to get the relevant solutions to my question? So I'm thinking along those lines. I understand you don't have to be a data scientist as such to use the tool, but again, I'm wondering: am I going to get lost on the interpretive side if I'm giving it the wrong information, or the wrong type of data, for the questions I'm trying to get answers to? I'm not sure if that makes sense. Absolutely. I'm really grateful for your comment, and for those who don't see the impact of the comment: this is something we grappled with quite extensively, and we're thinking of getting a postdoc onto it as well. [Crosstalk — a participant is muted.] That is hilarious. Thank you very much for that. Okay, to make this as short and to the point as possible: a system — any system, any tool, any analysis — is only as good as the human intuition that primes it. For example, you can do the best multivariate statistical analysis on the planet, but if you select the wrong variables, you will sit with a correlation-causation fallacy. And it's not the fault of the model; it's the fault of the person who made the initial choices. I'm really grateful that you asked this, because it's a problem very close to my heart. What this tool is intended to do, at a very high level, is let us explore our data in a meaningful way — and I use the word explore explicitly, because I'm not inferring anything yet, I'm not predicting anything yet; I'm exploring the information to understand my data.
If we are not exploring our information in a way that makes sense, and picking — cherry-picking — the variables that we need for further investigation because the exploration pointed us there, then what often happens is that data scientists jump prematurely into an analysis and do the most sophisticated analysis possible on a problem that could have been solved with a simple count — a problem that doesn't need all of that sophistication. It is important that people think — and I know that sounds like a very strong term — that people think through the analysis that they do, and the best way to think through what you do is to explore what you have first. We've seen multiple data science pipelines where people get a prescriptive question — we want to know what the success rates of X are — but there are so many caveats and nuances in the system that by the time the analyst comes up with the number or the figure, so much context has been lost along the way. And I would argue that that context is what makes institutional research so powerful, because the numbers that we work with have people associated with them; the numbers we work with mean more than just the number we get as a result. When policies change in higher education, there are major implications. So, which variables to choose if you use the tool? You go into the Sweetviz report, as an example, you go to any one of the variables and click on it, and you will see it gives you a full breakdown of which variables are more closely associated with it, which are less associated with it, and to what extent. Now, it's not going to hand you the answers that you're looking for, but it's going to point you in the right direction, knowing that you explored the data in a way that's meaningful. So you might see, for success rates: hmm, very interesting that hunger has an impact on success.
It's very interesting to know that access has an impact on success rates, or internet at home. So let's start addressing some fundamental human problems within these systems. I loved what Professor Jansen posted a while ago: do we have underprepared students or underprepared institutions? And I would always argue that the burden is on us to make sure that we have our story straight, so that we can best assist all our students, because I feel very strongly about that question. So on the surface it's about what we can explore, but there's a lot of depth to it. I'm just reading the comments. Okay, are there any other questions or comments that you want to raise? Herkulaas, just one comment from me, and then I will stop asking questions. It's about engagement, when you're working as an institutional researcher — and I think it ties in to the previous question about the best use case and garbage in, garbage out. Take a researcher who is not clued up on some of these technologies and terminologies but wants to do research: what would be your guidance for that person? What must they do in order to use the system properly and understand what is happening behind it? Do they need to partner with you? Or can they find someone who is well versed in the same system, or in a tool like Python, to help them understand the workings of what is happening? And also, on the basics, when you go into this system: what are the requirements? Do I need to know statistics? Do I need to know correlations? Do I need to know what NumPy means?
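The "explore first, infer later" point above can be made concrete with a tiny example: before any modelling, simply comparing the outcome across groups of a candidate variable shows whether it is worth investigating further. The cohort below is invented, and a comparison like this is exploration, not evidence of causation — exactly the correlation-causation caution raised earlier.

```python
import pandas as pd

# Hypothetical cohort extract: pass/fail against home internet access.
df = pd.DataFrame({
    "internet_at_home": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
    "passed":           [1,     1,     0,    1,    1,     0,    1,     0],
})

# Exploration, not inference: compare pass rates across the groups.
pass_rates = df.groupby("internet_at_home")["passed"].mean()
print(pass_rates)
```

A large gap between the groups is the kind of signal the association breakdowns surface automatically, pointing the researcher at which variables deserve a proper, carefully designed follow-up analysis.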
Yeah — so what would you suggest that person do in that regard? Firstly, thank you for the question. And I'm really grateful that I'm not sitting in a room physically in front of you all, because I might get shoes thrown at me for what I'm about to say — but I accept full critique, and I'm glad this is being recorded. What often happens in the 4IR space is that uninformed people, in general, like to throw technology or a tool at a problem, because their own bias is being covered up by the potential of another system or another tool. What I mean by that is: in a lot of spaces, we had unanswered questions about student success ten years ago, and we still have unanswered questions about student success ten years later — with the difference that we've now got dashboards that could help us find these answers, but we're still not finding them. It becomes less about the tools and more about the fundamental thinking and questions around them. Now, if you want to help build capacity in terms of these tools: I'm a firm proponent of the SAAIR, and I absolutely believe in the platform of the SAAIR as the web that collaboratively links institutional researchers with academics like myself who care about institutional research, so that we can strengthen our own capacity. So I believe that engaging with one another through projects and the like is a good start. But we are moving into a time where data, and understanding information and data, is becoming more important. And in order to best learn data, you need to choose a data type — whether it's text, whether it's tabular data like a spreadsheet, whether it's voice, whether it's images; there are other data types, like ledgers and things like that, but we're not going to go into those.
Choose a data type, and then become proficient at the tools, the languages, and the kinds of analysis that happen around that data type. So if you look at qualitative analysis: okay, are you using natural language processing? Are you using LSTMs? Are you using machine learning to help you analyze the information? Are you using some kind of software like NVivo to code your data, or whatever the case may be? It's easier to start with a very specific data type, and then from that data type branch out to see what the industry standards are. This is where it gets tough: industry standards differ between universities in our context. Every institution has its own critical mass and capacity of things it can and cannot do, its own capacities of things it is exploring and developing. And I think this is where my excitement comes in: we have not yet reached the epitome of what we can do with data. The tool that I showed you today is really nice, but in my opinion it's still elementary compared to where we should be. We should be answering fundamental questions in higher education, like: what is the point of higher education? We need to provide a tangible answer to that. Or: are the student support interventions that we're giving in higher education enough to support our students? Are we creating perverse incentives by chasing the annual performance plan indicators? What I mean by that — and we saw it today — is that we want a success rate of a certain percentage. But are we willing to compromise the integrity and the quality of higher education to reach that target? If the answer is yes, then we've lost the point. So I would argue that anyone who wants to enter the space needs to have passion and the ability to learn, because if you don't have passion or the ability to learn, then unfortunately this might not be the right industry for you.
But if you have those two fundamental baselines, learn a programming language, and learn what kind of analysis is being done to convince decision makers in your domain. That's a good start. And if you still cannot convince your decision makers despite having evidence, please send me an email so that we can explore how to take it further. But it's very important that we collaborate, and I think the SAAIR is a wonderful network for this type of collaboration at a high level. Thank you, Herkulaas. Are there any other questions? Now is your chance, your turn, because after this we're going back to work. No questions? No comments? Okay — I see there is one comment: everyone loved your presentation, excellent discussion, thank you. Thank you. So it means you did the deed. Thank you, Herkulaas, for the wonderful presentation you gave us today, and for sharing the work that you are doing at UP and Wits. It's an excellent tool, and it's a step in the right direction, because — going back to when we did the introduction to launch the learning analytics special interest group — one of the things I mentioned was that we want to share, collaborate, and have a community of practice in terms of analytics as well. And I think this is one of the ways you are helping to make sure that we're not leaving any institution behind, whether they have the capacity to do something or not. With tools like this, every university can plug in and get the results without having to say, oh, we don't have a data analyst, we don't have this, we don't have that. A tool like this can advance the work they need to do without their having that capacity in their environment. Thank you — we appreciate the good work you are doing; continue inspiring us and doing the great work.
Thank you very much, Elizabeth. This just gave me an idea, and I think it might make things easier for other people: what I can do over the next few months is build an offline version of this tool, probably a very small app, so that, at the very least, you can have it on your computer and use it locally. Maybe that can help some of the colleagues that I know have challenges with internet connectivity at some institutions due to load shedding. Thank you — we will be in touch with you, and we will hold you to that promise, because I don't forget. So thank you very much for that as well, and we will also place the link for the tools and everything on the SAAIR platform, so that everyone who missed it here can also get it later from the platform. Thank you. Now we are at the end of the program. I'm going to post the same message that Karen sent earlier: please remember to complete the feedback form. Your input, your comments, and your suggestions are valuable to us in order for us to improve, and we rely on your input. You should have already received an email from Karen this morning, or even yesterday; please make sure that you complete the feedback form. The recordings and the presentations will be made available; I will communicate with Ashton to put them on the website as well, and Karen will send you an email once everything has been uploaded so that you can have access to the recordings and the presentations. Thank you. Now, before I do my own thank-yous, let me hand over to Lerato Dilekena, our Vice President of SAAIR, to do the final thank-yous and closing. Thank you, Lerato. Thank you very much, Elizabeth. Wow, Herkulaas, that was wonderful, wonderful, wonderful. It fits the phrase: save the best for last.
Everything was very good, but that was also very, very good; we really appreciate your effort, and your presentation was super excellent. On behalf of the SAAIR Exco, we would like to send our heartfelt gratitude to every one of you: to the delegates who took time out of their busy schedules to attend; to the presenters and facilitators of the workshops; to the host, the University of the Western Cape; to Elizabeth and her team; to the SAAIR Exco members who worked tirelessly behind the scenes; and to the facilitators of all the sessions. We cannot say how grateful we are to you. And last, but not least, we are very grateful to the only employee of the SAAIR, Karen, who also worked tirelessly behind the scenes to make sure that this became a success. This has been an invaluable workshop, which started yesterday. There's so much that we are taking home — I wonder where we're going to start when we get home — but the biggest takeaway for me, and I'm sure for other people too, is that there is a student behind the data. I think that is the biggest takeaway. Sometimes we get so deep into the data crunching that we forget there is a student behind the data. And one other very big takeaway is that the tools are good and they are there to help us, but the data does not speak for itself; we still have to go and really do the synthesis of the data. What does it mean? Exactly that idea that there's a student behind the data: it's not just numbers, it's not just whatever you have coded it into. This means a livelihood for some people, the future of our country, the future of the continent itself. So I think we should go back, recharge, revitalize, get behind our own desks, and really do some valuable things for our institutions. With that, I would say thank you very much. Thank you, Elizabeth. Thank you, everybody. Over to you again.
I also want to give my appreciation to my team, the people I work with from institutional planning — Bradley, Kumalo, Kamelita, Benjamin, Andrew Maguola — you really came to the party and assisted me with the running of the workshop; I couldn't have done it without you. Thank you. Know that your efforts didn't just evaporate; they went somewhere, and we hosted a wonderful and inspirational workshop. With that, I also want to thank everyone who heeded the call, registered for the workshop, and participated in it; to thank Takalani and Lisa for the media; and also to thank Karen for all the help and assistance — because I kept on asking, are we getting there with the numbers, are people registering, I'm worried — but thank you, Karen, for managing the administration of this workshop, and of the institute, as well. I really appreciate all the effort, and I couldn't have done it without Ashton either, because he really came to the party to assist me with most of the things that we planned for this workshop. And thank you to UWC, my university, for allowing me to host this year's learning analytics institute at UWC; I really appreciate it, and I hope everyone who attended has learned as much as I've learned over the last two days. And I hope to see you again next year, in 2023. We're looking for the host for 2023, so it's up for the taking; let's see who can up the game from where we are now. It's about improvement: improve where we fell short; you can do much better. Good luck to the next host for 2023 — the call will go out. Karen, do you have a last word in terms of the logistics? Sorry, I was busy working here. Sorry, Elizabeth, just repeat that again. Anything they need to know about the logistics? Otherwise, you will send them an email — is that right?
Yeah, we can discuss that further, but if anybody is interested, they can just contact me and I will speak to them. Okay, thank you very much. I'm going to do things differently: I'm going to open the floor for one, two, or three comments, for you to say anything you want to say right now — your takeaway, your appreciation, your comment. You can open up. Dorothea, do you want to say something? Yes, just to thank Elizabeth and all the presenters as well. It was really quite an amazing two days. And thank you, Elizabeth, for the before and during and after — the way that you hosted us and kept things going; beautifully done — and the presenters were really lovely. And yes, for me the take-home is also that there's a student, a live person, behind the data, as the professor said. Thank you, colleagues. My last person? I'm sorry, I forgot how to raise my hand. Oh, you can just open up. Okay, so I'm Leon, and I just want to say thank you so much for this opportunity to attend this workshop, because it's so similar to the research that I'm currently doing. Oh, I must stop my video. We just want to see you for the last time. Okay. Sorry, that was the goal. Can you see me? Yes, we can see you. Oh yes, I was saying it's so similar to the research that I'm about to do, because I'm also going to be doing research on student capacity as well. So for me it was really, really helpful; I learned a lot, and I feel so motivated to start with my research now — it was really helpful to me. I'm so grateful. Thank you, Elizabeth. Thank you. I'm going to take the last person, and then I will release you to go and enjoy your long weekend, or go back to work — that will be up to you. Anyone? If there are no others: thank you.