can also be converted to text. Okay, so what's the problem with text? It's unstructured. It lacks metadata, which means it can't be incorporated into a standard database field. It's still valuable; it's just harder to analyze. Sometimes it might be semi-structured, so it's got a bit of both — a bit of structure — and yes, we can take those structured pieces and put them into a database. So up here we see Care Opinion, which is a website where people can provide feedback on hospitals, and now on aged care as well. Some governments around Australia are using it as a way of getting patient feedback; otherwise it's just open for anyone to comment on. This part is the unstructured part: someone wrote something here. This part is structured, because these are specific responses they've clicked on, and it's come up with staff numbers, timely care, waiting time. So we could count the number of responses across many people relating to waiting time, as an example. That would be easy. How we analyze the free text — that's a little more difficult. So text has value for evaluation. Why? Because it enables the person who wrote the opinion, or the piece of text, or stated it if it's a recording, to use their own words. It's not a tick box. I'm not having to respond to a predefined question with a predefined answer. So it enables the emergence of unexpected patterns: patterns that the evaluator or the researcher perhaps wasn't anticipating, or that the organization that engaged the evaluator might not be expecting. And we can detect those. And it can be used at scale, so we can analyze many documents, and many opinions within those documents. So what can we generate? You've probably all seen something like this; it's called a word graph. You can get free generators on the web: you can cut and paste text in there and it will come up with a pretty picture like that.
You can do that in more sophisticated tools as well, so that's a starting point. The size of the text indicates the frequency with which the word appears, and these were the common words taken collectively over all the pieces of text we analyzed. Sentiment analysis you may have heard of. Think of it as: did I like it? So I've got a positive view of whatever topic we're talking about, or I didn't like it and I've got a negative view. An easy way of visualizing this: if you've been through some of the airports that have the smiley faces and the unhappy faces, it's like clicking on the smiley face — very happy, maybe just happy — or one of the unhappy faces. What people perhaps don't realize is that a particular piece of text may be both positive and negative. It is possible to detect both within text, unlike the airport-style, simple click-on-the-face data collection points. Topic modeling is where we collect themes, and then there's content analysis. Some of you may have heard of the software that comes from Australia called Leximancer, which looks at content. It digests lots of text and can come up with a social-network-type presentation that shows the relationships, and you can create a story based on the common themes coming out of the text, as well as looking at how frequently particular topics appear. We can also take the text analysis and put it into machine learning models for classification and prediction purposes, and take it one step further and potentially go into AI areas — but perhaps that's beyond evaluation and more into pure text analysis. This is where I'm bringing in something new, compared with the conference presentation. I got my PhD student Curtis to do some analysis, and he was doing a couple of health sites. What he was struggling with was: well, how do I go about it?
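As a rough illustration of how a lexicon-based scorer can pick up both polarities in one response, here is a minimal Python sketch. The word lists are made-up stand-ins, not the real dictionaries that tools like MeaningCloud or KNIME's sentiment nodes ship with:

```python
# Toy lexicon-based sentiment scorer. The two word lists below are
# illustrative only — real tools use large, curated dictionaries.
POSITIVE = {"good", "great", "helpful", "friendly", "caring", "happy"}
NEGATIVE = {"bad", "slow", "rude", "terrible", "unhappy", "painful"}

def sentiment_counts(text: str) -> tuple[int, int]:
    """Return (positive_hits, negative_hits) for one piece of text."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return pos, neg

# A single response can score on both sides at once — unlike a
# one-click smiley-face kiosk, which forces a single choice.
pos, neg = sentiment_counts("The nurses were friendly but the wait was terrible.")
```

Here `pos` and `neg` are both non-zero for that mixed sentence, which is exactly the "both positive and negative in one piece of text" case described above.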
Well, we need a framework, but I can't find one. So we came up with the DATMAV framework — and yes, we gave it the name DATMAV; we put some words across there that form an acronym. It's really about ensuring that there's a design process that happens first. Why are we doing this? As evaluators, we have to ask that question: why are we here? What are we trying to do? It doesn't matter whether it's for pure research or for evaluation — here we're designing it. In Curtis's case, he was looking at subreddits, or he was looking at Care Opinion, so he's framed it as social media analysis. It doesn't matter, okay? You can just put the words "text analysis" in there and it's fine. We're also interested in specifying a timeframe, because on the web, depending on the source of the material, you can go back quite a way now — and if you do, you're potentially going to capture a very large volume of data. You may not want all of that: you may only want older data, or you may only want very recent data. Then you have to acquire it. If you're doing it like Curtis did, you scrape it. If you're using documents, the client may give them to you in electronic format. If it's a survey, then you're acquiring it through that process. This part here is "sample if necessary". What we encountered with some of Curtis's work was that we needed a supercomputer to analyze the data, because we'd acquired so much of it. So is there a better way? Can I sample? Yes, you can — and if you can sample, then you may only need to look at the sample data. Then you need to pre-process it. Text data on its own is great, but it contains a lot of words that have no meaning in terms of what we need to look at. So we get rid of things like stop words — you'll see that in here, "remove stop words": words like "a", "it", "is", "the", all the words we use to make our sentences make sense and join up nouns and verbs, we remove. We get rid of numbers.
We get rid of punctuation. We change the case to make it all lower case. And we sometimes remove uncommon words. Then we've got clean data. Once we've got clean data, we can apply text analysis modeling to it, create some analysis, and then visualize it to create things like the word cloud or word graph you saw before. So we now have a framework for it, and you can access the preprint of that paper — the link's there. If you can't write it down, I'm quite happy for you to get these slides afterwards so you can grab the link. So what are the tools? Yes, you can read text, but not at the volumes we're talking about now. You can use Excel to do some quite limited things, like searching and term counting, and from that create graphs based on frequencies, or heat maps, not just on the presence of words. You can get add-ins such as MeaningCloud and do sentiment analysis in Excel. You can use node-based programs such as KNIME and RapidMiner. These programs are much more powerful than Excel, and they also mean that your workflow is documented and repeatable — there are no mistakes. There are code-writing programs such as R, a statistics package, and there are tools that give very specific outputs, such as LIWC — Linguistic Inquiry and Word Count — which gives word counts plus some other measures. Leximancer is the one I showed before, with the social-network-style word maps. And there's a whole variety of specific, almost single-purpose software. Why open source and no code writing? Well, R is really powerful, and so is Python, and there are some other packages out there as well now — except you need to know how to code. A lot of people in the evaluation space come from backgrounds where coding was not a requirement to complete their studies. That's a barrier. So rather than going with a coding option, there are now packages where you can get past the need to code.
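The cleaning steps just listed — lower-casing, removing numbers and punctuation, dropping stop words, and sometimes removing uncommon words — can be sketched in a few lines of Python. The stop-word list here is a tiny illustrative subset; real pipelines use the full lists that ship with KNIME, R, or NLTK:

```python
import re
from collections import Counter

# Illustrative stop-word subset only — real tools ship much longer lists.
STOP_WORDS = {"a", "an", "and", "the", "it", "is", "was", "to", "of", "in", "i"}

def clean(text: str) -> list[str]:
    """Lower-case, strip numbers and punctuation, remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation
    return [t for t in text.split() if t not in STOP_WORDS]

def drop_rare(docs: list[list[str]], min_count: int = 2) -> list[list[str]]:
    """Optionally remove uncommon words, measured across the whole corpus."""
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]
```

After `clean`, a comment like "The wait was 45 minutes, and it is too long!" reduces to its content words, which is the "clean data" the framework hands on to the modeling step.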
KNIME is a really good example of that; I'll show you what it looks like in a minute. Open source means it's freely available, and that's great. Some of the packages have limited free access: RapidMiner is very similar to KNIME, and MeaningCloud has a free version, but it's quite limited. KNIME is free and not restricted in terms of what you can do. RapidMiner only lets you analyze, I think, 10,000 records at once; KNIME can analyze as many as you can fit on your computer, until you move to the enterprise option. Once it goes enterprise, there's a fee they charge — but for me to run KNIME, I can run it for free. And it's easy to use. So what does it look like? If you're familiar with Excel, or maybe SPSS, or other packages that work with data in a traditional way, they have the data in columns and rows. You don't see that in KNIME. What you see is a series of nodes connected together to make a workflow. Here we have a CSV file reader, so we read the data in. We process it — that just gets rid of some basic things like the stop words. We transform it, we get some frequencies, we get some aggregation occurring here and some colouring, and we create a word cloud — we're going to have a tag cloud as well. There is no programming you need to do this. All you need to do is grab the icons, drag them onto the workspace and connect them. The only thing you really have to do is sometimes specify limits on what you want. If you want infrequent words to be removed, how infrequent, as a percentage? For the file you're reading, where are you getting it from? There's a way of pointing it to the right file. So it's very easy to use, and it means it's available to people who can't program. Okay — so it's ideal for evaluators: it's free and it's easy to use. Some steps to the text analysis, as I said before: that processing or cleaning, where we remove numbers, stop words, punctuation, things like that.
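That node workflow — read a CSV, clean it, count frequencies, feed a tag cloud — maps onto only a few lines if you do want to code it. This Python sketch uses made-up comments, and the commented-out `feedback.csv` / `comment` names are hypothetical:

```python
from collections import Counter

def word_frequencies(texts: list[str], stop_words: set[str]) -> Counter:
    """Count words across all texts, skipping stop words — the table a
    tag-cloud node would size its words from."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            word = word.strip(".,!?")
            if word and word not in stop_words:
                counts[word] += 1
    return counts

# In practice you'd read real data, e.g.:
# comments = [row["comment"] for row in csv.DictReader(open("feedback.csv"))]
comments = ["Waiting time was far too long", "Great staff, short waiting time"]
freqs = word_frequencies(comments, {"was", "too", "the", "a"})
# freqs.most_common(20) gives the top words to render in a word cloud
```

Each function call here plays the role of one KNIME node: reader, preprocessing, frequency table, then visualization input.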
So it's in a consistent form. We then get frequencies, a word graph, sentiment analysis. And that's when we take it into our evaluation or research work. This is an example of topic extraction, and I'll show you some results on topic extraction later. Here it's a much more complicated workflow. That's not to say it's not doable by an evaluator; it just means there are skills and some knowledge you need to go and get, or learn, in order to reach this point and create the output you need. The output depends on the analysis. I've just got a simple word graph here — this is a word cloud developed from the Care Opinion data — and sentiment analysis as a simple histogram. We all know what a histogram looks like, I hope. Now, if I'm new to something like KNIME, where should I begin? I can go to KNIME; there are a few free downloads for first-time users — this is how you do it. Robert Caldy — Cadilly, sorry — created a free course on how to use KNIME in 66 days. It's generally doing simple tasks — reading the text, making some charts — and building up the skills bit by bit, spending maybe 10 to 15 minutes a day. You don't have to take 66 days, right? It's just limiting the time you spend on learning. So in two months you can be reasonably highly skilled at using KNIME, and you can do all these sorts of things, including text analysis. What about more advanced analysis? R and Python are where you can find some really interesting things, but they require, I guess, greater skill levels in terms of programming. Curtis created these diagrams. This is a heat map of text coming out of COVID posts. Would we use this in a normal presentation or report? Probably not, because it requires a level of skill to read it. It looks pretty — that's why I've included it today — but it's hard to interpret.
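Full topic extraction needs dedicated libraries, but the TF-IDF weighting that many text-analysis tools build on can be sketched with the standard library alone. This is a toy illustration of the idea — words frequent in one document but rare across the corpus score highest — not the workflow from the slide, and the example documents are invented:

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score each word per document: high if frequent here, rare elsewhere."""
    n = len(docs)
    df = Counter()               # document frequency: how many docs contain the word
    for doc in docs:
        df.update(set(doc))
    scored = []
    for doc in docs:
        tf = Counter(doc)        # term frequency within this document
        scored.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scored

docs = [["scan", "gleason", "score"],
        ["scan", "radiation", "therapy"],
        ["doctor", "payment", "doctor"]]
scores = tf_idf(docs)
# "scan" appears in two of the three docs, so it is down-weighted
# relative to document-specific words like "gleason" or "doctor".
```

This weighting is what lets distinctive terms surface per group of posts, which is the intuition behind the topic results shown next.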
This one, though — this is where we used word clouds at the very beginning of COVID, and we looked at what was being posted. We could see that people were losing their sense of smell and taste very early on. This line shows us the trend: on day one, a few people are talking about it, but not many; by day seven or eight it's peaking for those who are going to lose their sense of taste and smell; and then discussion around it declines over time, through to day 14. We were able to extract, from this particular subreddit — Reddit being a social media platform — the timing of people's posts in terms of the day of their infection. So we could get a lot out of it. Another, more sophisticated piece of analysis is looking at the things that emerge over time. This one related to prostate cancer. The initial part of the story arc: dad was diagnosed — a lot of the posts are about dads. Sometimes it's the person themselves making the post, but often it's about dad. "Left" and "mid" tend to refer to position in the prostate. You'll see that we still get some odd terms coming up: a lot of posts were made in January for some reason, and that's entered the text. This is where it helps to have someone with content knowledge, as opposed to just a data scientist, look at it. Because Curtis is a data scientist, he'll go, yeah, January's in there. Why is January in there, Curtis? Go back and understand the context. Gleason scoring — the Gleason score — comes up, and scans: one of the early things you do is get a scan when you're being treated for prostate cancer. There might be some symptoms, pain and blood, radiation therapy. We get past treatment and we're talking about quality-of-life issues afterwards, and relationships and sexual function. Then over time we begin to talk about the doctors, the experience, and payment as well.
So we can paint quite a complex picture, all on the basis of unstructured text, and that has value. Now, it's easy enough to do, but one of the things to consider is your computer. Social media posts can be small but numerous, and that can very quickly amount to a lot of data. If you're working with books or published reports — royal commissions are a good example; I've digested various royal commissions using Leximancer — they contain a lot of text, so just be mindful of what you're doing in that sort of context. I can do it on my laptop, but if I'm going to be doing a lot of it, with a lot of text, I might want a more powerful machine: increase the amount of RAM in the computer, or alternatively use a desktop to do it. The more sophisticated the analysis becomes, usually the greater the grunt you need in your machine. That is one of the downsides. Ethical considerations — people often ask about that. As Lewis, who gave a presentation to the SA branch recently, said, it's still the wild west out there in terms of using social media and text analysis. Some ethics committees are concerned by it; others are less concerned. It may be about who owns the source of the data, or people being identified. In Adelaide, for example, you can use Twitter data relatively easily, or Reddit data easily; Facebook data poses a problem. Survey open-ended questions are more difficult, because you've got to get the survey approved in the first place. So it just depends. Is consent an issue? Usually people are posting of their own free will — but if you decide to facilitate that by posing a question on social media, suddenly consent might be an issue. So ethics is yet to be sorted in this space. Even though it's doable, I still think there's a fundamental question — what is the evaluation problem? Why are we doing this? That design question. Is text analysis going to provide valuable insight? If yes, then we proceed.
If so, what text am I going to use? Is it a survey? Is it a report? Is it from social media? There are lots of options, and it's not just social media, obviously — it could be YouTube, it could be videos, it could be anything I can extract text out of. Platforms like KNIME are free and easy to use, so it's accessible to you. Before we started, there was a PhD student — sorry, I've forgotten your name — who's going to look at potentially some text. KNIME could be the perfect platform for you. Okay. The other thing: you don't need to code at all. It does require some knowledge, and that's easy enough to get. But if you go for the more sophisticated stuff, a lack of programming experience may mean you need to get a data scientist in to help. If I was starting again, where would I begin my journey? Well, I've used all of these books. This one's from KNIME — it's a very good publication. This one, if you're interested in R, is a great little book, and it's also got some valuable pieces about using text in it. Practical Text Mining is a wide-ranging book; it covers a variety of platforms for analyzing text. This one is probably a good introductory book, maybe a little too simple. And again, this one has some good KNIME examples. So there are lots of resources there if you want them. Any questions? Feel free to ask — not just now; afterwards you can contact me via my email, you can even ring me. My details are there. So thank you for listening, and I'll hand it back to Greg. Awesome stuff. Thanks, Mark. Lots of food for thought there. What we might do now is get people to ask questions of you, but I might kick it off, because I suspect quite a few people are like me: over many, many years they've sat there with a handful, or a large pile, of open-ended survey responses or whatever, and done the crude and rudimentary analysis to identify — a bit like what you were saying with your clients —
come up with a few themes and say, yeah, here, our client — here's the four or five things. When is that sort of rudimentary approach advisable, and when is the more sophisticated approach you've been talking about advisable? So — I can move quite quickly now from the crude and rudimentary to doing sentiment analysis, as an example, very easily and at almost no cost, even with a small amount of data. If I was an Excel user, I might be keen to give MeaningCloud a try, just because it's easy and I can use it. Sure, I couldn't use it every day, if I had lots and lots of analysis, without buying it, but I certainly can use it a few times a month, and for a small amount of data it will give me some really interesting results. It will be able to get positive and negative sentiment out of my data on a more consistent basis than I can. It's not to say I couldn't create a manual method to do it, but that method would vary each time. Now, there are pros and cons of sentiment analysis, because we use a dictionary, and it is working out whether the text is written in a positive or negative way. So if we're working in other cultures or other languages and we're using an English dictionary, sometimes that can be problematic. But let's assume we're just dealing with English. Then I could do that very, very early on and get some incredibly powerful results. And I'd suggest doing it sooner rather than later, because I've seen with clients the difference it makes: okay, we thought we had this, but it brings out findings they didn't know were there. To get to the more heavy-duty end of the work — when would I want to get there? Probably not for a while. And what sort of volume? For simple sentiment analysis, 100 records is more than enough; you can even do it on 50.
You don't need a lot to be able to get value, because once you're getting up to 50 responses, that's a lot for someone's head to take in — to remember everything that's been said, work out whether it's positive or negative, and work out what the common responses are, that word-cloud sort of thing. So I'd be starting fairly early on; there are some simple tools to use first. When would I go to KNIME? I'd go to KNIME reasonably early, but only if I was going to be repeating things and was interested in going to the next step. Creating a word cloud you can do easily enough on sites that do word clouds. But if I'm going to do it over and over again, and I want sentiment analysis, and I might want some other analysis such as topic modeling, then I'd be learning how to do that sooner and have it in my toolbox so I can bring it out at any time. It's not hard to do. You just need to be mindful that if your computer doesn't have a lot of grunt, it can take a little while, so you might run it overnight. To give you an example, the Care Opinion data set we've got is about 11,000 records, and some of those are significant essays while some are just paragraphs. So it's a pretty big file, and it takes about four hours for me to run on a standard laptop. On a larger laptop it's no big deal, much quicker, but I didn't get a lot of extra RAM in the laptop I'm using at the moment. So it's horses for courses — but unless you dip your toe in the water, you're not going to learn whether this is for you. You've got to give it a go. Right. And in your comments there, you did allude to client reaction upon presentation of results, which is pretty much the next question I was going to ask: often we have clients and organisations who are understandably skeptical about qualitative data. How has the use of some of these tools changed that — or does it change it? So, it's interesting.
In some ways we're really entering a period of: is it qualitative? I guess I prefer the term unstructured data in some ways. Why is that? Because it's a lot of data, right? I'm getting up into the thousands, and Lewis has dealt with millions of text records, so we're talking large numbers — and once you get to large numbers, we're starting to talk more stats than just qualitative. There is a maths element to this. So, unstructured. Yeah, I think that's really useful. How do the clients react? Well, when you can present a graph or something visual that shows them, say, positive and negative comments — here's the graph that shows we've got a lot more positive comments than negative comments — they respond to that. When you add, "and by the way, here are some of the pieces of text aligned to either the negative comments or the positive comments", then they go: oh, now I get what it means. It's combining both, not just either/or. They still like to see — so what do you mean, a positive comment? Well, here's an example of some text that was rated quite highly as positive. Oh, I get it, right? So they connect with it when they see both: not just the graph, or the percentages of positive comments, but the numbers or graph plus some words. Combined. Okay, thanks, Mark. Let's open it up for broader questions. If I can just encourage people who have their cameras off — it's always nice to have them on; you may not be able to, or whatever, that's fine, I know Karen's got problems with her camera — maybe put your icon hand, or other hand, up. I know Ken, you've already put your real hand up, so shoot, go for it. Thanks, Greg. Hi there, Mark. The question I've got is around qualitative feedback, say in a survey, where the person will say, I really like, you know, aspect A, but in the same sentence say, but I dislike, or I hate, or this is terrible, in terms of aspect B.
So in your experience, how do you find statements like that, and how do you try to code them, or use the various software applications you've got experience with, to filter that? So — on the "I like A but not B" case: the sentiment analysis will pick up that in the one response I've got a positive and a negative view. And it's not uncommon — I should have preempted that and had a few extra diagrams for you. It's not uncommon to see a distribution where we've got some purely negative, some purely positive, and a large group in the middle that have a bit of both. What you can also do with the text is break it up. If your concern is, "I still can't tell — this person's mixed, right? What does that mean?" — I can break the text up based on the use of full stops and create two fields. So I can break the text into two or more pieces. Beyond that, I can't necessarily do much unless you really knew that everyone was going to be talking about topic A and topic B — maybe it's fish versus meat, or something quite distinctive. If it's looking for fish and meat, I can break those sentences or phrases up based on "fish" and "meat". So keywords, or full stops, let me break up the text. That would be the other way I'd potentially start to handle that. Okay, Peter. Thanks, Greg, and thanks, Mark, for your presentation — it was really interesting. Just as a bit of introduction, I'm an evaluation specialist at Australia's National Research Organisation for Women's Safety, which we call ANROWS for short. It's great to be here. Mark, I've dabbled a little bit in doing word clouds in R, and I've also done a lot of other kinds of text analysis or qualitative analysis, such as conversation analysis, critical discourse analysis and so on. And I'm interested to know, with something like KNIME, can we do analysis that shows how different groups of people within our data have expressed particular views or sentiments or whatever?
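That splitting idea — break a mixed response on full stops, then score each piece separately — looks something like the following Python sketch; the tiny word lists are, again, illustrative stand-ins for real sentiment dictionaries:

```python
# Illustrative word lists only — not a real sentiment dictionary.
POSITIVE = {"like", "love", "great"}
NEGATIVE = {"dislike", "hate", "terrible"}

def score(clause: str) -> str:
    """Label one clause as positive, negative, mixed, or neutral."""
    words = {w.strip(".,!?") for w in clause.lower().split()}
    if words & POSITIVE and words & NEGATIVE:
        return "mixed"
    if words & POSITIVE:
        return "positive"
    if words & NEGATIVE:
        return "negative"
    return "neutral"

response = "I really like the food. The waiting time was terrible."
parts = [p.strip() for p in response.split(".") if p.strip()]
labels = [score(p) for p in parts]
```

Splitting first turns one "mixed" record into a positive clause and a negative clause, which can then be counted or graphed separately.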
For example, suppose we had responses from both doctors and nurses — could we show how their responses differ? In other words, can we sort of correlate with other variables in our data? Yep. That sort of analysis is really easy in something like KNIME, assuming that in your data collection you've got your text and you've got some other variables along with it — something to say "I am a doctor", "I am a nurse". If that's there — particularly if it's, say, from a survey, where one of my responses is my role and there's an open-ended question where I've given a whole lot of text — yeah, it's easy. It's trivial. You can filter it on the basis of the other variables. Okay — yeah, Candice. The dreaded not-muted Candice. Thank you, Mark, for your presentation. I had a question, I guess, in a similar vein to what Peter was mentioning before, around working with, I guess, cultural diversity — diversity in your data set — or people in vulnerable situations as well. There's a particular level of sensitivity or cultural awareness that needs to come into your thematic analysis, I think, at some stage. Do you have any, I guess, resources off the top of your head — if not, we can chat later — for understanding the language side of things, the particular meanings behind the way people communicate? Because when we start getting into the AI-based stuff, it's not as savvy as humans when you start looking at idioms and double negatives and things like that. And for some of us, if we're not working with large data sets and we don't have a language background, it's pretty challenging, I think, to derive meaning in this space without that structured or technical grounding, whether that's language- or AI-based.
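A minimal sketch of that kind of group-wise analysis in Python, with invented records: the point is simply that keeping the role variable alongside the text makes the doctors-versus-nurses split trivial.

```python
from collections import Counter, defaultdict

# Made-up example records — in practice these would come from the survey,
# with the role variable collected alongside the open-ended comment.
records = [
    {"role": "doctor", "comment": "staffing levels are unsafe"},
    {"role": "nurse",  "comment": "staffing rosters change too often"},
    {"role": "nurse",  "comment": "rosters leave shifts uncovered"},
]

# One word-frequency table per group, keyed by the role variable.
by_role: dict[str, Counter] = defaultdict(Counter)
for rec in records:
    by_role[rec["role"]].update(rec["comment"].lower().split())

# "rosters" is a nurse theme the doctors never raise in this toy data.
```

The same filter-then-analyze pattern applies to any grouping variable: site, time period, demographic, and so on.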
So — one of the key things that comes out when you're reading about things like sentiment analysis in particular is the notion that language matters, right? Australians are famous for sarcasm: that notion of saying something negatively when it actually means the opposite. How do we cope with that? That is still a problem. There's training data which is used to build the libraries for these analysis platforms, and many of the libraries come out of universities in the States. Now, these libraries are being expanded, and there's also the ability to add to them — but as an evaluator, if you're doing this as a consulting piece of work, be that within an organisation or for an organisation, you're not going to say, "Oh, this is great, I'm going to use this tool — and by the way, here's the time we need to invest to create this library." So there are limitations. Yes, I've given the overview, but you do have to be aware: we live in a multicultural society — what does that mean in terms of representation? If you're using social media data, as an example, or blog posts: who is writing these posts? What's their background? What's their language? Is it representative? Take Care Opinion, the one we've used in some of the examples: you would say whoever is contributing to that is probably from an English-speaking background in Australia, and probably reasonably well educated. So is it representative of everyone? This is where volume suddenly becomes important: if we get enough messages, maybe that's less of an issue for picking up some of the aspects we're looking at. It doesn't mean we ignore other ways of collecting data, and it's certainly not the be-all and end-all. If I'm looking at a survey, that's different. Then I can come back and say, okay, my survey went to a particular population, it had a structure like this, and yes, we did get the responses, and yes, we did allow for language.
How do we capture the open-ended responses? Did those people who perhaps didn't have strong English skills not give us any responses to our open-ended questions? We can look at that. If there's an underwhelming response to our open-ended survey question — why is that? Was it a difficult topic? Was our population not going to respond, or unable to respond? So yeah, there are challenges; they still exist. That part hasn't been cracked. What has been cracked is that some of the software enabling us to do different, novel and maybe better analysis is now available. But we're still going to grapple with the culture and language challenges. Thank you. We've got someone else who would like to ask — Gabrielle. With KNIME — I haven't used the package — is it better suited to single users, or does it facilitate a team-type arrangement? Okay. As a single user, I can create a workflow, and I can copy that workflow and give it to others to use. That doesn't necessitate me paying for KNIME, or anyone else in my group paying for KNIME. If I wanted a platform for the enterprise — it could be a small group of people — where we share a common workspace and collaborate, and I build the model but you can play with it without me giving you the actual model, then you've got to pay for it. So it depends on how you want to collaborate. I guess another way of viewing it is like Google Docs: there's the spreadsheet, and everyone can contribute to that spreadsheet on the Google Drive if they've got access to the drive — that's one way of working. Or I can have Excel on my machine and send you the spreadsheet — whichever it is that you want. So yeah, it's free if it's only on my machine. I can still share my workflow, I can still share the data I'm using, but then you have to go and rerun the analysis — and provided that's not a hugely time-consuming thing to do, I don't see a big deal in everyone having their own copy. Thank you. Anybody else?
I was just going to ask, Mark, if you had any comments on NVivo as an alternative? No, I haven't — I guess NVivo still requires the user to code. When I look at what we've been doing, what I've shown here, that notion of coding the text is taken away from me. It is using the underlying text to come up with the topics. I'm not specifying anything apart from: here's the text. All right, I might say there are some words I don't want to include in the analysis, or there may be some common acronyms I want to add to the library, but beyond that, the analysis just occurs. I do not specify what the topics should be — that comes from the data, not from me saying here are my topics. So that is a difference. That's fascinating. Thank you. Thanks, Mark.