I think we'll go ahead and make a start. Today we have two short presentations, 15 minutes each, and we'll have Q&A after. My name is Jellie. The first one is "AI Foundation Models in Generating Programming Assessment Specifications and Model Answers" by Amal, with Eric Atwell and Amar. So, go ahead.

Okay, good evening. My name is Amal Al-Qajali. I am a PhD student at the University of Leeds; I did my master's degree at the City University of New York in the US, and I have recently joined the Artificial Intelligence lab in the School of Computing. Thank you for giving me the opportunity to share my research. First I will briefly present my research, then I will go through an initial experiment I have conducted. Afterwards I will spend a few minutes sharing some thoughts and considerations, and at the end I will be happy to hear your feedback and answer your questions.

My abstract, which has been accepted at the ALT conference, is entitled "AI Foundation Models to Generate Programming Assessment Specifications and Model Answers". So, what is a foundation model? The term was introduced in August 2021 by the Stanford Institute for Human-Centered Artificial Intelligence, referring to a broad AI model that is trained on a huge dataset and can be adapted to a wide range of downstream tasks: for example, deep-learning transformer models such as ChatGPT and BERT. As we have seen, transformer models have outstanding performance on understanding natural language; however, they do have some limitations when generating code.
So we propose a foundation-model approach to speed up the time-consuming work instructors face in setting and grading programming assessments in large courses. A lot of research has been conducted on auto-grading systems; however, they have limitations. One of the primary limitations is that these systems depend heavily on the nature of the exercise. Another is that if an instructor is familiar with, say, Python, and teaches Python courses, it will be quite difficult for them to use an auto-grader for a C course, because instructors need certain technical programming skills to use these tools. A further limitation is that instructors spend a lot of time feeding the auto-grading tool with correct reference implementations, so that the tool can compare the student submission against the key answer.

So we propose a foundation-model approach to reduce the instructor's workload. By reducing the workload we do not necessarily mean eliminating these jobs; as we have seen, there has been a debate about whether this new generative AI has the potential to cut jobs. In fact, last August McKinsey published a report estimating that by 2030 generative AI has the potential to automate around 30% of worked hours. McKinsey, by the way, is one of the most prestigious management consulting firms in the world, with offices in more than 65 countries. So, since large language models have shown remarkable performance on understanding natural language, they have the potential to reduce this workload.
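The auto-grading setup described above, where the instructor supplies a reference implementation and the tool compares student output against it, can be sketched roughly as follows. This is a minimal illustration, not the tool from the talk; the task, function names, and test cases are invented for the example.

```python
# Minimal sketch of an output-comparison autograder: the instructor supplies
# a reference implementation and test inputs, and the student's function is
# scored by how often its outputs match the reference's outputs.

def reference_sum_even(nums):
    """Instructor's key answer (assumed example task: sum the even numbers)."""
    return sum(n for n in nums if n % 2 == 0)

def student_sum_even(nums):
    """A hypothetical student submission, written a different way."""
    total = 0
    for n in nums:
        if n % 2 == 0:
            total += n
    return total

def grade(student_fn, reference_fn, test_inputs):
    """Return the fraction of test cases where student output matches the key."""
    passed = sum(student_fn(t) == reference_fn(t) for t in test_inputs)
    return passed / len(test_inputs)

tests = [[1, 2, 3, 4], [], [5, 7], [0, 2, 8]]
score = grade(student_sum_even, reference_sum_even, tests)
```

The point of the sketch is the cost it hides: the instructor still has to write `reference_sum_even` and the test inputs by hand for every exercise, which is exactly the workload the proposed approach aims to reduce.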
So, is it possible for a large language model to reduce the instructor's workload? This is the question that my thesis seeks to answer. I have done an initial, very preliminary experiment. To mark student work, whether in a programming course or not, any instructor will compare the student's answer with the key answer. So the first thing that came to my mind was: how can I let the machine, the large language model, compare the key answer with the student submission, and what calculation can turn that comparison into a student score?

I built a synthetic dataset by generating fake answers. I assumed that I have four students and four questions, and I typed each exercise into ChatGPT exactly as it appears in the exam: the prompt was in natural language, and the expected output was in a programming language. On the right side is the answer generated by ChatGPT, and on the left side is the human answer. There are some similarities between the two. However, the generated answer does not contain any comments. From a programming-language point of view this does not affect the correctness of the code, because comments are not executable and the machine simply ignores them.

I did some static analysis of the answers. I have only four questions; here are the key answers by a human, and here are the four answers generated by ChatGPT. One thing I really have to take into consideration during my study is that when I asked the generative model for an answer, the first time it gave me around 250 words, yet when I asked it to regenerate an answer to the same question, it gave me only around 20 words.
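The calculation I was looking for, comparing a key answer to a submission, can be illustrated with cosine similarity between vector representations of the two texts. The sketch below uses simple word-count vectors as a stand-in for the sentence embeddings used in the actual experiment; the two answer strings are made up for illustration.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two answer texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

key_answer = "def add(a, b): return a + b"
generated  = "def add(x, y): return x + y"
similarity = cosine_similarity(key_answer, generated)
```

Note that this treats code as plain tokens: the two functions are behaviourally identical, but renaming the parameters lowers the similarity, which previews the problem observed below with models that process code as pure natural language.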
And it is obvious that a 250-word answer is not equivalent to a 20-word answer; something is wrong between the two generations. So this wide variance in the length of generated answers is one of the things I have to take into consideration.

I then used SBERT, which is a modification of the BERT transformer, to compute cosine similarities between two documents: the fake answers produced by the generative AI and the key answers by the human. What I noticed is that SBERT processes the documents as pure English, as natural language; it does not consider any syntax or semantics of the code. It also showed that comments have a major effect on the student's score. What I did was run the process twice, comparing two versions of the documents: one including the instructor's comments, and one with those comments omitted, to see how far these few sentences change the similarity the transformer reports. For question three, for example, the model gave a similarity of 85% when the comments were included. From an educational point of view, comments in source code give the instructor insight into how well the student understands that piece of code, but from a programming-language point of view they do not affect the program, because they are not executable. Yet when I tried the same question with the same data while ignoring the comments, the student's score dropped to 74%. So the transformer treats comments as a major factor in the student's final score, even though these few non-executable sentences should not be a major factor. Another observation came from the same process on the same data.
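The with/without-comments comparison can be reproduced by stripping the comments before computing similarity. This is a rough sketch, assuming Python-style `#` comments and the same simple word-count similarity as a stand-in for the SBERT embeddings used in the talk; the key and student answers here are invented.

```python
import math
import re
from collections import Counter

def strip_comments(code):
    """Remove '#' comments (naive: ignores '#' inside string literals)."""
    return "\n".join(re.sub(r"#.*", "", line).rstrip()
                     for line in code.splitlines())

def cosine(a_text, b_text):
    """Cosine similarity between bag-of-words vectors of two texts."""
    a, b = Counter(a_text.split()), Counter(b_text.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

key = "total = 0  # running sum of even numbers\nfor n in nums:\n    total += n"
student = "total = 0  # accumulate the result\nfor n in nums:\n    total += n"

with_comments    = cosine(key, student)
without_comments = cosine(strip_comments(key), strip_comments(student))
```

Here the code is identical and only the comments differ, so stripping them raises the similarity; with differently worded but pedagogically equivalent comments, a text-only model swings the score in either direction, which is the effect the experiment observed.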
For another question, the SBERT transformer was not swayed by the comments: including them it gave 85, and ignoring them it gave 87, so the student received essentially the same score. So I do think large language models have the potential to detect errors in student work. However, I have to take into consideration that I should use a transformer model built for programming languages, a text-to-code model whose input is natural language and whose output is code, such as CodeT5, so that the machine can detect and understand the semantics of the programming language. I also have to find a middle representation between the student submission and the key answer that the large language model can follow.

Thank you for listening. By the way, this photo was also generated by generative AI, and the prompt is on the slide. I am happy to receive your feedback and listen to your questions. Thank you.

Question from the audience: What the previous speaker was talking about is that there has been such rapid development. When people were turning to assessment, say as a student you put in a prompt, get an essay out, and then almost critique it: when she originally did it, the output was terrible and there were a lot of things for students to pick up on, but when she did it again just before teaching, it was an almost pristine, perfect essay. So how well it did months ago completely changed compared to how well it does now, as the technology evolves. You used quite a long prompt here, so I'm wondering how you think people will deal with this kind of problem, where it didn't perform as well as you'd expect.
So how do you think, as this sort of technology rapidly advances, it will change what you found here?

I think you are right, because generative AI depends on the data it has been trained on. So this rapid advancement in artificial intelligence is something I really need to consider. But I don't think that, so far in my research, it would go so far as to change the output, I mean all the results, because transformer models such as ChatGPT already run on huge data and huge compute.