Okay, so how deep is deep learning? That is the title of my talk. I'm Amar Lalwani, lead R&D at Funtoot, and I'm also pursuing my PhD at IIIT Bangalore. Let me start with a question. How many of you have been in a situation in a classroom where you wanted to ask a question, but you were afraid or shy and couldn't ask? Show of hands, please. So, almost all of you. Now, I'm not here to give an inspirational talk saying, oh, you should have asked that question. What I'm going to tell you today is: maybe you were right. Maybe I was right in not asking the question. The fears and worries we have are not imaginary; they are real. But I think we might have asked that question, we might have sought the answers, if I knew I was the only one listening to the teacher, if I knew there was no other student in the class so that my question wouldn't look dumb, if I knew I could get straight, one-on-one access to the teacher. But is that really possible? Does it work? Is it feasible? This is the current trend in classrooms: homogeneous teaching. One teacher and 30, 40, in some cases even more than 50 students. And each one of us in the classroom is expected to behave in the same way. For an optimist, a half-filled bottle is half full. For a pessimist, it's half empty. And for me, if I'm thirsty, it's just a bottle of water. Yet we are all expected to learn each topic in the same way, in the same amount of time, with the same proficiency. So at the end of the year, whether you are at 40% proficiency or 80% or 90%, you are pushed to the next grade, and those gaps keep growing and growing with each year. This brings me to a landmark study by a great educationist named Benjamin Bloom. The study was done in 1984, and it compared different styles of teaching.
Now, the bell curve on your right-hand side shows the scores of the students who got one-on-one tutoring, where there was one teacher (the same set of teachers, by the way, throughout the study) for a maximum of three students. And the bell curve on the left-hand side, which is a little bulged, shows the students who got conventional teaching, where the teacher-to-student ratio is 1:30. The red line is the mean of the scores obtained by the students tutored one-on-one. Even the average student in the one-on-one tutored group performed better than 98% of the students taught in the conventional manner. So what does this tell you? That if a student is not doing well, it's not his or her fault. It might be the fault of the teacher or the teaching process. Let's look at it from another perspective: 80% of the students in the one-on-one tutored group performed like the top 20% of the students in the conventional group. And you can see that the variance of the one-on-one bell curve is much smaller. What does this tell you? Any student can perform better. Any student can do wonders if the teacher or the teaching process is right. There is no notion of a poor child, a less intelligent child, or a less smart child. That is precisely what we are here for. Funtoot is an intelligent tutoring system built on the belief that every child is unique, and it embeds both mastery learning (which I just mentioned: first focus on mastery, even if it takes you a little longer, because you are unique) and one-on-one tutoring. The Funtoot journey so far: we started in 2012, and in five years we have served more than 100,000 students, Funtoot has assisted in more than 50 million instances, and students have spent around 6 million hours on the platform.
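To make the "98%" figure concrete: Bloom's result is often called the "2 sigma problem," because the tutored group's mean sat roughly two standard deviations above the conventional group's mean. Under a normal (bell-curve) assumption, a quick check (a sketch for illustration, not part of the original study) recovers that percentile:

```python
from statistics import NormalDist

# A score two standard deviations above the mean of a normal distribution
# beats this fraction of that distribution:
percentile = NormalDist(mu=0.0, sigma=1.0).cdf(2.0)
print(f"{percentile:.3f}")  # ~0.977, i.e. roughly the "98%" quoted above
```

The small discrepancy (97.7% vs 98%) is just the usual rounding in how the study is quoted.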
Now, we solve many complex problems at Funtoot, but today I'm going to talk about one specific problem in detail. There is a student on Funtoot, maybe in a class of elephants, and let's take the fifth elephant, who is working like this: eight questions are given to the student sequentially, and "solved" or "unsolved" is the student's response to each. At the end of this sequence, can you tell me something about the student, like what the student's knowledge state is, or how the student will perform if the next question is given? And by the way, all the questions are related to the same skill or concept. By skill I mean addition, subtraction, multiplication, or you can be more granular, like addition of two-digit numbers, depending on the context you choose. The best way would be to just take a peek into the student's brain, but is that possible? The other option is to directly ask the student: how much do you know? But do you really know what you don't know? That's still an open question. So let's make it formal. This problem is referred to as knowledge tracing. Don't worry, it's just a fancy name for a time-series prediction problem: based on the first n attempts on questions of a particular exercise or skill, you need to predict what is going to happen on the (n+1)th attempt. The zeros and ones here show the correctness of each of the first n attempts, and from that, how can you explain learning, predict learning, measure learning? Now, there have not been many breakthroughs in understanding and explaining the learning process of humans. We talk about machine learning, but let's first talk about human learning.
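To make the formulation concrete, here is a minimal sketch of how an attempt history is usually encoded for knowledge tracing. The skill names and the history below are made up, and the 2 × (number of skills) one-hot scheme is the encoding common in the knowledge-tracing literature, not necessarily Funtoot's exact format:

```python
# A student's history: (skill, correct) pairs for the first n attempts.
# The task is to predict correctness on attempt n+1.
history = [
    ("addition_2digit", 1),  # solved
    ("addition_2digit", 0),  # unsolved
    ("addition_2digit", 1),
    ("addition_2digit", 1),
]

def one_hot(skills, skill, correct):
    """Encode one attempt as a 2 * num_skills vector: the first block
    marks an incorrect attempt on a skill, the second block a correct one."""
    x = [0.0] * (2 * len(skills))
    x[skills.index(skill) + correct * len(skills)] = 1.0
    return x

skills = ["addition_2digit", "subtraction_borrow"]
x = one_hot(skills, "addition_2digit", 1)
print(x)  # [0.0, 0.0, 1.0, 0.0]
```

Each such vector becomes one time step of the input sequence fed to the sequence model.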
Cognitive scientists have still not been able to find out how humans acquire knowledge, what actually explains learning, or what it even means to say that I have learned something or acquired some skill, because an improvement is always possible; we never stop. So obviously no assumptions or preconceived notions will help. We need a fairly complex, fairly general, and fairly powerful model. This brings me to a very famous quote by the Joker: "The only sensible way to live in this world is to live without rules." A Joker card in a deck of cards can play the role of any other card. It is very flexible and at the same time powerful, and in data science that Joker is deep learning. Now let me go into a bit of detail about recurrent neural networks, because for time-series problems recurrent neural networks seem to work well; there is a history behind that. And in the education sector, you know that learning is long-term dependent: something I did three years back might help me solve a particular problem today. So it's really long-term, and from the RNN class of models, LSTMs tend to do well, because LSTMs have proved good at modeling long-term dependencies. But let's first go into recurrent neural networks. Most of you may already know this, but for the sake of discussion: consider a neural network with just one input, one output, and a hidden layer with just two hidden neurons, which are recurrent neurons. By recurrent we mean there is a connection from a neuron to itself. If we unfold it in time, there is a connection from a hidden neuron at time step one to the hidden neuron at time step two, and it can connect to itself or to the other neuron as well.
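The unfolding described above can be written in a few lines. This is a toy recurrent step with made-up weights (not a trained model and not Funtoot's implementation), just to show that the hidden state at each step is a function of the current input and the previous hidden state, so early attempts keep influencing later states:

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One unfolded time step: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b)."""
    return [math.tanh(sum(w * xi for w, xi in zip(W_xh[j], x))
                      + sum(w * hi for w, hi in zip(W_hh[j], h_prev))
                      + b[j])
            for j in range(len(b))]

# Two hidden neurons and one input, as in the toy network described above.
# Weight values are arbitrary illustrative numbers.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.0]

h = [0.0, 0.0]
for x_t in ([1.0], [0.0], [1.0]):  # a short solved/unsolved-style sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b)
print([round(v, 3) for v in h])
```

Feeding the same inputs in a different order ends in a different hidden state, which is exactly the history-dependence that makes RNNs a fit for sequences of attempts.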
So at any given point of time, the output neuron is a function of the input at that particular time step as well as the values of the hidden neurons at the earlier time step. To put things into perspective, how do we see this in our domain? Consider three skills, with many, many questions available for each skill, and say a pre-trained model is available to us. This is our hidden state, and you can imagine the hidden state as the knowledge state of the skills present in the student. Before even coming to Funtoot, this is the initial knowledge state. For skill A, consider the output layer with three neurons: the first output is the probability of correctness given a question of skill A. Say that is one; maybe that skill is very easy, the student doesn't need to learn anything, he already knows it. Point three for the second skill, B, and point seven for the third skill, C. Now, what happens if the student is given question one, which belongs to skill A? By red color I mean the question was answered incorrectly. You can see there is a drop in the probability, and if you notice, the drop is in both skill A and skill C. So the incorrect answer on skill A is not only contributing to the drop in skill A's probability but also to the drop in skill C's. Likewise, you can figure out the influences between skills. With the second question, the incorrect answer on the skill B question actually decreases the probability of all three skills. So maybe skill B influences all the skills; maybe it's a general skill which applies to all the sub-skills. With question three answered correctly, you can see an increase in the probability of skill B and skill C. So likewise, as the student progresses, you can build a learner profile.
You can build a profile of the student's strengths and weaknesses and help the student better. That's why it's called knowledge tracing: you are tracing the knowledge across the interactions. Now, a quick look at the architecture. We have an input layer where the number of neurons is proportional to the number of questions or the number of skills (you can choose either), and a hidden layer where the rule of thumb, at least in the education domain, is that the number of neurons is proportional to the number of skills, on the order of 5 to 10 times. Again, this is a rule of thumb; it depends on how interlinked your skills are. If your skills are fairly independent, a smaller number might work. If your skills are heavily correlated, you need more neurons to encode and model those skills. Now let's dig a bit deeper. What is a hidden connection? By hidden connection I mean a connection from a hidden neuron to itself or to other hidden neurons. What does it signify? As I said earlier, assume that a hidden neuron resembles a skill. A link between one hidden neuron and another is then like the correlation between two skills: how does one skill influence the learning of the other? If the link weight is high, the two skills are correlated. Maybe both skills can be acquired in parallel; you can give the questions in an interleaved fashion, a skill-one question, then a skill-two question, and they are independent. Or maybe learning skill A implies learning skill B, so you don't need to test skill B if you already know the student is good at skill A. Likewise, you can infer such relationships. Now, to talk about our experiment, let's first see what the dataset is.
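Before the dataset, the sizing rules above can be sketched as a toy DKT-style forward pass. The weights here are random and untrained, purely to show the shapes: an input of 2 × (number of skills) for the (skill, correctness) one-hot, a recurrent hidden layer around 5 to 10 times the number of skills, and one correctness probability per skill at the output. None of these numbers come from the actual Funtoot model:

```python
import numpy as np

rng = np.random.default_rng(0)
num_skills = 3
input_dim = 2 * num_skills    # (skill, incorrect/correct) one-hot
hidden_dim = 5 * num_skills   # rule-of-thumb hidden sizing

W_xh = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
W_hh = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
W_hy = rng.normal(0.0, 0.1, (num_skills, hidden_dim))

def step(x, h):
    """One time step: update the hidden (knowledge) state, then read out
    a correctness probability for every skill via a sigmoid."""
    h = np.tanh(W_xh @ x + W_hh @ h)
    p = 1.0 / (1.0 + np.exp(-(W_hy @ h)))
    return h, p

h = np.zeros(hidden_dim)
x = np.zeros(input_dim)
x[0 + num_skills] = 1.0       # a correct attempt on skill 0
h, p = step(x, h)
print(np.round(p, 3))         # three probabilities, one per skill
```

Feeding the sequence of attempts through `step` one at a time is what traces the per-skill probabilities shown in the earlier slide.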
We took a dataset of the 6th grade math CBSE curriculum from around 8,000 students across 180 schools, for which we have 6 million data points, and around 75% of the data belongs to class 1. Again, in the education domain a dataset this big is rare. Now, the results. You can see I have given the result of the deep learning model here, but how do you evaluate whether your model is good? You need a benchmark to compare with; you need a ground truth. How do you judge and decide whether to deploy such a model? For that you need a conventional model. Deep learning was applied to the education sector very recently by researchers from Google and Stanford; deep knowledge tracing (DKT) is their model. But there is a very, very successful model called Bayesian knowledge tracing that has long been used in academia as well as in commercial systems; we will come to it very soon, and we will use it to compare against DKT. So now you can clearly see that deep learning gets a 15% gain over the shallow model. From now on we will refer to DKT, deep knowledge tracing, as the deep model and the other models as shallow models. Last year, a group of researchers went ahead and did a careful, detailed analysis of these two models and figured out what exactly is missing in the shallow models, because in education, as we saw in the earlier talks, explainability and interpretability are key. You cannot just produce a sequence of questions; a teacher won't accept it. You need a better understanding of the learning, and otherwise we can get into ethical issues if we go down that path.
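One practical note on the 75% class-1 imbalance: it means raw accuracy is a weak yardstick for comparing these models, since a constant "correct" prediction already scores 0.75. That is one reason knowledge-tracing work tends to report AUC instead. A trivial illustration with made-up labels (not the actual dataset):

```python
# With ~75% of attempts labeled correct, the majority-class baseline
# already achieves 75% accuracy without learning anything about students.
labels = [1] * 75 + [0] * 25
majority_accuracy = sum(1 for y in labels if y == 1) / len(labels)
print(majority_accuracy)  # 0.75
```

Any reported gain therefore has to be read against a metric that is insensitive to this baseline, such as AUC.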
So they did a very careful analysis to find out what exactly is not well with the shallow model and how such models can be improved, and for the time being we will call those models "shallow star," because they are a small enhancement of the shallow models. And now you can see that the deep learning models and the shallow-star models actually perform equally well. So do you really need deep learning models? To discuss and answer that question, let's park it aside and first understand what a shallow model is. Our shallow model is known as BKT, Bayesian Knowledge Tracing. To start with, this is a skill-specific model: a different model for each skill, while in DKT, deep knowledge tracing, all the skills, all the questions, and the full sequence of the child were fed into one model. For a given skill, a student can lie in one of two states: learned or unlearned. It's a binary state. I know we don't operate like this in real life; you cannot just say "I know driving" or "I don't know driving," you actually improve gradually. But this is a very simple model, for the sake of simplicity and explainability. So the student lies in one of the two states, doesn't know or knows, and there is no forgetting: there is no transition from the learned state to the unlearned state. Once you learn, you cannot forget. That is surprising, I know. Now, what does the transition from the unlearned state to the learned state signify? It tells you something about the skill. If P(T) is high, there is a fairly good chance that, given one more question of this skill, you can jump from the unlearned state to the learned state; the skill is easily acquirable. If it is low, the skill is hard to acquire.
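The BKT machinery described here, together with the slip and guess parameters defined in a moment, fits in a few lines. This is a generic sketch of the standard BKT update with illustrative (not fitted) parameter values:

```python
def bkt_update(p_l, correct, p_t, p_s, p_g):
    """One BKT step: Bayes posterior on the learned state given the
    observation, then the no-forgetting transition unlearned -> learned."""
    if correct:
        post = p_l * (1 - p_s) / (p_l * (1 - p_s) + (1 - p_l) * p_g)
    else:
        post = p_l * p_s / (p_l * p_s + (1 - p_l) * (1 - p_g))
    return post + (1 - post) * p_t   # no learned -> unlearned transition

def p_correct(p_l, p_s, p_g):
    """Predicted correctness: learned-but-slipped vs unlearned-but-guessed."""
    return p_l * (1 - p_s) + (1 - p_l) * p_g

p_l = 0.3                            # P(L0): prior probability of mastery
for obs in (1, 0, 1, 1):             # a short solved/unsolved sequence
    p_l = bkt_update(p_l, obs, p_t=0.2, p_s=0.1, p_g=0.25)
print(round(p_l, 3), round(p_correct(p_l, 0.1, 0.25), 3))
```

The four parameters named in the talk, P(L0), P(T), P(S), and P(G), are all that one skill needs, which is where the "four parameters per skill" count below comes from.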
I mean, you need very many questions to actually reach the mastery state. And how do we place the student in one of these states to start with? That is the initial probability, P(L0), which signifies how difficult the skill is: what is the probability that somebody already comes with a background in that skill? If that probability is very high, the skill is actually very simple; maybe the student already knows it, so why give more questions? There are two other things. Being in the learned state, you can still answer a question incorrectly, which is like making a careless mistake, slipping; that's why we call it P(S), the probability of slipping. The other is being in the unlearned state and answering correctly, which is pure luck, pure guessing; that's P(G), the probability of guessing. Now, to put things in perspective: a two-state HMM, known as Bayesian knowledge tracing, with four parameters that are very interpretable. For content writers, content authors, and teachers, this is very useful information; it tells you pretty much everything about the skill. Now, which one to use? How do we decide, shallow versus deep? In terms of parameters, DKT has a few hundred thousand parameters; obviously you need GPUs and a high amount of resources to train it. Our shallow models have 4 × (number of skills) parameters, four per skill, so even if you are talking about a thousand skills, we have just four thousand parameters, and they are very interpretable, by the way. That's a clear advantage: interpretability. But in terms of performance, both are the same. So why should we go for deep learning? Is it needed, or is it an overkill here? It turns out deep learning gives you certain advantages, like intelligent curriculum design: the output layer gives you
a probability for each skill, which your reinforcement learning algorithms can use to design the problem as a game, where your move is giving the next question, picked carefully, and the student's move is attempting the question, answering correctly or incorrectly, and so the game follows. The system's goal there is to help the student reach the mastery state, and that's why we call it a learning path: you need to find the best sequence, the best learning path, for the student to reach mastery as quickly as possible. That can be your goal if you model this problem with reinforcement learning. Second, discovery of structure. We saw in the DKT slide that you can place the skill labels in the input layer, or you can mask them and just use question labels, assuming you don't have the skill information. In that case you can actually figure out the structure. In any educational product, any school, any textbook, you first decide the index: how are the concepts related to each other, what should be taught first, how are the micro, granular concepts related, and if this is the mistake, where should we go next? It seems this can be figured out by the machine, so there is no need for expensive expert models, because the models generated by deep learning are actually very close to the models generated by experts, and that is where we spend most of our time, designing a good domain and expert model. The last thing, which is fairly obvious: for each input you can have a complex representation. You can have more neurons in your input and specify all the relevant information you can think of, the demographics of the student, how the student is doing that day, the demographics of the parents, and things like that. The shallow models have just one input, correctness or incorrectness, but this deep model
can consider a very granular and detailed representation of the student's input. This brings me to the obvious question: is deep learning really deep? I don't know; maybe not, because the results show that the shallow and deep models perform equally well. One needs to carefully understand the domain first, understand the data, and understand the assumptions. We started with the assumption that learning cannot be binary, but here we saw that the shallow models with binary states give the same performance. So maybe deep learning is not the remedy and solution for all problems, the way people say; I don't think it is the panacea. These are the references. The last one is our work, published just last month, where we extend this work by considering one more model, performance factor analysis, and we show that performance factor analysis, which is much shallower, with just three parameters, actually beats deep knowledge tracing. Thank you.
Next up we have a beverage break, so those of you who want to ask questions can stay back and we can continue for a long time.
First, a general question to you. You said that if a student does not perform, the problem lies with the teacher or the teaching process. The same teacher teaches many students, and among them some perform and some don't. So if I ask you generally, do you think that is true?
That is what the study by the pioneer of the education domain showed. Now, if you ask me personally: if I gave this talk to you in person, I might first find out what you want to know and what you actually think about the topic, and I might change my behavior and my tutoring strategy; I might ask you questions first and then deliver my talk. But I cannot do the same with an audience of 50, 100, or 200. So yes, it depends on how adaptive I am, how well I as a teacher can cater to each and every student.
And the second question: has the RNN become a de facto standard for solving such problems in education?
I don't think so. This is a very recent introduction. In any domain, it is just the model for time-series prediction, and this is also a time series, so RNNs and LSTMs work well.
Okay, thanks.
Hi, two quick questions. One: what are skills in your context? What do they mean?
Yeah, I have abstracted that away for this talk, but in our context we have a fairly granular ontology: there is a curriculum, there is a subject, a topic, a sub-topic, a sub-sub-topic. Just to give you an example, one sub-sub-topic can be addition of one-digit numbers, another can be addition of two-digit numbers. Now, even inside that, you can have granular learning gaps, and the learning gap is our USP. For example, if I give a question, I don't just care about right answer versus wrong answer. Generally, wherever you go, there are radio buttons for options, and the system just tells you this is right, this is wrong, that's all. But we need to find out the possible reason behind the wrong answer; that is what we call a learning gap. The authors actually sit and tag each and every incorrect pattern: for this incorrect pattern,
this might be the reason, because we need to explain the incorrect response. So in our case, the learning gap is the skill.
Okay, and that is annotated by your human experts, right? Which, you're saying, can actually be figured out by deep knowledge-tracing models, that is, the relationship between those skills is what you're able to discover. And I forget my other question.
It's okay, we can meet offline.
Hi, I want to ask: what is the size of this skill set? And also, some skills could be wrong, right? How do you have standardization? Some schools teach certain things in one way because they feel that's the right method, and some schools teach something else. By skill I guess you mean the algorithm the student uses to solve a problem, and there can be a shorter method using a different skill. So how do you standardize that?
The answer is that, again, the process is followed by cognitive experts, by which I mean experienced teachers. We have a fairly granular process where teachers collaborate and tag each and every question and response. As for standardization, for the CBSE curriculum we have around 100 schools, and our skill set is fairly well accepted across all the schools, but that took us a lot of time. In the initial stages we had skill markings, but teachers could come and repair them, maybe suggest another combination. And we have algorithms too: with outlier prediction you can actually find a question that does not belong to a skill, because that question shows something else. It might come up as a very difficult question while the skill comes up as very easy, so clearly the question does not belong to that skill. So it works both ways: it took
the experts a long time to reach a good state, and we augment that with data science by doing such outlier prediction.
So what is the size of your skill set?
For this study, for the 6th grade math curriculum, we have 442 skills and 1,523 problems. Purely in those terms, it's just dimensionality reduction: you actually have 1,523 problems, and when I said that using deep learning models you can find the latent skills, that comes close to the 440. If you don't input the skill labels to the model, you just input 1,523 question labels, and in that case you can find around 440 clusters or something like that.
Actually, my question was almost answered when you answered his question. When you talk about skills, as far as I can see, one topic is one skill. But within that skill set, the level of the questions is not always going to be the same, so you can't say that you've learned this particular skill while an easy question belongs to some other skill set. So at what level are you categorizing those skills? That was basically my question.
Two things. One, the skill is not a topic here; that would be fairly abstract. We have skills at a very granular level, like addition of one-digit numbers, or subtraction with borrow, subtraction without borrow, subtraction of two-digit numbers with borrow, subtraction of two-digit numbers without borrow. And for each such skill, we follow the revised Bloom's taxonomy, which says a skill is acquired in six steps: first you remember the facts, then you understand them, then you apply them, then you analyze them, then you evaluate them, then you create. Creation is like innovation or research; you can contribute back by generating new questions. So we have questions tagged by the experts in each of these six parts for each skill, so for each skill we have questions
at different levels and different cognitive abilities.
Hi, Manus from Episodes, nice talk. I'll veer slightly away from the topic and the domain you're working in. This comes from the last talk, about explaining models, and from your point about how you can say these are certain skills and their values keep changing after each prediction. So correct me if I'm wrong. What I was imagining is: if I am using this technique, it's a time-series prediction problem, so if I translate that into a sequence-prediction problem, my proxies for the skills would be my features, and the weights you are seeing, I'm assuming, are essentially your hidden-layer weights, which get changed after each time step. So if I can see the predictions, the probabilities of each of the features, changing after each time step of my sequence-labeling problem, I should essentially be able to say that if the probability of a certain feature is massively increasing after this prediction, then this is a feature I need to be careful about, or this is a feature which is affecting my prediction entirely. Do you think that is possible, or am I off?
Actually, I didn't understand your question fully, but one thing I would like to correct here: the weights are not changing, the value of the hidden state is changing. The weights change when you train, at that point, but then they stabilize.
So what are the weights?
They are the final weights.
I'm not talking about the final weights; I'm talking about the weights after each time step. After each time step your weights have to update as well, right?
It's not the weights, it's the output of those neurons.
So finally there would be a softmax layer, right? It's a probability that you're extracting. The final layer is a softmax, right?
It depends. A question can have more than one skill, so we cannot always have a softmax there. If we are working at the question level, whether that question is answered correctly or incorrectly, then we can use a softmax; but when one question can have more than one skill, we cannot.
In this framework, are you able to see any dependency of the current prediction on the prior five predictions? Have you tried to follow that?
That we have not tried yet. But the accuracy of the predictions improves with time, and not over a very long horizon: the first one or two predictions are sometimes off track, but after that there is little change with each and every attempt; it's pretty stable.
So those would be global metrics, right, accuracy and precision scores, global and not at each instance?
Whatever I have reported here is global.
Okay, and for the skills, you have depended only on the teachers to list them down, right? The 442 skills that you mentioned, you kind of depended on the teachers to list them and do all that. So how do you get the list of skills that you are considering?
Our in-house team of teachers and cognitive experts develops that.
This is not exactly a question, just an observation. As you mentioned, the 1,523 questions can actually cluster into around 442, and each cluster will probably connect to one skill, right? No, more
than one skill.
Okay, so one or more than one. So, just out of curiosity: if we don't start with the skills labeled by humans, and instead just start with the questions and try to figure out whether we can come up with a skill set, like these 442, as combinations of one or more fundamental skills, then the whole thing could be made fairly autonomous. Have you ever tried that?
That's precisely the second point I made: you can discover the structure if you input only question labels, which means you are not actually telling the model what skills are associated with them. That is one advantage of going to a deep learning model even when you know that both give the same performance. But the other thing is that this is a chicken-and-egg problem: it can only be done when you have the data. If you are starting afresh, you cannot do such stuff. In our journey, down the line, we can do this now, but we could not have done it at the beginning; you need data for this. It is an interesting study, to find out whether, when you have a lot of data, you really can construct the skill set without human intervention.
What I am trying to say is that maybe when you have a lot of data at the beginning, in certain scenarios, you can find a kind of analogy between the deep learning and shallow learning models: first determine the set of skills, like here, and then come up with a shallow learning model, without bothering much about deep learning.
Yeah, actually we use a mix of them. In production, in a commercial setting, we never go with just the deep learning model, because it's a little complex to implement and it's unnecessary; it's clearly an overkill there. But to know more about the domain and to find more relationships, it's
useful. So that's why it's a trade-off. Thank you.
Hello, I have a question. You categorize the questions into different levels. Suppose at the hardest level, where you qualify for the next skill set, you have five or six questions. As we have seen in our school days, if a question is hard for us, we will still take a go at it. So the guessing part that you considered as a probability, in the non-deep-learning model, will still be there. How will you handle that? Suppose a student takes a go at it, gets all of these correct, and you upgrade him; then the 30% learning gap you mentioned in the conventional setting still remains, still not covered. That part will still be a black box. How will you cover it?
First, this is a probabilistic model, so you cannot really say there is a 30% gap; it is all a probabilistic way of measuring learning. There is a probability of you lying in the learned state, and one minus that probability of you lying in the unlearned state, and when we calculate the probability of a correct answer, guessing is already taken into account. So if I tell you that the correctness probability is 0.9, I have already considered the guessing part; I am already considering that the probability of you getting the answer correct is 0.9 because this question of this skill has a higher guessability. That is already considered. And yes, nothing is perfect; what we are trying to say is that at least this is better. That is the motivation for the problem. All right.