Hey everyone, I'd like to introduce our next speaker. His name is Mark Musen. Dr. Musen is a professor of biomedical informatics and of biomedical data science at Stanford University, where he is the director of the Stanford Center for Biomedical Informatics Research. Dr. Musen conducts research related to open science, intelligent systems, computational ontologies, and biomedical decision support. His group developed Protégé, the world's most widely used technology for building and managing terminologies and ontologies. He has served as principal investigator of the National Center for Biomedical Ontology, one of the original national centers for biomedical computing created by the U.S. National Institutes of Health. He also directs the Center for Expanded Data Annotation and Retrieval (CEDAR), which is the subject of the talk he's going to give now, founded under the NIH Big Data to Knowledge initiative. CEDAR develops semantic technology to ease the authoring and management of biomedical experimental metadata. So, Mark, without further ado, please take it away.

Thanks for that welcome. I'm really excited to be here, and I think this is really a fantastic opportunity. I'm really glad that CMU has recognized the importance of bringing together people interested in both AI and data management and data reuse. I'm going to talk to you today about CEDAR, which is a project we've been working on at Stanford for about the past five years, and about the problems of accessing data: why getting data, at least experimental data, online is so hard, and why I think data reuse is such a challenging and important problem. Earlier today, when Melissa Haendel gave her talk, she made the statement that we basically need to have better data in order to have good AI. I'm going to argue in this talk that we actually need good AI in order to generate even better data, and the emphasis here is on what AI is going to do to make the online data available to all of us more useful.

We've also heard a lot of talk today about the FAIR principles and the idea that data reuse requires data that are findable, accessible, interoperable, and reusable. That mantra has been with us for the past five years, and everybody agrees that if we want to be able to reuse data, we need to have FAIR data in the first place. Great idea. The problem is, when you look online and see what is actually available for reuse, almost all data are not FAIR. They're not FAIR because the FAIR principles are, in their own way, rather obtuse. They're principles; they're not easily operationalized in a way that lets us take them and ensure that the datasets we create are really FAIR. What that means is apparent when you look at the data repositories that exist in science, and I'll say up front that I'm biased because I work in biomedicine, so most of the data that I see are biomedical. Fundamentally, we have a problem because investigators view their work as publishing papers; nobody thinks their work is to leave a legacy of reusable data. The sponsors right now are getting very excited about FAIR and saying that data sharing is important, but they don't really pay investigators to share their data. Most important, and this is where my emphasis is going to be today: data are not going to be FAIR unless the metadata that describe the datasets are themselves FAIR, and getting metadata that are FAIR is really hard. So here's some sample metadata.
This is a random record given to me by one of my coworkers from NCBI, the National Center for Biotechnology Information at the National Library of Medicine. It's for a dataset that was put online by Genentech, and this is what metadata look like: usually there are sets of attribute-value pairs, and they describe the data that are online; in particular, they describe the experimental situation that led to the creation of those data. This all looks pretty good from a high level. But when you look at exactly what the investigator is putting into the metadata: well, "carcinoma hepatocellular" tells you what the disease is, but it's not a standardized term. Saying that the ethnicity is "Japanese" is a bit odd. Saying that the age is "57 year", not 57, and not "years" plural, is also odd. And if you were to search online for Asians with hepatocellular carcinoma, you wouldn't get this record, because the metadata don't give you the information you need in a form that makes the metadata useful.

You realize how we end up in that situation: most of the time, when people create metadata, they create them by filling out spreadsheets, and those spreadsheets give very little guidance; they don't tell you what controlled terms you should be using, and they don't tell you what the expected information in one of these metadata fields is. And what you get is this. This is a look at the Gene Expression Omnibus, one of the online databases at NIH that provides information about high-throughput biological experiments. If you were to look at records and want to find the age of the subject, you would see a gazillion ways to represent age: "age", "Age", "AGE", "age after birth", "age in years", and so on. Again, if you want to search for records and find patients of a certain age, not only do you have to anticipate how the investigator may have encoded the metadata to describe the age of the patient, but you have to think about all the variations and all the typos that could have been introduced into the metadata, and it becomes a real mess. In fact, we looked at the BioSample repository at NCBI. This is the repository of metadata regarding the samples referred to in other kinds of data at NCBI, which is supposed to be their best dataset. Despite the best efforts of NCBI, which tries really hard, and despite the best efforts of the investigators, who try really hard: 73 percent of the Boolean metadata values are not actually Boolean; their values are things like "non-smoker" and "former smoker". 26 percent of the integer metadata values can't be parsed as numbers; there are things like "jm52" and "pig". And 68 percent of the metadata entries that are supposed to represent terms from biomedical ontologies don't.
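To make that failure mode concrete, here is a minimal sketch, with hypothetical records and field names rather than actual BioSample data, of the kind of type checking that surfaces these problems:

```python
# Hypothetical check of declared-Boolean and declared-integer metadata
# values, in the spirit of the BioSample audit described above.
records = [
    {"smoker": "non-smoker", "age": "57 year"},   # neither value parses
    {"smoker": "true",       "age": "64"},        # both values parse
]

def is_boolean(value: str) -> bool:
    return value.strip().lower() in {"true", "false", "yes", "no"}

def is_integer(value: str) -> bool:
    try:
        int(value.strip())
        return True
    except ValueError:
        return False

bad_booleans = sum(not is_boolean(r["smoker"]) for r in records)
bad_integers = sum(not is_integer(r["age"]) for r in records)
print(f"{bad_booleans}/{len(records)} Boolean fields fail to parse")
print(f"{bad_integers}/{len(records)} integer fields fail to parse")
```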
So what we have is a situation where, if you're an investigator trying to do data reuse and you want to search online datasets, at least in biomedicine, at least using NCBI resources, then despite all the best efforts of everybody involved, the metadata don't make it easy for you. I have colleagues at Stanford who are really good "data parasites", so to speak; they spend weeks and weeks going through metadata by hand, because the online search capability is so impoverished.

That gets us back to the question of how we make our data FAIR. Honestly, we recognize that, at minimum, we need access to experimental datasets; we need mechanisms to search metadata to find the relevant experimental results; we need annotation of online datasets with adequate metadata, so we actually know what investigators did; we need controlled terms, so that we're not hunting for various spellings and typos; and we need a culture that actually cares about all this, where the investigators who create metadata recognize that they are creating datasets and putting them online so that new discoveries will be made from their data, and that they have a vested interest in making their data as reusable as possible. Well, that's all good, but the real question is how we get there.

We get there, number one, by making sure we're using standard ontologies to describe what exists in a dataset completely and consistently. Fortunately, at least in biomedicine where I work, we've had good ontologies around for about 300 years. This is Linnaeus, who created the system for speciation in biology, which is probably what all of us learned in high school, and this particular kind of ontology for identifying species of living organisms is just one example of the many hundreds of ontologies available in biology and medicine. In some sense, having hundreds of ontologies in biology and medicine is a problem, but the good news is that we at Stanford have been working for many years on a system called BioPortal. BioPortal is an open online system that gives you access to basically all the publicly available biomedical ontologies that exist. If you go to BioPortal, you'll see that we have hundreds of ontologies; many of them are standard ontologies used frequently in clinical medicine, and others are more unique, created by investigators in very specialized areas of biology. We can go to BioPortal, take a look at what kinds of ontologies are available, search them, identify, for example, all of the myriad terms in SNOMED CT that might be relevant for annotating a biomedical specimen, and identify the terms necessary to make the metadata for our datasets more standardized and therefore more FAIR.
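As an aside, BioPortal's search is also available programmatically. Here is a minimal sketch of a term lookup against its REST API, assuming you have a free BioPortal API key; the endpoint and response fields are from memory, so check the current API documentation:

```python
import requests

# Hypothetical term lookup against BioPortal's REST search endpoint;
# BIOPORTAL_API_KEY is a placeholder for a real (free) API key.
BIOPORTAL_API_KEY = "your-api-key-here"

response = requests.get(
    "https://data.bioontology.org/search",
    params={"q": "hepatocellular carcinoma", "apikey": BIOPORTAL_API_KEY},
)
response.raise_for_status()

# Print the first few matching terms and the ontologies they come from.
for result in response.json().get("collection", [])[:5]:
    print(result.get("prefLabel"), "-", result.get("links", {}).get("ontology"))
```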
That's good, but what about the actual structure of the metadata themselves? We need to be able to describe experiments completely and consistently, so that the people who want to reuse those data can search them through the metadata in a way that guarantees maximum recall. The real question is how we do that. Well, for 20 years, again in biomedicine, the people who do high-throughput experiments called microarray studies have recognized that, because their experiments generate thousands and thousands of data points, they're not going to publish the data in a journal; they're going to put them online, and they have to make their data searchable. To do that, they recognized that it wouldn't be adequate just to put the data online; they needed metadata that would clarify what the substrate of the experiment was, what platform was used, and what experimental conditions the study was trying to probe. That led, about 20 years ago, to something called Minimum Information About a Microarray Experiment, or MIAME. And MIAME is more than just a clever acronym; it really has revolutionized the way people in biology think about describing metadata, because MIAME said: look, it's not good enough just to say who you are and what your study was in very broad terms; these are the kinds of things you have to say about your study if someone else is going to reuse your data and make sense of those data. MIAME caught on like wildfire, and not only did we get MIAME, but the biological community recognized that for all the standardized kinds of high-throughput experiments they do routinely, there was a need for these minimum-information models that would clarify what the metadata need to describe in order for some third party to make sense of what experiment was done. So we have MIATA, and we have MIRIAM, and MINSEQE, and MIFlowCyt, and really dozens and dozens of these standards that describe the minimal things you need to say about an experiment for someone to actually make sense of what you've done. That is really important; it's not unique to biomedicine, but it's a really important attribute of the way people in biology think about creating metadata, and one we could take advantage of in CEDAR.

But you need more than just the ontologies, and more than just these kinds of structures for creating metadata; you need the ability to do this in a way that is convenient for the people who are running the experiment. You basically need to make it palatable to describe experiments completely and consistently. People who do experimental work love spreadsheets because they're so easy to use and so familiar, but spreadsheets don't provide the information that's often necessary to do things in a structured way. What we really need are things like CEDAR, which provide a web-based platform that allows investigators to describe their experiments in ways that really clarify for themselves what they've done, clarify for the computer what they've done, and obviously clarify for third parties who want to reuse their data what they can learn from the data that have been put online.

If you look at CEDAR, we can think of it as an approach with three steps. In the leftmost panel, we have the ability to create templates that describe metadata; these templates are often based on the kinds of community standards like MIAME, MIFlowCyt, MINSEQE, and lots of others that are used to describe the information necessary to communicate what was done in a scientific experiment. In the center panel, we have technology that takes the template and fills it in, making it possible to describe in detail what was done in a particular experiment. So we have the metadata template created on the left, and the actual metadata for a given experiment, which uses that template, created in the middle. And at the very right, we have the ability to export that information to some online repository, like the NCBI repositories, or ImmPort, or TCGA, or a lot of other repositories.
In the middle panel, if you will, the one where we say we explore and reuse datasets through metadata, we obviously have the ability to search metadata that have already been created through CEDAR. But another element, which I'll show you in a minute, is that we can use the metadata already entered through CEDAR to learn patterns in those metadata. We can use AI to understand structures in the metadata that can actually help inform the acquisition of new metadata: by knowing our old metadata and understanding the patterns there, we can make filling in these metadata templates as easy as possible for the investigators who are trying to use them.

So here's what CEDAR looks like. When you log in, you get a library of metadata templates, and you can see templates created for a variety of purposes. Suppose we're interested in "BioSample human"; that's the template that allows us to enter metadata describing samples going into the BioSample database at NCBI that refer to human subjects. We click on it, say we want to populate it, and here's what a metadata template looks like in CEDAR. It looks cleaner than those spreadsheets you saw earlier, and it's a set of attribute-value pairs: there's a sample name, an organism, a tissue, a sex, an isolate perhaps, an age, and so on. We can see that in this particular specimen there was a person who was 74 years old who had dermatitis, and there was a cell line that was created, and more and more of these metadata will describe exactly what was done to create the sample used for a variety of experiments.

Okay, so how do we create the template that we fill in to generate these metadata? Remember, creating the templates is the first step in our three-step process, and we have a whole easy-to-use, web-based authoring system that allows us to describe a template. In this case, we're entering information that describes the "BioSample human" template we've been talking about. We can say that there's a sample name, which is alphanumeric; there's an organism, which is alphanumeric; there's a tissue, which is also alphanumeric, but here we can say that we don't want this to be a type-in; we want the value to be selected from an ontology that already exists in that BioPortal resource I showed you earlier. The triangle symbol at the left-hand side of the information presented here indicates that we're going to be using an ontology to supply the values. We click on search, if we like, and what we find is that in BioPortal there are a lot of ontologies that talk about tissues. The best one is at the top: that's Uberon, and this suggests that Uberon might be the ontology we want to use if we're going to ask a user down the road to enter information about tissues. Then we can go look at what Uberon looks like; it seems to have good selections for tissues, so we say, okay, let's add Uberon. So we've added Uberon to our template, and now, if we want to annotate our data with metadata by filling in the template, we can see that Uberon provides the tissue values through a drop-down menu.
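As a rough illustration of what such a template boils down to (typed fields, some constrained to an ontology), here is a toy version in Python; CEDAR's real template model is considerably richer than this sketch:

```python
# A toy, illustrative template in the spirit of "BioSample human":
# field names, expected types, and (optionally) a controlling ontology.
# The field and ontology names mirror the talk; the structure itself
# is a simplification, not CEDAR's actual template format.
biosample_human_template = {
    "sample name": {"type": "alphanumeric"},
    "organism":    {"type": "alphanumeric"},
    "tissue":      {"type": "ontology term", "ontology": "UBERON"},
    "age":         {"type": "integer", "unit": "years"},
}

def validate(instance: dict, template: dict) -> list:
    """Return a list of template violations for one metadata record."""
    errors = []
    for field, spec in template.items():
        if field not in instance:
            errors.append(f"missing field: {field}")
        elif spec["type"] == "integer" and not str(instance[field]).isdigit():
            errors.append(f"{field}: {instance[field]!r} is not an integer")
    return errors

print(validate({"sample name": "S1", "organism": "Homo sapiens",
                "tissue": "blood", "age": "57 year"},
               biosample_human_template))  # flags the malformed age
```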
So instead of having to type in some random value for tissue, CEDAR makes it really easy to say that the tissue is going to be taken from Uberon. And not only are we going to show you the tissues that are in Uberon; we're not going to show you a few dozen tissues and make it really hard. We're going to use our knowledge of the previously entered metadata to put at the top of the drop-down list those values that were selected most frequently by people who previously entered "BioSample human" metadata and talked about tissues. By looking at our previous metadata, we can see that, for example, blood was the most frequently used tissue in the samples annotated previously, and therefore it's a good bet that blood ought to be at the top of our list now, and maybe we can just go select it.

If we look at other entries in this template: suppose we said the tissue was lung, and we want to choose what disease the subject from whom we took the specimen might have had. The list of possible diseases is not an unordered list of all the diseases in the disease ontology; they're ordered according to what diseases we have previously seen in metadata where the tissue was lung and the species was Homo sapiens. We get selections like lung cancer, chronic obstructive pulmonary disease, and squamous cell carcinoma: the kinds of things we'd expect. So we can make it easy for the user to fill in metadata values by taking advantage of our previous metadata entries, learning from them, doing predictive data entry, and reordering our menus to make it really easy to fill in these blanks. Now, is this going to be as easy for investigators as spreadsheets? People still love those spreadsheets a lot, but this approach has turned out to be really easy for us and for our collaborators. When we say the tissue was brain, they get a new drop-down: Parkinson's disease, CNS lymphoma, autistic disorder, and other kinds of diseases that would be associated with the brain. For every selection made through a template in CEDAR, we make entry really easy by learning as much as we can from previous metadata entries, making it simple to click through menus or enter other strings with predictive data entry: fast, and not only fast but very specific and very accurate, so as to create comprehensive, detailed metadata that become searchable and usable by other investigators.
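The ranking idea itself is simple to sketch: condition on what has already been filled in, and sort candidate values by how often they co-occurred in previous metadata. The records below are made up for illustration:

```python
from collections import Counter

# Made-up prior metadata records (context plus the value actually chosen).
previous_entries = [
    {"tissue": "lung", "disease": "lung cancer"},
    {"tissue": "lung", "disease": "chronic obstructive pulmonary disease"},
    {"tissue": "lung", "disease": "lung cancer"},
    {"tissue": "brain", "disease": "Parkinson's disease"},
]

def ranked_suggestions(field: str, context: dict) -> list:
    """Rank values for `field` by frequency among records matching `context`."""
    counts = Counter(
        entry[field]
        for entry in previous_entries
        if all(entry.get(k) == v for k, v in context.items())
    )
    return [value for value, _ in counts.most_common()]

# With tissue already set to "lung", lung diseases float to the top.
print(ranked_suggestions("disease", {"tissue": "lung"}))
```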
Some things are important to remember about CEDAR from an AI perspective. All the semantic components you see in CEDAR (the template elements that build up templates, the templates themselves, all the ontologies, and the value sets used to fill in the blanks) are managed as first-class entities. Most of them are stored in the BioPortal resource, which makes it possible for us to upload new versions of them and to edit them back through that mechanism. The user interfaces take advantage of all these semantic components by generating drop-down menus and whole forms on the fly; basically everything you see in the CEDAR UI for entering metadata is created on the fly from these semantic elements. What that means is that if you want to change the metadata, you don't have to do any new programming or any new UI development. You just change the template, you change the model, and everything else follows: you get a new user interface and you fill in the blanks. All the software components in CEDAR have well-defined APIs, which makes it really easy for different clients to use CEDAR and get access to different parts of the system. And all the metadata in CEDAR are translated to JSON-LD, which is really convenient because we can translate that to RDF, if you prefer RDF, and to lots of other formats, so a standardized representation gives access to all the information necessary to really understand the metadata that investigators have associated with their datasets.
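For a flavor of what the JSON-LD representation buys you, here is a toy metadata instance written as a Python dict: the @context maps plain attribute names to ontology IRIs so that the same record can be read as RDF. The specific IRIs are illustrative, not what CEDAR actually emits:

```python
import json

# Toy JSON-LD metadata record: the @context maps the human-readable
# attribute names onto ontology IRIs, so generic RDF tooling can
# interpret the same key-value pairs. IRIs here are illustrative.
metadata_instance = {
    "@context": {
        "tissue":  "http://purl.obolibrary.org/obo/UBERON_0000479",
        "disease": "http://purl.obolibrary.org/obo/DOID_4",
    },
    "@type": "BioSampleHuman",
    "tissue":  "blood",
    "disease": "dermatitis",
}

print(json.dumps(metadata_instance, indent=2))
```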
So we don't have spreadsheets anymore. We don't have to worry about making the mistake of putting things in the wrong row or column, or not having access to the right ontologies. We don't have to worry about a confusion of information, because people are constrained to enter only information related to the ontologies that the template designers have suggested are appropriate. And we're in a situation where we have lots and lots of folks using CEDAR, giving us feedback, telling us that by creating information through these kinds of templates, using ontology terms, in a web-based manner, they're able to create metadata that they believe will be more useful than the kinds of poor-quality metadata we know are extant in the biomedical resources frequently associated with scientific datasets.

So I would argue that our online data are never going to be FAIR (they're not FAIR now, and they never will be) until we can identify the kinds of reusable templates that give us, in a standardized form, what we want to say about biological experiments in particular and scientific experiments in general, to be certain we're saying everything that someone reusing our data needs in order to understand what experiment was performed. We're not going to have FAIR data until we can use controlled terms to fill in those templates, so that our ontologies give us the guidance we need to ensure that things are represented consistently across experiments. And we need technology that makes it easy for investigators to annotate their datasets in these standardized, searchable fashions. Frankly, we need more than technology: we need a culture. We need the scientific enterprise to recognize that data reuse is important, that FAIR data are important, and that none of this is going to happen until we develop an infrastructure that makes it easy for investigators to create the metadata that make their datasets useful, discoverable, and FAIR. Then the research parasites of the world, people like Purvesh Khatri, my colleague at Stanford, will have the opportunity to learn from datasets that are cleaner, better organized, and more searchable, if we can make the metadata better by using technology that gives us standardized templates and standardized ontologies in the way that CEDAR does. Scientists will recognize that they can use intelligent agents to search for new experimental results that other investigators are generating, in a way that is much more specific than searching the literature, which right now gives us access only to abstracts, or perhaps full text, but always with the limitations of natural language processing; instead, they will have the opportunity of looking at the actual metadata that investigators enter, in ways that give more information about the experiments being performed and more detail about experimental structure, which is really important when they want to know how to expand their research programs. And clinicians who want access to clinical data will be able to understand better how their patients relate to the subjects in clinical trials, in a way that helps us match the clinical trials available online with subjects who may have different kinds of conditions, so that we know the best scientific evidence available for treating patients with unusual conditions.

Overall, we see this technology, and what it will spawn, as a mechanism whereby investigators create very detailed, very structured, very machine-interpretable descriptions of their experiments and add those to the kinds of metadata used routinely to describe experimental datasets online. When that happens, technology such as CEDAR will allow the automated publication, if you will, of scientific results that go beyond the kinds of information we have in our current online journal articles, and instead give us machine-readable information at a level of detail that will allow our intelligent agents, if you will, to search and read the literature as represented in online metadata; to integrate this information with existing datasets in ways that are not possible now; to track scientific advances, again with a level of specificity that goes beyond what is possible when just looking at the literature; and to re-explore existing datasets. Because all this is going to be machine-processable, those agents can suggest the next experiments to perform, and given what's happening in robotics, these agents may actually be able to do those experiments on their own; we'll have to see what happens there. In any case, we see CEDAR as a mechanism that allows us to move beyond spreadsheets, beyond poor-quality metadata, beyond datasets that are not FAIR, and to create datasets that are not only FAIR but contain the kinds of structured, comprehensive information that make it possible for investigators to reuse data in ways that were never possible before, and to create new datasets online easily, so that the scientific community can benefit from the work going on basically throughout. Let me stop there and see if you have any questions.

Thank you, Mark. Anyone who has a question, just speak up, or feel free to put it in the chat and I can read it out.

Hey Mark, that was a great talk. I wonder: is CEDAR available across fields, or is it especially optimized for health datasets?

That's a good point. CEDAR is not biomedically specific. Obviously, being in the School of Medicine and surrounded by biologists, we use CEDAR all the time in this area, and because most of my funding is from the NIH, our collaborators tend to be biomedical. But there is nothing specific to biomedicine here; everything is generic technology using standard Semantic Web approaches. What we can do, and what we have done on a small scale, is allow ontologies in other areas to be stored in BioPortal, and people are creating metadata templates in areas outside of biomedicine. For example, my colleague John Graybeal is working with a group of engineers in Denmark who are interested in physical science and climatology,
and we now have a whole series of metadata templates and ontologies in CEDAR that deal with collecting data from marine-based windmills. That doesn't sound very biological to me, and it shows you that this kind of system is really quite open. One thing I may not have emphasized is that my team is really eager to collaborate with anybody who views this kind of work as valuable, and we love to see CEDAR being used in other areas, because we think the ability to have arbitrary ontologies stored in our ontology repository, as well as templates for a variety of scientific disciplines in CEDAR, really would give us new ways of studying this kind of work.

As you'll see in a moment, we have data infrastructures for learning data, and the metadata challenge is one I totally appreciate. I just sent my software team a link to your website, so thank you.

Yeah, and if you want to collaborate, just drop me a line.

That's great, thank you, Mark. I have a question for you too. CEDAR seems awesome; is it an open-source project, if we wanted to contribute to it?

Yes, it is an open-source project; you can see everything on GitHub. We're trying to make it increasingly modular. It's not engineered in a way that makes it really easy to plug in new components, but again, we're very eager to collaborate, so it is certainly something where we can work on the software engineering as well as on applying it in new application areas.

Awesome, thank you. Any other questions? Anyone? Great, well, we're actually running ahead of schedule. Ken, you're up in five minutes, so if anyone wants to take a five-minute bio break, we can loop back around in five minutes. And Ken, you were going to share your slides? Ken, I think you're on mute.

Yeah, same old Zoom error. I'll test it just to make sure.

Yep, we see it. And there you go, great, perfect. All right, so let's start at 9:25, or sorry, I guess that's 12:25.

12:25, great. Hey Ken, so, you're on mute; I'm going to start with your introduction. Ken Koedinger, and hopefully I pronounced that right, is a professor of human-computer interaction and psychology at Carnegie Mellon University. Dr. Koedinger has a master's degree in computer science and a PhD in cognitive psychology, and experience teaching in an urban high school. His multidisciplinary background supports his research goals of understanding human learning and creating educational technologies that increase student achievement. His research has contributed new principles and techniques for the design of educational software and has produced basic cognitive-science research results on the nature of student thinking and learning. Koedinger directs LearnLab, at learnlab.org, which started with 10 years of National Science Foundation funding and is now the scientific arm of CMU's Simon Initiative. LearnLab built on the past success of cognitive tutors, an approach to online personalized tutoring that is in use in thousands of schools and has been repeatedly demonstrated to increase achievement, for example doubling what algebra students learn in a school year. He was a co-founder of Carnegie Learning, Inc., which has brought cognitive-tutor-based courses to millions of students since it was formed in 1998.
Dr. Koedinger has authored over 250 peer-reviewed publications and has been a principal investigator on over 45 grants. In 2017 he received the Hillman Professorship of Computer Science, and in 2018 he was recognized as a fellow of the Cognitive Science Society. So, Ken, off to you.

Well, thanks for that wonderful and thorough introduction; I appreciate it. I'm going to talk about the use of data to understand learning and to try to improve it, particularly through implementations of educational technology, and in the process I'll be illustrating a couple of data infrastructures that we've created with generous support from NSF, including DataShop and LearnSphere; I invite you to go to those websites and try them out yourself. The key messages I want to make today are, first, that in education there really is a vibrant set of activities around data discovery and reuse. It started in the older fields of AI in education and the learning sciences; in AI in education there's been a lot of effort to create so-called intelligent tutoring systems that mimic human tutors, and in the last 10 years or so new related fields and associated conferences have emerged, including educational data mining, learning analytics, and learning at scale. So there's lots of activity. A lot of that activity is around doing analytics that creates better predictive models, and a lot of it stops there, but I really want to emphasize the importance of going the next step, what we call closing the loop: using discoveries to actually redesign systems, make predictions, and test those predictions in randomized controlled experiments. A couple of discoveries I'll illustrate: one is how students engaging in learning-by-doing activities appear to learn much more than they do by watching a lecture video or by reading text. But more specifically, I want to probe our efforts to optimize that learning-by-doing process by using learning curves to discover hidden skills, which leads to better educational technology and then to better learning. And I'll summarize some of these efforts, popping up to our web-based infrastructure that's designed to make sophisticated analytics easier for social scientists and educators who don't necessarily want to write Python or R code.

As a bit of background, we've had a long history of creating educational technologies, including the math tutors that you heard about in the great introduction. These have been both widely used and widely evaluated, in perhaps one of the biggest educational-technology randomized field trials: 140 schools, where 70 were randomly assigned to use this Cognitive Tutor Algebra course that we had developed. A good chunk, about 40 percent, of that course involves students interacting with our intelligent tutoring system, our cognitive tutors, but we also used cognitive science to develop the text materials and the teacher professional development, so it's a big-package kind of investigation. The other 70 schools used their traditional algebra course materials. This graphic is meant to summarize a key result: the learning gain over the school year was essentially doubled for students using Cognitive Tutor Algebra. We've done similar development efforts and evaluations with online college courses; the Open Learning Initiative here at Carnegie Mellon University has produced lots of online interactive learning materials and learning-by-doing opportunities.
One particularly impressive result was from a statistics online course that was adapted through data to optimize student learning, such that it was taught in half a semester and led to greater learning gains, both on final exams and on these percentage learning gains from a standardized assessment of concepts in statistics. If you know physics education research, it's a little bit like the Force Concept Inventory in physics; this is a similar general test in statistics. The learning gain on the standard exams, I'm sure, was much bigger, but this shows that not only did students get better learning gains in a shorter time, but better transfer as well.

These educational technologies provide lots of opportunity to explore learning by looking at what's sometimes called the data exhaust of student interactions with these systems. The OLI psychology course was used as part of a MOOC developed at Georgia Tech, where the lecture materials were delivered through Coursera, with videos from the Georgia Tech psychology lectures, but the online reading materials and, importantly, the interactive learning-by-doing experiences (these are essentially formative-assessment questions) were provided by the Open Learning Initiative. You can see an example here of an activity related to dimensions of personality, where students get feedback as they drag and drop the different dimensions into this table; if they need to, they can get hints as well. So the instruction is embedded in the context of doing, as students get a sense for what they know and don't know, and then can adapt. We looked at the thousand students who completed the final exam in this course and their variation in watching, reading, and doing, and used some causal-modeling techniques developed here at Carnegie Mellon to build a causal model of the relationships. Controlling for pre-test, do students who engage in more of these doing activities do better across the 11 unit quizzes and the final exam, compared to students who watch more videos or do more reading? These are standardized coefficients, from making each variable essentially a z-score, so this indicates the effect size of an extra standard deviation of doing on the total quiz score, and then a big effect size of the quiz score on the exam. Essentially, the summary result here is that all of these activities produce positive effects, so more of everything is good, but learning by doing in particular produces six times better learning than watching or reading.
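As a reminder of what those standardized coefficients mean, here is a minimal sketch on synthetic data: z-score each variable, fit a regression, and read each slope as the effect, in standard deviations of the outcome, per standard deviation of the predictor. The variable names and numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: quiz score depends more strongly on "doing" than "watching".
doing = rng.normal(size=500)
watching = rng.normal(size=500)
quiz = 0.6 * doing + 0.1 * watching + rng.normal(scale=0.5, size=500)

def zscore(x):
    return (x - x.mean()) / x.std()

# Least-squares fit on z-scored variables: the coefficients are standardized
# effect sizes (SDs of quiz score per SD of each activity).
X = np.column_stack([zscore(doing), zscore(watching)])
coef, *_ = np.linalg.lstsq(X, zscore(quiz), rcond=None)
print(f"doing: {coef[0]:.2f} SD, watching: {coef[1]:.2f} SD")
```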
There's a lot of other learning-science research suggesting that various forms of learning by doing, one of them called deliberate practice, are highly effective, but it really depends on how well tailored the activities are to students' needs: they should be designed to address the edge of students' competence. We all have a sense of our own learning, but it turns out that much of what's going on in the learning process, and even the thinking process, is below the surface of our conscious awareness; it's been estimated that as much as 70 percent of expert knowledge is outside of conscious awareness. So that's a huge opportunity for data to help us gain insight into what's really going on underneath the surface.

When we were building the algebra cognitive tutor, I was interested in exploring why story problems are alleged to be so hard, and I built a set of assessment items that are matched, as these are. We also surveyed math educators and math teachers, who suggested, as I thought at the time, that the story problems and the word problems would be harder, in the sense that if we gave students these kinds of problems, their performance, the percent correct, would likely be lower on those two than on the equation. But it turns out that's not what the data said. What we found, and we have replicated this numerous times, is that in fact the equation was the hardest for beginning algebra students. There are some nuances in the form of these problems that will change these results in sensible ways, but this result was striking and important for our design of the algebra tutor. More generally, it illustrates the idea that experts have a big blind spot with respect to what they know and don't know. Algebra teachers do not necessarily see the hidden skills that students need to acquire to be good at equation solving, and a lot of it comes down to algebra as a language: skills for seeing the grammatical structure, that the multiplication needs to happen before the addition so you can't just add the numbers first, that the asterisk means times, and, in the 6x format, that juxtaposition means times. The semantics, the syntax, the grammar, the lexicon of algebra are things our brains are very good at learning by doing, but we don't necessarily realize how much work happened to get us there, and so as experts we think this is clear and obvious; it just pops into our brain. It turns out that isn't the way it works for novices, and there's a lot of work we could help the brain do by better optimizing the instruction along the way.

So the approach we've taken is represented in this loop. We start, in this particular approach, with data from an existing educational-technology system; we use it to discover these hidden skills; we design better instruction to address those hidden skills; and then we deploy a new version of the system, in comparison with the old, to confirm in these close-the-loop random-assignment experiments that we get better outcomes. I want to walk you through one of those; we've done a number of these now. The source is an intelligent tutoring system. Here's a screenshot, a pretty old one, from a unit on geometric areas. One of the challenges in this context is getting students beyond problems that are just point solutions, where a simple formula answers the problem, to problems where they have to combine multiple formulas to come up with the solution. Here they're asked to figure out the area that's left over when the end of the can is cut out of this metal square. This table starts off empty; as the student works through it, the intelligent tutor is tracking their performance. They make an error at this point and ask the system for help. This column wasn't there at first; the tutoring system suggests they add a column to first find the square area, then a column to find the circle area, and then come back to this. All that data is being logged, and each step can be coded with respect to progress and learning.

Here's the general frame of an error-rate learning curve. On the x-axis is the opportunity count: which opportunity is this for a student to display their competence in one of these formative-assessment activities? If they do struggle, they'll get hints or feedback, so these are opportunities to assess but also opportunities to learn, and what we'd like to see is that the average error rate, across students and across components of competence, across skills and concepts, goes down. If we look at a learning curve straight out of a course, where we don't code it by any particular topic or concept or skill, just by the order of each activity experienced, we basically get a mess. We don't see a learning curve; the error rate may go down for a while, then blip back up, go down, and blip back up. That generic level of coding the data, as though there were just one component of knowledge, "geometry", does not lead to a smooth learning curve. I'm now showing you a screenshot of DataShop, one of the infrastructures I mentioned that we built with NSF's help. If you code the data with one component, you don't get a smooth learning curve; but if the same data are recoded, say with these 12 components, where this blip up here is maybe where trapezoid area is first introduced, we can re-average. What was the 30th opportunity of "geometry" might now be the first opportunity of "trapezoid". The red data summary here shows the decline in error rate associated with each opportunity to practice, and the blue line is a logistic-regression growth model that models the contribution of the difficulty of each of these knowledge components, the rate at which each is acquired, and the student's overall competence, to create a reasonable fit to the data.
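That growth model is commonly written as an additive-factors style logistic regression; a standard formulation (my rendering, not necessarily the exact variant used here) is:

```latex
\ln \frac{p_{ij}}{1 - p_{ij}}
  \;=\; \theta_i \;+\; \sum_{k} q_{jk}\,\bigl(\beta_k + \gamma_k\, T_{ik}\bigr)
```

Here \(p_{ij}\) is the probability that student \(i\) gets step \(j\) right, \(\theta_i\) is the student's overall competence, \(\beta_k\) is the easiness of knowledge component \(k\), \(\gamma_k\) is the rate at which it is acquired, \(T_{ik}\) counts the student's prior practice opportunities on that component, and \(q_{jk}\) indicates whether step \(j\) exercises component \(k\); the plotted error rate is \(1 - p_{ij}\).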
But the key point is that this very general contrast can be used more specifically to probe each one of these 12 components, to see if it shows a smooth learning curve. When we do that, we see that some of the individual curves, even in a reasonably small dataset (I think this was 50-some students), are good learning curves. Some components actually start off with a low error rate, so we could improve the system right away by eliminating that busy work. But what I particularly want to focus on are the ones that look like that overall curve, with a lot of these upward blips in error. What's going on? This particular composed skill labels the step I illustrated earlier: for example, finding the leftover area when you cut a circle out of a square. There were tons of problems like this; sometimes it's adding a triangle on top of a square, and there are various versions of it. What we saw is that the error rate varied quite a bit. All of these steps are essentially the same idea, the same procedure of taking two areas and subtracting, or sometimes adding, them, but the error rate was much higher on this one, very low on these, and sort of medium in here. The key insight we came to is that the scaffolding you provide for students, which is meant to aid their learning, is sometimes over-scaffolding: with the scaffolding provided, they don't have to do as much of the planning work, whereas here they have to demonstrate that they can do the planning on their own. This scaffolding is often used as an instructional manipulation, but it may not be very effective, and it turns out we can model these discoveries more formally. I hope you're tracking the loop I'm going around: we had data, we've made a discovery, and now we're going to redesign. A particular strategy, when you've got a hidden skill that students need more practice on, is to design tasks so that that skill is the thing they practice. This is the key deliberate-practice idea, certainly popular in athletics, where we'll practice kicking the soccer ball into the upper-right corner of the goal.
But this works for cognition too; in reading, for example, phonics is a version of this kind of focus. Here the particular focus is that we're just going to ask students to plan a solution; they don't actually have to execute it. When we do that, we see in the treatment (this is the close-the-loop random-assignment experiment) that we can reduce a lot of the time they spend on individual formulas, most of which they've pretty well mastered (we might have overdone it a little bit), and we dramatically increase the time they spend on these planning steps. Overall, they're spending 25 percent less time. Some people say, who cares about time? But if you think about that at scale, 25 percent less time for learning means three years of college rather than four. And importantly, we get a positive effect: better learning. We've been through this loop a number of times now, and we've built some automated AI search methods to facilitate it; there are papers about that, and those also lead to better close-the-loop outcomes. I didn't start a timer here, but I hope I'm doing okay with time.

You have about a minute left.

A minute left, okay. Let me just say a little bit about DataShop, which we built first: you can share data in any format, but if you share it in the standard format, you get a whole lot of these analytic tools for free. LearnSphere has been an effort to let folks share analytic components; you can go to learnsphere.org and see it. In particular, we have this web-based workflow authoring tool where there's a menu of analytic components that can be dragged out here and configured in various ways. The connections between the components are data tables, a data flow, and the user can adjust the inputs (this one is comparing different statistical models for these learning curves) and then see the output. Importantly, we're helping education researchers and psychologists, folks who want to do these kinds of analytics, and we're also helping folks who are developing new analytics share them, by creating those analytic components and uploading them into LearnSphere. So I will stop there and see if there are any questions.

Right, thank you so much, Ken. Any other questions? No questions. Ken, are you going to be available on the Slack or on the Zoom chat after this?

I guess there's a session immediately following this, but I can go to the open area; I'm forgetting what it's called, but I've been in the environment before.

Great, awesome. So if anyone has any questions for Ken, please ask them there, and if you want to try out the platform, we can also go to Gather Town later.

Gather Town is what I was trying to think of earlier, yeah.

Great. So I'd like to introduce... oh, sorry. So I'd like to introduce our next speaker, Dr. Adriana Kovashka. Adriana is an assistant professor in computer science at the University of Pittsburgh; she was my professor when I took her computer vision class. Her research interests are computer vision and machine learning. She has authored 18 publications in top-tier computer vision and artificial intelligence conferences and journals, like CVPR, ICCV, and NeurIPS (you know all the names), and ten second-tier conference publications, like BMVC and ACCV. She has served as an area chair for CVPR in 2018 and 2021, and for NeurIPS, ICLR, and AAAI, and she's going to serve as co-program chair of ICCV 2025, which is planning way out there. She has been on program committees for over 20 conferences and journals and has co-organized seven workshops.
Her research is funded by the NSF, Google, Amazon, and Adobe. Adriana, welcome; I'm excited to have you here.

Thanks very much for the introduction, and for inviting me. I got a weird Zoom warning; hopefully you can see my slides.

Yes, good to go.

Okay, cool. So my research is about doing two things that are kind of hard to make compatible: one is understanding media, and intent and persuasion in the media, and the other is actually using weak supervision to accomplish this, although along the way I've also collected plenty of not-so-weakly-supervised datasets. I probably don't need to convince you, especially nowadays, that the media affect public opinion and societal outcomes like elections and whatever follows after those. We want to understand implications in the media; in other words, we want to understand the agenda that different media content has. Earlier on we looked at visual advertisements, and then we looked at images and text in political articles. The challenge here is that data is limited. You can take all kinds of pictures of dogs and cats and mountains, and we have lots and lots of those, but images with intent, with some kind of agenda, are not as abundant. Annotating them is also expensive, because they appeal to a human audience with all of the knowledge that audience has acquired along the way, so it's expensive to annotate them with all the knowledge required to properly analyze them. The goal is to learn useful models from whatever data is already available, even if that data is noisy; at least, that's where I'm trying to take the research nowadays.

Just as a motivating slide, here are some images that were very powerful and impactful in their time; the first one is from the '60s, the second from the '90s, and basically these are said to have changed society in some way, which is an over-summary. I got interested in looking at advertisements (I have some examples at the bottom) because, even though you might dislike ads since they're trying to manipulate you, some of them are actually very interesting and creative, and they basically require AI to be solved, to be actually understood. We're not going to understand them completely, but we're trying to get some of the way there. We argue that state-of-the-art vision systems are inadequate to describe the messages hidden behind purposely created ads such as this one. Hopefully it's somewhat clear what this ad is saying; I'll show our ground-truth answer. Vision systems nowadays are much better than they were five or ten years ago, and we can fairly reasonably annotate images with recognized concepts or objects, or even generate full-sentence descriptions. However, these descriptions miss the point of the ad; they miss what the ad means, which is maybe that Burger King must taste really good, since even the competitor's employees secretly buy it. You have to recognize Ronald McDonald from his shoes and maybe his hair, and you also have to recognize (and this requires common-sense reasoning) that he's buying it secretly: he's wearing this trench coat, trying to be in disguise. And vision systems can't do this, just because they were never designed to understand meaning.
So this is our goal: to understand the meaning and intent of these ads. There are several challenges to understanding what all of these various ads and public service announcements mean. A purely visual challenge is that a non-trivial fraction of them show objects in very creative, very atypical ways, so a standard vision system trained on ImageNet, or whatever photorealistic dataset you want, is not going to do well on these. Part of our research is focused on developing robust representations that actually generalize across these modalities, such as photorealistic images or art paintings; this is an offshoot into the more mainstream direction of domain adaptation and generalization. Another visual challenge is that there are implied physical processes in these ads, like melting and crushing. More on the reasoning end, there are associations that humans have acquired over the years, like "guns are dangerous"; maybe we get them from the media or from experience, I'm not sure (some of them are from experience, but not all). So: guns are dangerous, china is fragile, hot sauce is hot, and oven mitts are hot, but in a different way. These are all challenges.

To get started, a few years ago we collected this advertisement dataset. It's not as large as the numbers you've seen for other image datasets, but for something that requires actual human authorship and significant thought, it's pretty reasonable: we have about sixty-five thousand images with various annotations that we crowdsourced, with various mechanisms to ensure quality. That's for images; we have another dataset for videos, but much smaller, and you can think of those videos as stories that we can analyze. As an example of a task we'd like to solve on this dataset: we want the multimodal algorithm to match an image like this, where you see a crashed motorcycle (this is more of a PSA than a product ad), with text somewhat like this: "I should be careful on the road so I don't crash and die," as opposed to something like this, which is close in terms of the objects mentioned but is actually the opposite. Clearly, this very somber ad is not trying to say "buy a motorcycle, they go very fast," because that would be a happier image. We have some way of representing regions in these images, and the slogans, the text embedded in these images. Here's something cool that we focused on: symbolism in ads. You have trees and grass and fruit symbolizing nature, and guns and bullets and knives symbolizing danger, and ads appeal to this symbolism a lot; part of what people study in media studies is how symbolism is used in visual rhetoric. And as a way of transferring knowledge, and essentially data, from other datasets: we don't have objects annotated on this dataset, but we can generate predictions from another model, so we get these dense captions that give us a proxy for what the objects in the image are. Basically, we put all these things together to get an image representation, and then we do metric learning, where we try to bring this image representation close to the representation of the correct piece of text.
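Here is a minimal sketch of that metric-learning step, using a standard triplet loss as a stand-in for whatever objective the actual papers use; the encoders are replaced by placeholder embeddings:

```python
import torch
import torch.nn.functional as F

def triplet_loss(image_emb, pos_text_emb, neg_text_emb, margin=0.2):
    """Pull the image toward the correct statement, push away a wrong one."""
    d_pos = 1 - F.cosine_similarity(image_emb, pos_text_emb)
    d_neg = 1 - F.cosine_similarity(image_emb, neg_text_emb)
    return F.relu(d_pos - d_neg + margin).mean()

# Placeholder embeddings standing in for real image/text encoders.
image_emb = torch.randn(8, 256, requires_grad=True)
pos_text_emb = torch.randn(8, 256)   # e.g., "I should be careful on the road..."
neg_text_emb = torch.randn(8, 256)   # e.g., "buy a motorcycle, they go fast"

loss = triplet_loss(image_emb, pos_text_emb, neg_text_emb)
loss.backward()  # gradients flow back to the (placeholder) image embeddings
print(loss.item())
```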
As a high-level idea of what the results show: ads have the image part, but they also have a slogan embedded, and these results show that the text is more useful for decoding the message of the ad. The message (I should have made this clearer) is basically what the ad calls for you to do, which here is "I should be careful on the road," plus a reason, "so I don't crash and die." We want the system to be able to retrieve this message, and using the slogan alone is much better than using the visual alone, just because the visual is very ambiguous. I can skip that. In video, similarly to images, you have a visual and a textual channel; the textual channel here is speech, and we generally find that if you just use the speech in videos to try to predict the message, it's a lot less useful. Speech is less cleanly mapped to the message of the ad, or vice versa; speech is more ambiguous than slogans in static ads.

Here's an example of how our method can correctly retrieve the right statement. I know there are boxes overlaid here; the boxes are importance regions. The ad shows a lady putting on lipstick, but it's actually a cigarette, so this ad is meant at first sight to look like a beauty ad or a makeup ad. If you look closer (and that's kind of where the power of ads comes in, there's some kind of twist), the twist is that it's a stop-smoking kind of ad, and our method can correctly figure that out. But in terms of the knowledge being used, which was my motivation here: humans have a lot of contextual knowledge, a lot of world knowledge. These ads are not targeting newborns; they're targeting adults who have a lot of experience, and our current method doesn't really have that. We don't want to just learn it with supervised learning; just as humans rely on a knowledge base in their head, we want to utilize an actual knowledge base. Here we look at DBpedia. The problem with that is that a lot of the information you can retrieve about any entity is not going to be relevant: if you try to retrieve information about Nike, the sports company, you get Nike, and that's good, but you also get Nike the asteroid, Nike the Greek goddess, and so on and so forth. As a way of dealing with that, our algorithm, first of all, can learn which pieces of information are relevant, and it can also, to make its training more robust, drop information at training time, and by "drop" I mean entire words. The benefit of that is shown here: we have a graph that connects regions in the image and parts of the slogans. This is a more basic version of our method; the full version can more appropriately utilize external information, like "Chanel is a French privately held company," or here, where the deer is made of trash and the slogan says "rubbish can be recycled, nature cannot," it correctly retrieves information about what "nature" means, as opposed to getting information about, say, Nature the magazine. So basically, its use of external knowledge is significantly more accurate than that of other methods.
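The word-dropping trick is easy to sketch; this is a generic dropout over retrieved knowledge text, not the authors' exact procedure, with a made-up snippet of retrieved text:

```python
import random

def drop_words(text: str, p: float = 0.3, seed: int = 0) -> str:
    """Randomly drop whole words from retrieved knowledge at training time,
    so the model cannot over-rely on any single retrieved token."""
    rng = random.Random(seed)
    kept = [w for w in text.split() if rng.random() > p]
    return " ".join(kept) if kept else text  # never return an empty string

retrieved = "Chanel is a French privately held company"
print(drop_words(retrieved, p=0.3))
```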
The problem of understanding ads is also challenging because of the relationship between the images and the text in those images. Here is a fairly typical ad for clothing, and it says "winter collection"; nothing out of the ordinary, we've seen lots of images like that. But then this image also says "winter collection," and it's just a person dressed in a cardboard box. For the sake of time I'm not going to pause and ask if you know what this is about, but this one is a public service announcement against human trafficking, while the first is a typical clothing ad. So there's an interplay of vision and language here that's interesting and that prior methods haven't looked at, because prior vision-language methods have focused on captioning, where image and text are essentially redundant; here, image and text are complementary and work together to convey a message. Just because image and text co-occur doesn't mean they're actually redundant or identical, so we've looked at whether image and text are parallel or not, and our goal is to do this in a weakly supervised or unsupervised way.

We've also looked at generating ad-appropriate faces; again, we're trying to reuse data from prior models. We see in this example that faces are fairly distinct for different ad categories, and we want to generate them, but our data is diverse yet very limited in count, so standard generation approaches don't work. Instead, what we learn on our dataset is a very sparse attribute signature for each category; for example, faces in domestic-violence ads are going to contain something like a black eye, which we think of as an attribute. We learn the signature from our dataset, but we learn how to go from each attribute to actual pixel space on an entirely separate dataset, and here's an example of the results we get.

As an extension, from this fairly niche and focused space of ads we've gone back to something more mainstream, where we try to discover regions and learn about objects from weak supervision. Here's an example that motivated this: we were able to discover, in our ads dataset, these Oreos and ketchup and so on. In effect, we learned an object detection model without having data specifically for this, without having any boxes as annotations. Our approach learns from unstructured text, captions in this case, which by no means mentions all the objects in the image and definitely doesn't mention their locations; we learn to localize objects such that at test time we can actually provide a box with rather non-trivial and fairly competitive accuracy, by learning to model the ambiguity in text. For instance, here we have three captions for this image; he actually has a tie, but none of the captions mention a tie, and we have a variety of methods for getting the pseudo-label "tie" even though it's not in our captions (one such method is sketched below).
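One way to derive image-level pseudo-labels from captions, including for objects the captions never mention, is to match caption words against class names in an embedding space, so that related words like "suit" can still vote for "tie". This is only a hedged illustration of the idea; the talk mentions a variety of methods, and the class list and embedding lookup below are placeholders.

```python
import numpy as np

CLASSES = ["person", "tie", "bottle", "motorcycle"]  # illustrative classes

def word_vec(word):
    # Stand-in for real word embeddings (e.g. GloVe); returns a fixed
    # random vector per word so the sketch runs self-contained.
    rng = np.random.RandomState(abs(hash(word)) % (2 ** 32))
    return rng.randn(300)

def pseudo_labels(captions, classes=CLASSES, thresh=0.5):
    """Mark a class as present if any caption word is semantically close
    to the class name under cosine similarity."""
    cls_vecs = {c: word_vec(c) for c in classes}
    labels = set()
    for cap in captions:
        for word in cap.lower().split():
            w = word_vec(word)
            for c, v in cls_vecs.items():
                sim = w @ v / (np.linalg.norm(w) * np.linalg.norm(v) + 1e-8)
                if sim > thresh:
                    labels.add(c)
    return labels
```

These image-level pseudo-labels could then supervise a multiple-instance detector over region proposals: each labeled class must explain at least one proposal, which is what lets a model output boxes at test time without ever seeing box annotations.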
Lastly, the previous work was about ads, but more recently we've looked at rhetoric in political articles, which are multimodal. What we want to do is, given an image, predict whether it comes from a left-leaning or a right-leaning media source. We have images paired with the articles, but we just want to predict bias or leaning from the images. For training we use weak labels, such as the bias of the media source, which we know from a media-bias website; so we only have labels at the source level, and they may or may not be correct or relevant for any particular instance. We have a dataset of about one million unique images with paired text. We are going to use the text, but only at training time, not to classify at test time; it's just an auxiliary modality, an auxiliary feature. Our observation is that bias in text is more obvious, so our approach is two-stage: in the first stage of training we do use text as an input, and we learn an entire CNN, including the feature extraction, from images and text; in stage two we retain the feature-extraction part but get rid of the text input, and learn a single shallow classifier on top of the pre-extracted image features (a sketch of this setup follows below). We do this because here we're not trying to classify objects, we're trying to classify left-versus-right bias, and it turns out that image features learned on ImageNet or the like are not actually very useful for that.
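Here is a minimal sketch of that two-stage setup, assuming a toy backbone and a binary left/right head. The architecture and every name here are illustrative; only the staging itself (joint image+text training, then a shallow image-only classifier on frozen features) follows the description above.

```python
import torch
import torch.nn as nn

class BiasModel(nn.Module):
    def __init__(self, text_dim=300):
        super().__init__()
        self.image_net = nn.Sequential(   # stand-in for a real CNN backbone
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128))
        self.text_net = nn.Linear(text_dim, 128)
        self.head = nn.Linear(256, 2)     # left vs. right

    def forward(self, image, text):
        # Stage 1: both modalities are inputs, trained end to end with
        # weak source-level labels and cross-entropy.
        feats = torch.cat([self.image_net(image), self.text_net(text)], dim=1)
        return self.head(feats)

def stage_two_classifier(model):
    # Stage 2: keep the image features learned with the text's help, drop
    # the text input, and train only a shallow classifier on top.
    for p in model.image_net.parameters():
        p.requires_grad = False           # retain stage-1 image features
    return nn.Sequential(model.image_net, nn.Linear(128, 2))
```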
As an example building on our earlier work on generation, and to show that there is actually significant visual bias in how the same person is shown on the left and on the right, we learned to generate faces of well-known politicians without any extra data. Here are some examples of the photos that we're going to modify to be more left- or right-leaning, and we make them extremely left or extremely right just to show the effect more cleanly. Here's our not-so-great reconstruction, but what's interesting is what happens when you take all of these pictures: here this is a smile, and here we retain the pose, his mouth is turned far down, so it's definitely the same pose of the face, but what our model learned is that this same face is now going to look angry on the left, while on the right, where there wasn't a smile, a smile is going to appear, or where there was even anger, a smile is going to appear. I do only have one minute, so I'll just note that the reverse happens on the right. We also have a method that goes back to something more common, retrieval, where given a political image we retrieve text, with a new metric-learning approach, but I'm going to stop there and see if I have thirty seconds for questions. Any questions, anyone? You can put them in the chat or just speak out loud.

I have a quick question.

Go ahead, please.

I'm wondering whether feature engineering plays a role; I guess at some point you were talking about expert input into this. Can you say a little bit about that?

Yeah. A more strongly supervised setting here would be to take an image and say which parts of it make it left-leaning or right-leaning. To give our silly made-up example: there are values associated with the left and with the right, so maybe if you see a table of people sitting and having dinner, that reads as a conservative value, or whatever. Someone could take images and tell us which parts of them are associated with a certain leaning, but we don't want to do that. We still want, at test time, to classify only things that actually carry a bias, rather than, say, classifying images of cats as being left- or right-leaning. At test time we don't actually have expert labels; they're crowd labels, but we ensure consensus in those. There's a lot of media-studies literature that we could use here to learn better features, but our motivation was not to do that. We do have a baseline that's a little more strongly supervised, with actual concepts associated with the left and the right, and we perform comparably to it with this weak supervision. Thank you for the question.

Thanks, Arun. Thanks, Adriana, that was a fascinating talk; amazing work as always. We're going to break now for lunch and a poster session on Gather Town, so we'll be on break until about 2:10 p.m. Eastern. Feel free to hang out here, feel free to hang out in Gather Town, but let's meet back at 2:10 for a fireside chat with the one and only Margella Bear.