Let's see. Science is a thirst for knowledge. Science is a process. A process to help us think better. It's asking the right set of questions. Science has no political ideology. It's not liberal. It's not conservative. It just is. Scientific thinking is good thinking. We are facing all these complex problems. Climate change. Hunger. Infectious disease. Injustice. How are we going to address those problems? We need to discover faster. You can't give up. You've got to be skeptical. You find methods to try and prove yourself wrong. If a scientist is not open-minded, you are not a scientist. We are never happy, and we never should be happy, because things can always be improved. We have particle physicists. Designers. Engineers. Economists. Chemists. Biologists. Ethnomusicologists. It's really a community endeavor. All the different talents, all the different mindsets, backgrounds, experiences. That's just the mission we have. I am in IBM Research because I love the people I work with. I admire the people I work with. They're incredibly good. And because they're so good, I can be good. There's never been a better time to be a nerd. And that's the beauty of it.

Welcome to What's Next from IBM Research. I'm Shaheen Parks, your host and moderator today. This is the first in a planned series of seminars, which will provide a window into some of the exciting work happening here at IBM Research. The What's Next seminar series will cover topics in areas including artificial intelligence, hybrid cloud, and quantum computing. Each session will be a 30-minute conversation with some of our leading researchers, divided into two parts. The first half of each session will be a technical talk, followed by a moderated discussion. Please feel free to use the chat feature to ask questions or provide comments throughout today's presentation, and then we'll be able to include them in our discussion. Now, I'm very happy to introduce Pin-Yu Chen.
We'll be discussing his work on reprogramming large models with limited resources. Dr. Chen is a research staff member here at IBM, where he is the chief scientist of the RPI-IBM AI Research Collaboration, as well as the principal investigator for several ongoing MIT-IBM Watson AI Lab projects. He holds a PhD from the University of Michigan, and he was awarded an IBM Corporate Technical Award and named a Master Inventor in 2021. Dr. Chen's recent research focuses on adversarial machine learning and the robustness of neural networks. His long-term research vision is to build trustworthy machine learning systems. Pin-Yu, we're very excited to hear more about your work on reprogramming today. So we'll go ahead and pass it over to you.

Okay, yeah, thanks, Shaheen, for the nice introduction. Hi, everyone. So today I'm going to show you a very exciting and cool technology called model reprogramming. I would like to start by quoting Sir Isaac Newton. He once said, "If I have seen further, it is by standing on the shoulders of giants." I personally think this sentence describes well the advances we have observed in AI technology. In some data-rich domains like text, image, and speech, we do have the luxury of these large pre-trained models to help us solve a lot of in-domain challenges. However, one interesting question to ask is: are we able to maximally leverage these pre-trained models to solve problems that we encounter in a new domain, especially resource-limited domains where no pre-trained models are available and the data could be limited, such as medical imaging, molecule learning, and time series? So throughout this talk, I will show you what model reprogramming is and how it works, provide some concrete examples to show its success, and finally explain why model reprogramming works.
So eventually I hope you will reach the conclusion that model reprogramming is indeed a very data-efficient way to do transfer learning without model fine-tuning. The power of pre-trained models has recently been given a new name: foundation models. They have been adopted as the one-for-all solution in very recent AI advances. So what are foundation models? In a nutshell, a foundation model is a high-capacity neural network pre-trained on large-scale data that can then be efficiently fine-tuned on several downstream tasks. One typical example is GPT-3, which is by far one of the largest language models ever trained. It was trained on hundreds of billions of tokens collected from multiple text sources like Wikipedia, and the model itself has 175 billion parameters. Of course, its memory size is gigantic, and the estimated cost to train such a model is close to $12 million. So having invested so much in these large foundation models, one interesting question is: can we continue to reuse these models when we encounter new tasks or new research problems? That brings us to the core question we want to answer in this talk: how can we use foundation models, especially in resource-limited settings? In the standard setting, we often adopt the solution of pre-training plus fine-tuning, because we have sufficient data to pre-train a large model that can then be efficiently fine-tuned on in-domain tasks. For example, you pre-train your language model on Wikipedia, and it can then be used for question answering or natural language understanding. So in the standard setting, these foundation models play a role very similar to that of a Swiss army knife: you can repurpose them efficiently for different tasks. However, the resource-limited setting is a new domain with limited data or even limited compute power.
And usually in that new domain, we don't have such pre-trained models. So in this case, pre-training plus fine-tuning no longer works. That's why we are proposing a new solution called model reprogramming, which reuses a pre-trained model from another domain without fine-tuning it. I also want to make a comparison between transfer learning and model reprogramming, since a lot of you are probably familiar with transfer learning. Transfer learning basically works through the idea of fine-tuning: we take a pre-trained model that was originally trained to solve a general task, like a general image classification problem, and we fine-tune some parameters in that model with respect to a specific task, in this case classifying different dog species. That's usually how fine-tuning works. However, model reprogramming works in a very different fashion. First of all, it features cross-domain learning: here we are really trying to reprogram a pre-trained model from one domain to solve a resource-limited task in a totally different domain. We also want to highlight the data and compute efficiency of model reprogramming. Because model reprogramming does not need to fine-tune the pre-trained model's weights, it can be more data-efficient, and we don't need to worry about fine-tuning billions of parameters. Even better, model reprogramming achieves state-of-the-art performance on several examples that we are going to see shortly. Before I go into the details, I also want to give some background on where this reprogramming idea comes from. This paper, published at ICLR back in 2019, actually introduced reprogramming from the viewpoint of an adversary. The authors viewed reprogramming as a sort of attack on a machine learning system, because an attacker can possibly reprogram a model to perform other tasks without the model developer knowing.
So there were some concerns about repurposing a model to do something it was not designed to do in the first place. But if we look closely at the performance, the authors showed how to reprogram pre-trained image models to solve problems like counting squares in an image, or MNIST and CIFAR-10 classification tasks. For those of you who are familiar with these tasks, you will notice the test accuracies are OK, but not among the greatest. So their conclusion was that this idea of adversarial reprogramming works, but it was still not clear how we can make the most of the fact that models are reprogrammable. We followed up on this idea and asked: what can we do with the fact that we have foundation models in some domains, and we know those models are reprogrammable? That leads to our first work, called BAR, Black-box Adversarial Reprogramming, published at ICML 2020. We wanted to study a few things that were not discussed in the original paper. First, if we can truly do model reprogramming without changing the weights of the pre-trained model, then ideally we should be able to reprogram a totally black-box model. Here, a black-box model means the model details are totally unknown to the user, like a black-box API: we are able to use its inference function, but we don't know what is inside the model. We also wanted to investigate how to do cross-domain learning in the data-limited setting; those are the other two factors that were not discussed in the original paper. On the right, I'm showing a typical example of how this black-box reprogramming works. We take a pre-trained model that was originally designed to classify general images, and we try to see if we can reprogram that model to solve domain-specific problems in the medical imaging domain, like autism spectrum disorder or melanoma image classification.
So these problems often have limited labeled data, because annotations are very expensive in these domains. Next, I'm going to show you how reprogramming, or black-box reprogramming, works. Remember, we have these great foundation models, basically models pre-trained on an original domain; in this case, we are using ImageNet as a motivating example. How reprogramming works is that we look at the target-domain task we are going to reprogram the pre-trained model to solve, and then we introduce several new modules to make reprogramming work. The first module we introduce is what we call an adversarial program. You can think of it as an input transformation layer parameterized by W: it takes the target data as input and outputs a reprogrammed data input. In this case, we place the target data at the center of the model's input and reserve some space on the border as trainable parameters. Together, they constitute the reprogrammed input for the target data. Then we pass the reprogrammed data to the pre-trained model, so we expect to observe predictions over the source domain, in this case the ImageNet labels. The next thing we do is map the source labels to the target labels we want to predict. For example, we map one set of source labels to autism spectrum disorder and another set of labels to non-autism spectrum disorder. That constitutes the forward pass of the reprogrammed model. What remains is how to optimize the weights associated with the input transformation layer to minimize the loss we designed for the target task. Depending on what scenario we are looking at, in the white-box scenario where the model is transparent to us, we can just use typical backpropagation or any gradient-based method to solve for the W that parameterizes the input transformation.
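To make the forward pass just described concrete, here is a minimal sketch in plain NumPy. Everything in it is hypothetical for illustration: the sizes (a 224x224 source input, 32x32 target images, 1,000 source classes, 2 target classes), the random stand-in for the frozen pre-trained model, and the arbitrary many-to-one label assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the pre-trained model expects 224x224 inputs and
# outputs 1000 source-class logits; the target images are 32x32.
SRC_SIZE, TGT_SIZE, NUM_SRC, NUM_TGT = 224, 32, 1000, 2

# Trainable parameters W: a full-size "frame"; the center pixels are
# overwritten by the target image, so only the border is effective.
W = np.zeros((SRC_SIZE, SRC_SIZE))

def reprogram_input(x_target, W):
    """Embed the target image at the center and keep W on the border."""
    z = W.copy()
    off = (SRC_SIZE - TGT_SIZE) // 2
    z[off:off + TGT_SIZE, off:off + TGT_SIZE] = x_target
    return z

def pretrained_model(z):
    """Stand-in for the frozen pre-trained model: returns source logits."""
    return rng.standard_normal(NUM_SRC)

# Many-to-one label mapping: source classes 0..499 -> target class 0,
# 500..999 -> target class 1 (an arbitrary assignment for illustration).
source_to_target = np.repeat(np.arange(NUM_TGT), NUM_SRC // NUM_TGT)

def target_logits(src_logits):
    """Aggregate source-class scores into target-class scores."""
    return np.array([src_logits[source_to_target == t].mean()
                     for t in range(NUM_TGT)])

x = rng.standard_normal((TGT_SIZE, TGT_SIZE))   # a dummy target sample
z = reprogram_input(x, W)
y = target_logits(pretrained_model(z))
print(z.shape, y.shape)                          # (224, 224) (2,)
```

In training, only W would be updated; the model stub and the label mapping stay fixed, which is the point of the method.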
In the black-box setting, we can use a technique called zeroth-order optimization, which uses function values to estimate gradients and update the parameters. That constitutes the backpropagation stage. Notice that in this framework, we did not fine-tune the pre-trained model at all. All we did was add an input transformation layer at the beginning and a label mapping function at the end, and make the whole model end-to-end trainable. Next, I'm going to give some examples to show you the success of model reprogramming. The first example is how we can reprogram a general image classifier to do autism spectrum disorder classification. This is a binary classification task with around 500 samples per class, and each data sample is basically a correlation graph of brain regions from those patients. The source model we are going to reprogram is an ImageNet pre-trained model. Here, because we know the pre-trained model and its architecture, we can do either white-box adversarial reprogramming, or black-box reprogramming by pretending it is a black-box model. In the table you are seeing here, we compare our method to training from scratch and transfer learning, as well as state-of-the-art methods on this dataset. There are several things worth mentioning. The first is data efficiency. We observe that reprogramming performs much better than transfer learning or training from scratch, because in this case there is only a little data to train on or fine-tune with. So it's better to do reprogramming rather than spreading your data too thin across too many parameters to train and optimize. The second is effectiveness. Using this reprogramming idea, the accuracy actually outperforms the state-of-the-art methods, which require a lot of complicated hand-designed features and data augmentation.
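The zeroth-order idea mentioned above can be sketched as follows. This is a generic two-point randomized gradient estimator, not the exact estimator from the BAR paper; the toy loss function and all constants here are my own, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def zo_gradient(f, w, mu=1e-4, q=2000):
    """Two-point zeroth-order gradient estimate of f at w.

    Only function evaluations are used, no backpropagation, which is
    what allows training the reprogramming layer through a black-box API.
    mu is the smoothing radius; q is the number of random directions.
    """
    g = np.zeros_like(w)
    for _ in range(q):
        u = rng.standard_normal(w.size)
        g += (f(w + mu * u) - f(w)) / mu * u
    return g / q

# Toy loss with a known gradient: f(w) = ||w||^2, so grad f(w) = 2w.
f = lambda w: float(np.sum(w ** 2))
w = np.array([1.0, -2.0, 3.0])
g = zo_gradient(f, w)
print(g)   # close to 2*w = [2., -4., 6.]
```

With enough random directions, the estimate concentrates around the true gradient, so the usual gradient-descent loop can run on top of it.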
And we also show practicality, in the sense that even when treating the source model as a black-box model and applying our BAR technique, we still get performance comparable to the white-box adversarial reprogramming method. Next, since we mentioned we are able to reprogram a black-box model without knowing its details, we implemented that idea on the Microsoft Custom Vision API. This is an API that allows us to upload data, and it will train a model for us to do prediction without letting the user know what the model is. In this case, we uploaded a traffic sign recognition dataset, which consists of 43 classes, and we reprogrammed that black-box model to do autism spectrum disorder classification. You will see that it only takes about $20 to get a very accurate model. And of course, if we spend more queries on that API, we get a more accurate reprogramming capability, so accuracy improves at the price of increased cost. But overall, this reprogramming idea really works: we can successfully reprogram a traffic-sign classifier on the cloud to do autism spectrum disorder classification.

The next example is related to time series. Here the source is another data-rich domain where we do have the luxury of a powerful pre-trained model: a speech recognition model that takes speech signals as input and generates voice commands as output. Very similar to what we discussed in the image case, we are interested in reprogramming this model to do time series classification. For example, we may want to classify whether a time series is anomalous or not. Again, we put the target data into the reprogramming layer, introduce some space for trainable weights to obtain the reprogrammed input, then feed this reprogrammed input to the pre-trained acoustic model, and map the source labels to the target labels.
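One simple way to construct the source-to-target label mapping used in both the image and time series cases is frequency-based: let the frozen model vote on the target training data, and assign to each target class the source classes it triggers most often. The sketch below only illustrates that assignment logic; the class counts, the dummy predictions, and the choice of k labels per class are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 10 source classes, 2 target classes, and the
# frozen model's predicted source class for 100 target training samples.
NUM_SRC, NUM_TGT, K = 10, 2, 3
labels_tgt = rng.integers(0, NUM_TGT, 100)   # true target labels
preds_src = rng.integers(0, NUM_SRC, 100)    # dummy source-class predictions

def frequency_mapping(preds_src, labels_tgt, k=K):
    """Assign to each target class the k source classes that the frozen
    model predicts most often on that class's training samples."""
    mapping, taken = {}, set()
    for t in range(NUM_TGT):
        counts = np.bincount(preds_src[labels_tgt == t], minlength=NUM_SRC)
        # pick the most frequent source classes not already assigned
        order = [s for s in np.argsort(-counts) if s not in taken]
        mapping[t] = order[:k]
        taken.update(mapping[t])
    return mapping

mapping = frequency_mapping(preds_src, labels_tgt)
print({t: sorted(int(s) for s in mapping[t]) for t in mapping})
```

At inference time, a target class's score is then aggregated from the scores of its assigned source classes, as in the forward pass described earlier.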
So it's very similar to what we discussed for the image case, and you can imagine how flexible and general this method can be, right? In a way, this method is agnostic to the architecture of the pre-trained model. It can be a transformer-based speech model or some other architecture entirely. All we are really doing is inserting an input layer to reprogram the target input and an output label mapping function. Those are the only two modules we add to reprogram this acoustic model for time series classification. And on the popular UCR time series archive, our method actually outperforms the state of the art on 20 out of 30 tasks. This provides a lot of encouraging news, and also excitement for us to apply this general and flexible technique to solve problems in a new domain with limited data. Finally, having seen the empirical success of model reprogramming, I would like to provide some theoretical justification for why model reprogramming works. The first intuition that may come to mind is that there must be some knowledge transfer from the pre-trained model to the problem we are solving. But that alone cannot fully explain why reprogramming works across two very different domains. So we looked into this problem, and we developed a theory to explain the performance of model reprogramming. We are able to show that the performance on the target task, in terms of the risk function, can be upper bounded by the performance on the source task plus a representation alignment loss between the source data and the reprogrammed target data.
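The bound just described has roughly the following shape. This is a sketch in my own notation, not the precise statement from the papers, which also spell out the constants and assumptions (for example, Lipschitz conditions on the loss):

```latex
\mathcal{R}_{\mathrm{target}}\big(f_{\mathrm{source}} \circ \delta_{W}\big)
\;\le\;
\mathcal{R}_{\mathrm{source}}\big(f_{\mathrm{source}}\big)
\;+\;
C \cdot D\big(\mu_{\mathrm{source}},\ \mu_{\mathrm{target}}^{\mathrm{reprogrammed}}\big)
```

Here $f_{\mathrm{source}}$ is the frozen pre-trained model, $\delta_{W}$ is the input transformation layer, $\mathcal{R}$ denotes the risk on each task, $D$ is a distance between the source representation distribution and the reprogrammed target representation distribution (the alignment loss), and $C$ is a constant. Driving the distance term down is exactly what training $W$ does.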
So this is kind of intuitive in the sense that if I have a perfect dog-versus-cat classifier, and we are going to reprogram that classifier to do autism spectrum disorder classification, then if we can ensure the representation of the dogs is aligned with the representation of the ASD patients, and the cats' representation is fully aligned with the non-ASD patients, we will know that applying this pre-trained foundation model for dog and cat classification can also solve the problem of classifying ASD versus non-ASD. That's the intuition behind this representation alignment loss. To make this idea more concrete, we first look at the distance between the source and the reprogrammed target representations during training. We observe that as model reprogramming trains the input transformation layer, the distance between these two distributions starts to decrease. As this distance decreases, the loss on the target data starts to decrease as well, and at the same time the accuracy on the target data starts to increase, which matches the theory we derived. Once the two data distributions are aligned, you get a reprogrammed model that is as reliable as the model solving the source task. On the right, I'm showing a more visual illustration of how this reprogramming works: the data distribution before reprogramming, compared with transfer learning, and the data distribution after reprogramming. We see that before reprogramming, the two sets of data from different classes are not separable, so you get low classification accuracy. Even if you do fine-tuning, the separation power is still not great. However, with reprogramming, we are able to almost perfectly separate these two classes and achieve high accuracy.
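Monitoring that alignment during training can be sketched like this. The analysis uses a Wasserstein distance between representation distributions; the much cheaper mean-embedding distance below, and the synthetic features, are stand-ins for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_embedding_distance(reps_src, reps_tgt):
    """A simple alignment proxy: Euclidean distance between the mean
    representations of the two domains (the theory uses Wasserstein-1;
    this cheaper statistic is only for illustration)."""
    return float(np.linalg.norm(reps_src.mean(0) - reps_tgt.mean(0)))

# Hypothetical penultimate-layer features (n samples x d dimensions).
reps_src = rng.standard_normal((500, 64))
# Before training: reprogrammed target reps sit far from the source cluster...
reps_tgt_before = rng.standard_normal((500, 64)) + 3.0
# ...after training the input layer, they have moved toward it.
reps_tgt_after = rng.standard_normal((500, 64)) + 0.3

d_before = mean_embedding_distance(reps_src, reps_tgt_before)
d_after = mean_embedding_distance(reps_src, reps_tgt_after)
print(d_before > d_after)   # True: alignment improves as training proceeds
```

Tracking a statistic like this alongside the target loss is one way to observe the trend the talk describes: as the distance shrinks, target loss falls and target accuracy rises.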
So this also holds when we go to more complex data with more classes. This is really the power of reprogramming. Finally, I would like to give three takeaways. First, I introduced model reprogramming in this talk, which I believe is a very powerful paradigm for data-limited, cross-domain transfer learning with these large pre-trained models, or foundation models. Second, I showed you some empirical successes, from general images to medical images, and from human voice to time series. We have actually also shown that we can reprogram a natural language processing model for molecule learning problems, and we can talk more about that in the discussion. Finally, I provided some theoretical justification for why model reprogramming works: the target task can be solved as efficiently as the source task if their representations can be perfectly aligned. Lastly, we have released the code for our published papers to reproduce our results, and here are some references for those of you who are interested in learning more. With that, I will end my talk and give it back to Shaheen.

Thanks, Pin-Yu. As we jump into the discussion portion of this talk, I want to take a minute and invite IBM clients to submit any questions you have through the Webex chat, and we can certainly fold them in. But to get us started, Pin-Yu, it sounds like from what you've said that reprogramming really offers a significant improvement in efficiency over retraining and fine-tuning an existing model. Could you elaborate a little bit on the differences between retraining, fine-tuning, and reprogramming, and maybe talk a little bit about when you might choose one versus the other?

Yeah, Shaheen, I think this is a great question to help us summarize what we observed in our research. If we think of this axis as a data scale, we go from very limited data, to medium-sized data, to sufficient data.
So in the data-rich regime, either fine-tuning or training from scratch will give you good performance, because it's data-rich and we should let the machine learning model learn how to solve the task by exploring the data. However, things become more interesting and challenging when we go to the data-limited regime, where we don't have enough data to train a large model like a neural net. In this case, we really need to leverage the power of model reprogramming to use the knowledge learned from other domains and reprogram the model in an efficient way to solve the challenging problem. Transfer learning, similarly, is best suited to the medium-sized regime, where we have some data, not a lot, but enough to do fine-tuning. But here we are really interested in the resource-limited regime, where we don't have much data, so even fine-tuning will not work as a baseline. We really need to think outside the box and leverage the data more efficiently to do our job.

So it sounds like the availability of data is one critical factor. What about cost? That's something you had also alluded to.

Yes. One thing we are very excited about is that, having seen such big investments in training new pre-trained foundation models, which cost millions of dollars, we would like to reuse them as much as we can, both for computational reasons and for reusing the assets we already have. So other than data efficiency, compute efficiency is also something that reprogramming gives you, because we are only training the input transformation layer instead of fine-tuning the entire network. In terms of the number of parameters we need to train, it is much more compute-efficient.

That's something that most people can appreciate. You had mentioned GPT-3 as an example, which I think most people are familiar with as being very costly, and as a potential candidate.
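As a back-of-envelope illustration of that parameter-count argument, all sizes below are hypothetical: a 224x224x3 input frame with a 32x32x3 target image embedded in it, versus fine-tuning a roughly 25.6M-parameter ImageNet-scale backbone.

```python
# Trainable parameters in the reprogramming layer: only the border of the
# input frame, since the center is overwritten by the target image.
frame = 224 * 224 * 3                 # full input "frame"
center = 32 * 32 * 3                  # overwritten by the target image
reprogram_params = frame - center     # trainable border pixels

# Versus fine-tuning an entire backbone (illustrative ResNet-50-like scale).
finetune_params = 25_600_000

print(reprogram_params)                      # 147456
print(finetune_params // reprogram_params)   # 173, i.e. roughly 170x fewer
```

The exact ratio depends on the model and the frame sizes, but the point stands: the reprogramming layer is orders of magnitude smaller than the network it steers.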
Would you say that all foundation models are good candidates for reprogramming, or are there particular criteria that might favor one set versus another?

Yeah, this is a great question. We don't have a clear answer yet, but we can get some hints from the theory we developed. We can do model selection by looking at the representation alignment loss that we derived. If we have the luxury of selecting one out of multiple foundation models to do reprogramming, we can simply look at the representation alignment loss and choose the model with the smallest one. In principle, that will give you the best performance on the target task we are solving.

Well, that makes a lot of sense as well. I want to switch gears a little bit, because I know a topic that comes up in general in this space is the importance of trustworthy models and explainable models. I'm wondering how that plays in with using reprogramming on an existing model. How can we evaluate the trustworthiness of a reprogrammed model?

Yes, I think this is a great question. Model reprogramming is the first step: it shows we are able to obtain a high-accuracy, high-performing model in a new domain with limited resources. After showing success in terms of classification performance, for example, the next thing we of course need to care about is the trustworthiness of that reprogrammed model. Luckily, we already have a lot of tools that can help us explore these trustworthy dimensions; IBM has contributed a lot of open-source libraries like AI Fairness 360, AI Explainability 360, and the Adversarial Robustness Toolbox. Because of the nature of reprogramming, we can think of our model as nothing but a new neural network or machine learning model, so we can already apply these existing tools to inspect the trustworthiness of the reprogrammed model and make it more trustworthy if necessary.

That's great.
I think being able to reuse the existing tools will make that much smoother. So I have a slightly more tactical question. How do you map labels in your new domain to the labels in the old domain? Would you have to train both the input transformation layer and the output label mapping layer?

Yes, this is indeed a challenging problem, and we tried several strategies. First, we do a random assignment between source labels and target labels. We also tried frequency-based mapping, trying to map the source labels that respond most strongly to each target class. In images we do see some performance difference between frequency-based and random mapping, but for the time series case we do not see much difference. Still, I believe that in general we need a more principled way to automatically map the source and target labels, so that we can maximize the performance of model reprogramming.

That makes a lot of sense. Well, I know we're almost out of time, so I want to ask you a final question about the future vision for this research. Where do you see it going, and how do you see it applying in real-world applications?

Yes. First of all, we are very optimistic about this technology, especially about applying this technique to resource-limited domains with the power coming from foundation models in other domains. That being said, we are still facing some challenges, and we have some ongoing work. I think the most important thing is: how do we automate the pipeline of model reprogramming? Ideally it should be as easy as someone giving us the task they want to solve, and we automate the pipeline in the sense that we search for the best available model to do reprogramming.
And then we will also try to optimize the modules in reprogramming, like the input transformation layer and the output label assignment we just discussed, in an automated fashion, and return the optimal reprogrammed model to the user, so the user can use it without worrying about how to do this model or architecture selection.

It's an ambitious vision, but it's a really exciting one. And so with that, I think let's wrap for today. It's been great spending this time with you and learning more about model reprogramming, and we'll look forward to seeing what happens next. Thanks for-