So, hello everybody. I'm Eva, and even though this is a Python conference I won't be talking too much about Python itself. Instead I'd like to tell you the story behind the research I'm working on, which deals with knotted proteins.

Proteins are those tiny things in all our bodies, the key players that make our bodies work. To give you an example: hemoglobin, which you probably all know, is a protein that travels from our lungs to all the other parts of our body and carries oxygen, so it helps us breathe. Proteins are still a bit of a puzzle to us, and they are honestly pretty fascinating to study.

So let's have a look at what a protein looks like. When a protein is being created, it looks like a long string of tiny beads, where the beads, the building blocks of the protein, are called amino acids. There are 20 different types of them; you can imagine 20 different colors of beads, and from these beads you make the whole protein.

Once the protein is finished, it does not stay in this linear form; it immediately folds into some 3D structure, and in that 3D structure it stays for its whole life. The 3D structure is crucial for the protein, because it determines its function. If you look at the structures here, for example the orange one, it looks a bit like a U shape, like a pocket, so its function might be that some other molecule comes and sticks into the pocket. Or the blue one, which is like a tunnel: it sits somewhere in a cell membrane and guards which things can slip in and out of the cell.

Now let's have a look at the folding itself. When you have this protein sequence somewhere in nature, say in a cell, it can fold into its 3D structure almost immediately, in milliseconds. But when you're a scientist and you have the amino acid sequence, which is quite easy to get with experiments, how long do you think it would take you to get the 3D structure? We can do a little warm-up
activity where you can vote with your hand: will it take hours, maybe days, months? You can show it if you have a guess. Yeah, great, most of you are right. It's maybe somewhere there, because if you know what you're doing, if you have a protein you've maybe worked with before, it may take you months, but usually it's more like years. There's a joke, a pretty bad joke, that says that every protein structure in a database took a PhD's life. So it's not that great, right?

This difficulty in obtaining the 3D structures of proteins is very nicely demonstrated by the sizes of the protein databases we have. If you look at the UniProt database, which is a database of all protein sequences, it holds something like 300 million sequences. If you want to work with that, you have to cluster the database, so you end up with something like 50 million sequences. But if we want to look at the 3D structures, which is the important thing, because you want to know the function of the protein,
You're at something like 190 thousand records. So it's not that much, right? This disproportion between getting a protein sequence and getting a protein structure is really nicely demonstrated by the sizes of these databases, and it will definitely not get better by itself. So maybe it would be really good for us if we had some prediction tools that could help us get the 3D structure of a protein: we plug in the protein sequence and it predicts the 3D structure for us.

Luckily, we have something like that, and there's a really nice competition about this. It's called CASP and it runs every two years. The organizers publish some protein sequences with unknown structures, and basically anybody can join this competition with their models and try to predict the structures of these protein sequences. Then the organizers take the real, experimentally determined structures and see who is doing the prediction best.

In past years the leaderboard looked like this: on the x axis you have all the models that joined the competition, and on the y axis you have how well they do in the prediction. And then the year 2020 came, and there was one tool that completely rolled over all the other tools. It was called AlphaFold. The best thing about this is that the tool was so good that its precision was almost at the level of what we are able to achieve with experiments, those costly experiments which we run for years.
So that's a great thing to have. Now let's have a look inside at how AlphaFold works. Not surprisingly, it's a machine learning system trained on all available 3D protein structures. There's a pretty nice schema, which we can split into three parts.

The first part deals with encoding the input protein sequences. Actually, let me mention one thing which is pretty important and interesting about proteins: if you have two protein sequences that are reasonably similar, it's very probable that their 3D structures will look very similar too. In this example we have the hemoglobin protein sequence from human, mouse and fish, and if you compare the human sequence and the fish sequence, maybe only about half of the sequence is the same, but the 3D structures look almost identical, because the function is the same.

Going back to the schema, that's exactly what's happening in the first part. You have your protein sequence on the input, and you search through a database of protein sequences, trying to find something similar to your input sequence. What you hope for is that you can extract some patterns from those hits, something that might be useful for modeling the 3D structure. Another input is a matrix of all pairs of amino acids in the sequence and their potential contacts. And then there's a logical step: you look for some already existing 3D structure that could serve as a template for your input protein.

The second part is the processing of this input information. It's called the Evoformer, which sounds a bit like a transformer, and that's almost exactly what's inside. The third part is the modeling itself, so the output is the 3D coordinates of each amino acid in the input protein.

Now, if you would like to predict the 3D structure of your brand new protein, it's honestly not so easy, because this whole thing is really big. It's
computationally demanding, and one structure may take several hours to compute. So people started thinking: can we do it somehow faster? I mean, it's great that we have this thing that can create a 3D structure in an hour while the experiments take years, but still, maybe something can do even better. The idea is to spot the slowest part, which was the database search for similar proteins. What we can do is replace this slow part with a language model that would capture the idea of the whole protein universe. It would maybe understand proteins the same way the database search does, and it would be much faster, right? And here comes ESMFold, which did exactly this: it took basically the same scheme AlphaFold had, but replaced the slow part with its own language model trained on proteins. Now we can do the prediction in seconds. So that's great.

That was a somewhat longer introduction to proteins, how they work, and their 3D structures. Now let's shift a bit and talk about large language models. So what is a large language model? In simple words, it's a type of AI that is somehow able to understand and generate text. Usually these models are trained on huge datasets, for example all articles from Wikipedia, and they are trained in a very specific manner. On the input you give them a sentence, for example here we have "There was a king who had 12 beautiful daughters", and in the sentence you hide one word. So this word "daughters" will be masked, and what you want from your model is to predict the distribution over this hidden word.
So what the model does is predict a couple of words with some probabilities, where the word "daughters" scores with the highest probability, and that's exactly what you want. What you hope for is that the model can somehow capture the idea of the text, of the language. However, training these types of models is extremely demanding, because you have to have a huge dataset and the model itself is really big, so usually only big companies can afford it. But once the model is trained, it's really good and you can use it for subsequent tasks.

For example, a classical task: you have movie reviews and you're classifying them into positive and negative. What you can do is take one of the already trained models, basically cut off the last layer, which was predicting the masked word, replace it with a layer that predicts positive or negative, and only fine-tune this small part on your specific task. What you hope for is that the model already understands the text, so it's easy to just slightly adjust it and it will adapt to whatever you want.

Nowadays this fine-tuning has become really easy, because we have for example Hugging Face. For those of you who don't know it, a little advertisement: it's a really cool thing, let's say a platform for data science and machine learning where you can easily share your models and datasets. Fine-tuning has become this simple; what you see here is basically the whole code for fine-tuning a model. In the first lines you just download a model that is already available on Hugging Face, you load your dataset, you train it, and that's it.

Now you might wonder how natural language is related to proteins. It's a bit different, right?
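Before answering that, the masked-word prediction described above can be sketched in a few lines of plain Python. This is only a toy illustration: the "model" below is a hard-coded probability table standing in for a real trained network, and the numbers are invented.

```python
# Toy illustration of masked-language-model prediction.
# The "model" is just a hard-coded probability table,
# standing in for a real trained network.

def mask_word(sentence, word):
    """Replace one word in the sentence with the [MASK] token."""
    tokens = sentence.split()
    return " ".join("[MASK]" if t == word else t for t in tokens)

def toy_model(masked_sentence):
    """Pretend model: returns a probability distribution for [MASK]."""
    return {"daughters": 0.71, "sons": 0.18, "castles": 0.06, "horses": 0.05}

sentence = "There was a king who had 12 beautiful daughters"
masked = mask_word(sentence, "daughters")
predictions = toy_model(masked)
best = max(predictions, key=predictions.get)

print(masked)  # There was a king who had 12 beautiful [MASK]
print(best)    # daughters
```

A real masked language model does the same thing at scale: given the masked sentence, it outputs a probability for every word in its vocabulary, and training pushes the true hidden word toward the top.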
Not really, because you can imagine that a protein is, in a way, a language. It's pretty easy: each amino acid has its one-letter code, so you can imagine that each amino acid, each building block of the protein, is a word, and you can then do with the protein the same thing as with normal language. You can mask some amino acids and train a model to predict which amino acid is missing.

People have already tried this. We have for example the ProtBert model, which is trained on basically the whole database of protein sequences. If you spotted the word "Bert" inside the name, yes, it relates to the BERT architecture from natural language processing; it's the same thing, just trained on proteins. Another example is the ESM model, which I already mentioned when talking about predicting 3D structures; it's behind the faster tool.

And since these models are available on Hugging Face, it has become really easy for basically anybody to play with proteins, so even you can try it pretty easily. Finally we're getting to something I'm playing with, and that's proteins with knots. It sounds a bit weird, but it's a pretty fascinating thing, and it builds on a whole mathematical field that goes back to the 18th century.
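The same masking trick works on a protein sequence. One practical detail: protein models such as ProtBert expect the sequence as space-separated one-letter amino acid codes, so each residue becomes one "word". The sequence below is a made-up fragment, not a real protein.

```python
# Treating a protein as a language: one amino acid = one "word".
# The sequence here is a hypothetical fragment, not a real protein.

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

def to_tokens(sequence):
    """Turn 'MKVL...' into space-separated tokens, as protein LMs expect."""
    assert set(sequence) <= AMINO_ACIDS, "unknown residue in sequence"
    return " ".join(sequence)

def mask_position(sequence, i):
    """Mask the i-th residue, mimicking masked-language-model training."""
    tokens = list(sequence)
    tokens[i] = "[MASK]"
    return " ".join(tokens)

seq = "MKVLATGHE"             # made-up 9-residue fragment
print(to_tokens(seq))         # M K V L A T G H E
print(mask_position(seq, 4))  # M K V L [MASK] T G H E
```

From here, training a model to fill in the masked residue is exactly the BERT recipe, just with a 20-letter vocabulary instead of English words.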
It's knot theory, and it's about categorizing different types of knots and trying to distinguish them from each other. On this picture you see different examples of knots, from the simplest one, which is the unknot, to the 3_1 trefoil knot, which is something you probably tie on a shoelace, and many, many others. Maybe just to explain the terminology: the first number refers to the number of crossings in the knot, and the second number is there just to distinguish different knots with the same number of crossings.

How does this work with proteins? It's basically the same thing, but proteins are a bit more complicated, so you maybe don't see at first sight whether a protein is knotted or not. Most proteins are unknotted, meaning that if you pull the protein from both ends it will untie. But there exist some knotted proteins; most of them carry the simplest 3_1 knot, but there are other guys too, like the 6_1 knot, or even a double 3_1 knot in bacteria.

Even though knotted proteins have been studied for quite some time, we're still not 100% sure what the purpose of the knot is. There are some ideas: maybe the knot protects the protein from degradation, or, for example, we know that in some proteins the knot forms the active site, the part responsible for the protein's function, so it's a very important part of the protein. But what we'd really love to know, and still don't, is whether there exists any amino acid pattern that would be responsible for the knotting. That's where our research comes in.

Our idea is to take protein sequences and build a model on them that tries to classify whether a protein is unknotted or knotted, and then to interpret the model somehow and see if we can extract from it some patterns of the knotting. If you want to train a machine learning thing, you need some data for it, right?
So one would say we're very lucky: we have this AlphaFold tool that will predict many, many protein structures, so we can just look through this database, see if there are some knotted structures, take them, mix in some unknotted ones, and build a dataset from that. Easy task. Okay, it's not that easy, because if you do it with this simple approach, you will end up with a dataset that is completely biased, and the model will learn absolutely nothing about the knotting. So you have to be a bit smarter.

As in every machine learning project, the dataset building took us like 80% of the time, and when you already think you're finished, somebody comes along saying "well, we maybe forgot about these things", so you have to rebuild the dataset and you're back at the very beginning. After like half a year we finally got something. We ended up building the dataset almost manually, because we wanted to be really sure that the proteins in it are really knotted. We manually curated these proteins and took only protein families for which there already existed some experimentally determined structure that was knotted. With this approach we got quite a nice dataset, which had some 200,000 proteins and was nicely balanced, which was really good for us.

I already mentioned that these protein models are now available, so we tried to use one of them, ProtBert, and fine-tune it on this dataset. Before we actually started the fine-tuning, we were curious how much the model already knows about knotting. It has been shown that these models have some understanding of protein features, for example from the protein embeddings.
Embeddings are the model's inner representation of the protein, and from them you can already tell, for example, from which kingdom the protein comes, or which structure it possesses. So we tried this for our knotting problem, and we saw that the model already had some understanding of the knotting even before it was fine-tuned. That was a good sign for us.

We then took our dataset, trained the model, and got pretty nice results, with an overall accuracy around 98%. That was actually a bit suspicious. We were worried, for example, that the model might have learned only to recognize the biggest protein family, so we also checked for that, but it looked completely fine.

We then moved on to the interpretation part, where we tried to see which patterns in the sequence the model considers responsible for the knotting. We tried a couple of things, but the best working one was our custom technique of patching, or covering, parts of the sequences: for each patch we measured how much the model's prediction dropped. So basically, if you cover part of the protein sequence, how much does the knotting score drop? With this approach you can find the place where the drop is biggest; with that patch you can basically break the knot, as far as the model is concerned.

You can then aggregate all those patches and see if you can extract some biological meaning from them. We tried it for one particular protein family, and we observed that the model primarily focuses on the end of the knot core. We were also able to extract a pattern there, which is pretty interesting, and we observed that this pattern is closely related to the function of the protein. We would like to continue with this: we tried the interpretation only with one family, so it would be good to extend it to some other families. And there's a bit of a crazy idea,
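The patching technique described above is essentially occlusion-based attribution, and a minimal sketch could look like this. Note that `knot_score` here is a hypothetical stand-in for the fine-tuned classifier (it just looks for a toy motif), and the sequence is invented.

```python
# Sketch of the occlusion ("patching") interpretation technique:
# cover a window of the sequence, re-score it, and record the drop.
# `knot_score` is a hypothetical stand-in for the real fine-tuned model.

def knot_score(sequence):
    """Stand-in scorer: pretends the motif 'GHE' drives the knot prediction."""
    return 0.95 if "GHE" in sequence else 0.30

def occlusion_drops(sequence, window=3, mask_char="X"):
    """Slide a window over the sequence, mask it, and measure the score drop."""
    base = knot_score(sequence)
    drops = []
    for start in range(len(sequence) - window + 1):
        patched = sequence[:start] + mask_char * window + sequence[start + window:]
        drops.append((start, base - knot_score(patched)))
    return drops

seq = "MKVLGHEATR"  # made-up sequence containing the toy motif
drops = occlusion_drops(seq)
most_important = max(drops, key=lambda d: d[1])
print(most_important)  # the window whose masking hurts the score most
```

With the real model, the windows whose occlusion causes the biggest score drops mark the regions the classifier relies on most, and aggregating them over a family is what lets you look for a shared sequence pattern.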
But what might work someday is that we would like to create our own knotted protein, a completely artificially designed thing.

You might ask: is this somehow useful? Honestly, this is a bit of a tricky question for us, but I'll try to give you some examples. There is research suggesting that improper formation of a knot might be somehow related to disease. Another thing we know is that some knotted protein families, for example the SPOUT family, are very important in our bodies, and that the knot is really tightly related to their function. One of those functions can be related to bacterial resistance, so maybe understanding the knotting could help us develop new targets for antibacterial drugs.

To conclude this talk: just remember that proteins are cool, especially knotted proteins, and research on proteins is something really neat and fun. I would like to thank my team for this project, and thank you all for listening.

Q: So if I understand correctly, at this moment you just distinguish whether a protein is knotted or unknotted, and nothing more; you don't try to distinguish what types of knots they are?
A: Not yet, but we might want to try that in the future.
Q: And basically the samples, if I understand correctly, most of the knotted proteins you were using are predicted structures?
A: Yeah.
Q: Are you able to see, like, maybe... sorry, I'll ask later, it's fine.
Q: How much is research on protein modeling limited by post-translational modifications? I guess it's hard to predict them, and they influence the structure and the catalytic activity as well. Do you know how far you can get without considering them, or are there already options to take them into account?
A: I honestly have no idea about post-translational modifications at this stage.
Q: Okay, are there plans for the future?
A: From my side probably not, because I'm not really a lab person, but maybe somebody else will pick up this idea.
Q: It was just a matter of interest.
It was a really interesting presentation, thanks.
Q: Thanks for your talk. You were talking about a solution based on BERT. Have you tried any other deep-learning-based solutions?
A: Yeah, we tried other transformer-based things like the ESM model; it worked very similarly, so we just stuck with this one. We also tried a simple convolutional network, and honestly it worked pretty well, a bit worse, but still really well. So it looks like this knotting problem is somehow quite simple to distinguish. What we also tried was to take just the embeddings from the BERT model and train some small model on top of them. That also worked pretty well, but we stuck with fine-tuning.
Q: Really good talk. I was interested in the LLM stuff. You said you trained it on a protein dataset. Have you tried just dropping in, say, GPT-3 and seeing if it can discern the different proteins as it is?
A: We were thinking about it, and I think those guys over there might try something like that, so connect with them.
Q: Hi, very nice talk and very interesting research. My question is regarding the knotted proteins and the problem of deciding which protein is knotted and which is not. Which technique did you use to distinguish between these two sets?
A: There's a Python package called Topoly, made by a group from Poland. They have been doing this research for quite some time, so they know how these things work. You take your 3D structure and run this tool on it, and it outputs which type of knot you have there.
Q: Okay, but you start from the protein sequence; how did you then get the 3D representation, by some kind of dynamics modeling?
A: By AlphaFold; it's predicted.
Host: We do not have any questions on Discord. If there are any questions here, we still have time. No more? Then thank you, Eva, for your great talk.