Yeah, so as Patricia said, I am a postdoc in the Torsten Schwede group at the Biozentrum of the University of Basel, and I would like to take this opportunity to tell you a little bit about how we are leveraging the deep learning revolution to study the diversity of the protein universe: finding new biology among uncharacterized proteins, uncovering unknown evolutionary relationships, and seeing how applying deep learning to structural biology gives us a much clearer view of what we know about proteins. We know from classic biochemistry and molecular biology classes that life is the result of the interplay of multiple macromolecules, and that proteins play a very big role in life being as it is, because they build most of the machines of the cell and they are found across all kingdoms of life. They are essential and they are everywhere: inside the cell, in the membranes, outside the cell. They have a wide diversity of functions, and this relates, of course, to their amino acid sequences and their 3D shapes. If we look at protein databases like UniProt, there are about 250 million unique protein-coding sequences that we know of, and they are the result of sequencing over 500,000 organisms, again across the tree of life. When we look at how much we know about them, especially what their names are and which annotations these proteins carry, we see that roughly 60% are fully annotated, either because they were experimentally characterized or because they have homologs that were experimentally characterized. Then there are about 20% that just have very ambiguous names like "uncharacterized protein", "protein of unknown function", or "hypothetical protein", and there are no annotations for them. So we don't know, just from their sequence, what they are and what they may be doing. And this is a relatively large fraction, 20%. It's large, but it's also biased.
There's much more we know about the proteins of some organisms than others. If you look at the human proteome, this little brown-purple bar, it goes up to just 7%. Of course, from a human biology perspective, from a medical perspective, that's a relatively large number: for about 7% of the proteins encoded by the human genome, we don't know what they do. But if we go to the environment, especially to marine environments, there's a much higher fraction of such proteins. For example, in this cyanobacterium, about 30% of the proteins encoded by the genome cannot be annotated and are just named hypothetical proteins. So the question is: what is hidden in these proteins? Why are they so poorly annotated? Is it because they are just remote versions of proteins whose function we already know, but they lie beyond the detection horizon of classic homology-based approaches? Or, more importantly, and also more interestingly from my perspective, do they correspond to new biology whose function we don't know? Especially when it comes to pathogens or organisms that live in us, like species from our microbiome, these proteins could be completely new interactors, or parts of mechanisms that these organisms have that we just don't know about. So it would be really interesting and really important to come up with ways of shedding light on these unknowns: defining which ones could correspond to new biology and which ones are just very distant homologs of proteins of known function. To do that, I like to use the analogy of the protein universe. This is not something I came up with, but it's a concept that I really like and that basically drives the research I do now. If we think about the diversity of proteins that could be constructed from an alphabet of 20 amino acids, there's basically an infinite number of possibilities.
And we can see these possibilities as a very large landscape of possible protein sequences. Within it are those that nature sampled, and these are like stars in this universe. Those that are functionally annotated are bright stars, because we can see them: we know what they do. Protein families would then be clusters of stars, so galaxies, and superfamilies would be clusters of galaxies, all dispersed throughout the space, surrounded by these dark regions of protein sequences that nature could sample but so far didn't, or we just don't know it did. From a structural perspective, when we're characterizing this space and characterizing proteins experimentally, what one typically does is select one star in one galaxy and characterize it experimentally: in the case of structures, you could solve its structure and then, using homology-based approaches, expand the annotations to the rest of the galaxy, and maybe also the supergalaxy. This means that we are only able to extend the information by homology to those galaxies or supergalaxies that have at least one element that is experimentally characterized. Of course, there are also cases where these galaxies still remain bright even though we don't have structures for them. Now, thanks to the deep learning revolution, we have access to structural information for basically the entire catalog of proteins in UniProt, and this is thanks to AlphaFold, which was introduced this morning. After AlphaFold was out, they took the entire UniProt, or at least a big fraction of it, and predicted structures for all of these proteins. This means they predicted structures for proteins that already have structurally characterized members, fine, good, but also for those bright ones that don't have structural information.
So those are the galaxies corresponding to families where we know what they do, but just don't know how they look in 3D. But more importantly, they also predicted structures for those uncharacterized, hypothetical proteins. So now there's all of this information from which we can try to learn something about those proteins of unknown function that I introduced in the beginning. What we did then was to leverage all of this information: the annotation information in UniProt, the pre-computed sequence clusters that UniProt provides, which are called UniRef, and the structural information in the AlphaFold database, to model this landscape, at least as covered by the AlphaFold database. This is a bit of a complex slide, but in very simple terms, what I did was to take the pre-computed clusters of sequences from UniRef50. What UniRef50 does is take all the sequences from UniProt, and also some from UniParc, and cluster them based on sequence similarity, at the 50% level. So all proteins inside the same cluster have a minimum of 50% sequence identity, and between cluster representatives you would expect a maximum of 50% sequence identity. Then, for each of these clusters, I collected all the annotations that are available for all of its members and stored them in a database. And for each cluster, I looked at which protein is the best annotated, considering domains but also predicted structural features like coiled-coil and disorder predictions, and ignored all those annotations that correspond to domains of unknown function, hypothetical domains, or putative domains, because these are the dark ones: we know there should be a domain there, but we don't know exactly what it does, or whether it does what we predict.
And this coverage with annotations is what we named functional brightness: the more fully a protein is covered with annotations, the brighter it is. By selecting the best-annotated protein of a UniRef50 cluster, we're selecting the brightest one, and so a UniRef50 cluster is as bright as the brightest protein in it. If there's a cluster where none of the proteins is well annotated, with no annotations over the full protein, it has a brightness of zero: these are dark UniRef50 clusters. After building this fully annotated set of UniRef50 clusters, the idea was to take the representatives of all these clusters and construct a sequence similarity network based on local sequence homology. We could do this with different metrics, but since this was a proof of principle and we were still developing the approach, the first one we tried was MMseqs2, which is really fast at dealing with very large sets of sequences, and with it we were able to construct the network. Now, one important thing is that we didn't construct the network for the entire AlphaFold database, because that corresponds to about 250 million proteins, which means about 50 million UniRef50 clusters, and the distribution of model confidence throughout the AlphaFold database varies quite a bit. So we decided to focus first on those UniRef50 clusters with at least one member that was modelled with really high predicted accuracy by AlphaFold, that is, a pLDDT over 90. This reduced the set to about 50 million proteins, which is about 6 million UniRef50 clusters. That was easier to deal with, though still pretty large, and the largest set ever used to construct such a network. But we have to bear in mind this only corresponds to about 10-15% of all the proteins.
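This brightness bookkeeping can be sketched in a few lines. Everything here is illustrative, not the actual pipeline: the list of uninformative terms, the data shapes, and the function names are my assumptions; the real analysis works over the full UniProt annotation records.

```python
# Hypothetical sketch: a protein's "functional brightness" as the fraction of
# its sequence covered by informative annotations; a UniRef50 cluster is as
# bright as its brightest member. Terms and names are illustrative.

UNINFORMATIVE = {"domain of unknown function", "uncharacterized",
                 "hypothetical", "putative"}

def protein_brightness(length, annotations):
    """annotations: list of (start, end, name) tuples, 1-based inclusive."""
    covered = set()
    for start, end, name in annotations:
        if any(term in name.lower() for term in UNINFORMATIVE):
            continue  # DUFs and the like do not count towards brightness
        covered.update(range(start, end + 1))
    return len(covered) / length

def cluster_brightness(members):
    """members: list of (length, annotations) pairs for one UniRef50 cluster."""
    return max((protein_brightness(l, a) for l, a in members), default=0.0)
```

A cluster where no member carries any informative annotation comes out at brightness zero, which is exactly the "dark UniRef50 cluster" case described above.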
But what we could see, and this is the network we constructed, here on the right, is how the dark UniRef50 clusters, the dark proteins whose function we don't know, are distributed in the network, and you can see different regions where they are. For example, we find clusters like cluster 1 here on the left, where all the UniRef50 clusters are very well annotated. This is a completely bright cluster, and it corresponds to a family of lipid carriers. But then you also have cases where all of the UniRef50 clusters are dark: there's no well-annotated protein in any of them, and they form a cluster by themselves that doesn't connect to any bright ones. These would correspond to what we call unknown protein families. Of course, you can also have mixed cases, where well-annotated UniRef50 clusters connect to somewhat less annotated ones, and those to even less annotated ones, in a gradient, as in the region labeled "mixed clusters" here. These are the cases where the dark, non-annotated proteins just correspond to very remote homologs of bright ones, and thanks to including all of the evolutionary intermediate sequences, we could find a path that connects those dark, difficult-to-annotate proteins to their relatives that are functionally annotated. What we also saw from this network is that it's actually a rich source of new and unknown protein families: we found a really large number of clusters where everything is purple, everything is dark. Another thing I wanted to point out about our network, and you can see it when looking into it, is that there's a middle part where everything is interconnected. This is what I call the middle blob, or the inner blob.
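The bright / dark / mixed partitioning of network clusters just described can be illustrated with a small sketch. The thresholds and data shapes are my assumptions; the real network has millions of nodes and is handled with dedicated graph tooling.

```python
# Illustrative sketch: split an undirected similarity network (edge list plus
# a brightness value per node) into connected components, then label each
# component bright, dark, or mixed. Thresholds are placeholder assumptions.
from collections import defaultdict

def connected_components(nodes, edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:  # iterative depth-first search over one component
            x = stack.pop()
            if x in seen:
                continue
            seen.add(x)
            comp.add(x)
            stack.extend(adj[x] - seen)
        comps.append(comp)
    return comps

def label(comp, brightness, dark_cut=0.05, bright_cut=0.95):
    vals = [brightness[n] for n in comp]
    if all(v <= dark_cut for v in vals):
        return "dark"      # candidate unknown protein family
    if all(v >= bright_cut for v in vals):
        return "bright"    # fully annotated family
    return "mixed"         # dark members likely remote homologs of bright ones
```

The "mixed" components are the ones where the gradient of intermediates lets you walk from a dark node to a functionally annotated relative.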
And this corresponds to about 50% of our set of sequences. These are UniRef50 clusters that cluster together but also have some kind of local homology to other UniRef50 clusters. You can think of it a bit as local routines that are shared between proteins. We know that new proteins emerge by the shuffling of already-defined protein domains, so if you have a protein with domains A and B, another with B and C, and another with C and D, these will of course be connected, because A-B connects to B-C, and that one connects to C-D. So you have this network of shared local motifs between different families. But then you have the remaining 50%, this fringe here where everything is purple, which are those single, individual families that don't have any obvious similarity to anything else that we know. These are the cases most likely to correspond to completely new protein families, and now we can leverage the structural information from the AlphaFold database to learn something about them. One example is this cluster 159 that I showed in the beginning. This cluster has proteins mostly from prokaryotes: there are sequences from archaea and from bacteria. What I show here on the left is the structure predicted by AlphaFold. When I took this predicted structure and searched it against the PDB, the Protein Data Bank, which holds all of the experimental protein structures deposited to date, and also against the AlphaFold database, I didn't find anything that looked like it. This would suggest it's a completely new predicted fold, and it also means I could not learn from similar-looking proteins in the PDB. So then what I did was to leverage methods that use protein structures to predict GO terms. The method I used is called DeepFRI.
It's a neural network that was trained on structural information to predict GO terms, so gene ontology terms that relate to function. As a side result of how the network was trained, it also tells you which residues contribute the most to a given prediction, indicating a possible place in the protein where the predicted activity occurs. I got two different predictions using this method. One is DNA binding, and the residues in red are those that contributed the most to that prediction. I also got a prediction of hydrolytic activity acting on ester bonds, and again the residues in red are those that contributed to that prediction. So this would suggest some kind of hydrolytic activity over DNA or some other nucleic acid. Then I dug a bit further into this possible family, trying to understand whether this cluster is a family or a superfamily, and how different the members of the superfamily are. I collected an enriched set of sequences homologous to the proteins in this cluster and constructed, again, a sequence similarity network for just these sequences, and this is what we see here in the middle. What you can see is that there are at least seven different clusters of proteins, forming basically seven different families, so this cluster 159 actually corresponds to a superfamily. And as these proteins are all in prokaryotes, I could also learn from the genomic context where these proteins are encoded: whether they sit in a possible operon, or whether they may interact with, or be co-transcribed with, some other proteins. Using the sequences in the different clusters as input for genomic context analysis tools like GCsnap or FlaGs, I saw that the members of the different clusters were encoded next to different protein-coding genes, but always with one conserved feature, which is a bicistronic arrangement.
So it means that these proteins, members from cluster 159 and members from each of the families annotated in the sequence network I showed in the middle, would always come with another gene, conserved across all of those genomes. But this partner gene was different between the different families that make up the 159 superfamily, and all of these partners were also hypothetical proteins. None of them were annotated, so all of them are dark too. In one case, actually, our 159 target was even fused with another domain, which is also a domain of unknown function. The only one that could give us some hints was the partner of the members in cluster 6, which is a homolog of a RelB protein, and this is an antitoxin. And indeed, this bicistronic arrangement is very characteristic of toxin-antitoxin systems, where there's a toxic effector that would kill the bacterium, but it is encoded and transcribed at exactly the same time as the antidote, which neutralizes the toxin. If there's some kind of stress, for example phage infection, the antitoxin would unblock the toxin, leading to the death of the bacterium, in order to try to prevent the entire population from being infected by phages. So, in collaboration with Professor Tenson's group in Tartu and Vasili's group in Lund, they did some experiments on this and validated this hypothesis, which is that this bicistronic operon is a toxin-antitoxin system. Here on the right, what you see are the experiments they carried out; on top are the toxicity neutralization assays. When E. coli expresses only the toxin, well, it's not happy, so it dies, which means this protein may have some kind of toxic effect on E. coli. But when it also expresses the cognate, the RelB-like protein, there's a recovery of the phenotype, and E.
coli is happy to live, which means the cognate is neutralizing the effect of the toxin, which is our 159. And when doing metabolic labeling, they saw a reduction in the incorporation of methionine but no reduction in the incorporation of uridine or thymidine, which means there's a reduction in protein production, and this suggests that the members of cluster 159 are translation-targeting toxins in prokaryotes. Another example is cluster 3314. This is an interesting example for a completely different reason, which is that all of the proteins in this cluster have completely different predicted names. They were all hypothetical when we started our work, but in the meantime their names changed, and I will show you why in the next slide, so they now have completely new assigned names. When we looked at the distribution of names within the cluster, they were all over the place, going from prophage protein to NADH quinone oxidoreductase to integrase, so things that are not really related to each other. And we didn't find any clear homology to anything known. At the structural level we found some hits to a tubulin-binding domain, but at the sequence level we could not really see a clear relationship. When looking at the genomic context, there was really no conserved genomic context between the members of this cluster. But when looking at the distribution of which kinds of genes were around them, we saw many proteins that are somehow associated with prophages, so this indicates that the prophage protein title may be correct, but we still don't know what it is. What is really interesting about this cluster is this diversity of names, because the names are the result of using language models to predict names for proteins of unknown function, or hypothetical proteins.
So since about the end of last year, UniProt started using a language model, ProtNLM, to predict names for hypothetical proteins. This is a language model that was trained on the names and sequences of well-studied proteins in UniProt, and it is then asked, given a protein sequence, to predict a name for it. Our hypothesis back then was: well, this may work really well for those hypothetical proteins that are remote homologs of proteins of known function, but if it's a completely new biological system with no characterized homologs, what would the language model do? We know that language models have a bit of a tendency to hallucinate when you ask them about something they don't know anything about; there was this discussion a few months ago about ChatGPT, for example. So we had this hypothesis that, given truly unknown protein families, the language model would hallucinate names, giving a wide diversity of names to the members, and so we could probably distinguish the dark galaxies that correspond to completely new protein families from those that are just remote homologs of really well-studied protein families, based on the distribution of predicted names. So this is what we did. We tested this hypothesis: we took all of those dark galaxies in our network, checked the diversity of words used in the names predicted for their members, and compared it to the diversity in those galaxies that are completely bright, so really well studied. And we saw that there was a clear separation: dark galaxies have a tendency towards a much wider diversity of names than bright ones.
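A crude version of this name-diversity signal can be sketched as follows. This is my reconstruction for illustration only: the exact metric, tokenization, and normalization used in the study may well differ.

```python
# Hypothetical sketch: pool the words of all ProtNLM-predicted names in a
# galaxy and measure how varied they are. A hallucinating model scatters
# unrelated names across a dark family; a well-studied family gets
# consistent names, hence low word diversity.
import re

def word_diversity(names):
    """Fraction of distinct words among all words in the predicted names."""
    words = [w for name in names
             for w in re.findall(r"[a-z0-9]+", name.lower())]
    if not words:
        return 0.0
    return len(set(words)) / len(words)
```

With the names from cluster 3314 as an example, `word_diversity(["prophage protein", "NADH quinone oxidoreductase", "integrase"])` is far higher than the diversity of a family where every member gets essentially the same name.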
And so we put a cutoff at 20% word diversity, which corresponds to about 290 of our galaxies, and then teamed up with the Pfam team, with Tony and Alex at Pfam at the EBI, who are biocurators and define the families in Pfam. We gave them the sets of sequences, and they started curating them and saw that indeed most of these dark galaxies were completely new, dark families that didn't even have a DUF associated with them. So now they are assigned in Pfam, at least to a DUF. They may remain dark, but at least there's a family attributed to them. And this is a good example of how, by doing these large-scale approaches and by knowing how these deep learning methods work, we could take a pitfall of a method and use it to our benefit. Of course, the way this language model is used to predict names in UniProt has now changed slightly; there are more conservative thresholds now, so there isn't so much of this hallucination. Now, as I said, we leveraged first those proteins that have high predicted accuracy structures in the AlphaFold database. Given the structures, we could also have done clustering at the structural level. We didn't do that, because someone else was already doing it, and I completely recommend, if you're interested in this topic, that you check out the AlphaFold clusters done by the Beltrao and Steinegger labs. This was published at the same time as our work, and it's a really nice resource to find proteins clustered at the structural level, where you can also see which clusters are similar to the one containing the protein you're interested in. So it's a really, really interesting resource. What we did instead to look at structures was to leverage these predicted folds to try to find which ones could correspond to unusual folds. I see there is a question in the chat, but I don't see my mouse, so I cannot select it, but maybe we can discuss it.
I can read it: how to uncover unknown proteins which contain highly repetitive sequences or supergene regions in known model species. Second part of the question, yeah. Okay. So actually, where I'm going next is in the direction of repetitive proteins, so you can already go from that, and maybe what I will introduce later with our resource also helps. Otherwise, when I finish the talk, we can discuss it if I haven't answered the question. Perfect, thank you. So, from the structural perspective, we wanted to look at the novelties there, right? Now we have all of these predicted structures, and we know there are proteins of unknown function among them. Could they correspond to new folds, or do they have predicted folds that are similar to proteins of known function, so we could learn from those folds? Of course, we could do that using Martin's and Pedro's resource, but we went in another direction. I'm so sorry, I had the Zoom bar and some windows in front of my slides, but now I can see them. So what we did was to develop a way to look for unusual folds based on local structure representations and their distributions. Janani, another postdoc in the group, developed a method that takes a protein structure and breaks it into overlapping substructures: for each amino acid, going along the sequence in a sliding window, it takes the substructure within a sphere, basically a super-secondary structure element. She then trained a network with contrastive learning that takes all of these structural fragments and discretizes them into one of 1024 shapemers. So basically, this network uses contrastive learning to convert the structural representation into an alphabet with 1024 elements.
And then you can see these shapemers as words in a text: they are the shape-words of a protein structure. Each protein is represented by the average of its shapemer representations, which gives a fixed-length vector, and you can compare it with the distribution of shapemer representations in the Protein Data Bank. If a protein has a shapemer representation that looks like the shapemer representations of things we see in the PDB, then it is common in the PDB, and it's an inlier. But if it doesn't look like those in the PDB, it's an outlier. This is very useful, for example, to detect a diverse set of exceptions that you rarely find in the PDB. One of them is highly repetitive proteins, which goes to the question that was just asked. We know there are not so many repetitive proteins in the PDB, and when you have a repetitive protein, the number of repeat units can basically amplify, especially if it's an open-ended repeat. So you would expect the shapemer representation of a highly repetitive protein to be very different from what we see in the PDB, and these will be strongly highlighted as outliers. Other cases are obligate oligomers. In the AlphaFold database you only have proteins modelled as single chains, but we know that some proteins actually need partners to adopt the correct oligomeric state and the correct structure. For such cases, you would expect that the predicted structure would not look much like what you see in the PDB. You also have fragments: for the same reason as with repetitive proteins, where the counts of the same shapemers vary a lot, the structure representation of a protein that is a fragment of another protein would look very different from the expected full structure. So these would also be highlighted as outliers.
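The shapemer "bag of words" idea just described can be sketched minimally. The alphabet size comes from the talk, but the vector construction details and the outlier rule here are placeholder assumptions; the real method learns the shapemers with contrastive learning and uses a proper outlier detector.

```python
# Minimal sketch: each protein becomes a fixed-length vector of shapemer
# frequencies; proteins whose vectors sit far from the PDB distribution are
# flagged as outliers. The distance-to-mean rule is a stand-in only.
import math

N_SHAPEMERS = 1024  # size of the learned structural alphabet from the talk

def shapemer_vector(shapemer_ids):
    """Normalized counts of the discretized substructures along a protein."""
    vec = [0.0] * N_SHAPEMERS
    for s in shapemer_ids:
        vec[s] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]

def is_outlier(vec, pdb_mean, threshold):
    """Flag profiles far from the PDB average (Euclidean, for illustration)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, pdb_mean)))
    return dist > threshold
```

A highly repetitive protein concentrates nearly all its mass on a few shapemers, so its vector lands far from any PDB-like background distribution, which is exactly why repeats, fragments, and obligate oligomers light up as outliers.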
But we also have novel folds, and we could find all of these cases in our network. In the case of novel folds, we were really excited to see this new predicted fold, which we call the beta flower: repetitive beta barrels that we could find with different numbers of hairpin repeats, from four to six. When we look at the structure, it really looks like a flower, because there's a long loop that makes each repeat look like a petal. We saw that most of our set in our atlas consists of inliers, but we have clusters of outliers that fall into all of these categories, and when comparing the darkness distribution between inliers and outliers, we saw there's much more darkness among the structural outliers than the inliers. Now, we found this to be a very useful network, and so we made a resource with the support of the SIB, which we call the Protein Universe Atlas. It's an interactive version of this network that you can play with. This is a video that I will not explain in detail, but basically you can see the different kinds of annotations over the network; you can query by UniProt AC; you can input your structure, to search for structurally similar proteins; you can input your sequence; and this will then allow you to download all of the data associated with the nodes where your matches fall. As you can see here, you have the table, and you can export it. Now, what are the next steps? Of course, we're not stopping here. This was just the start of a really interesting line of research, in my opinion, but there's a lot that we can improve on, and we're working in that direction. Currently, one of those things is to improve the network and the way we construct it.
As I said, for constructing the network we used MMseqs2, which is a classic homology-based approach, but we can also leverage language models to try to detect more homology than we can with MMseqs2. Here, Lorenzo, a PhD student in the group, is leveraging the intermediate representations that language models provide for proteins to detect remote homology relationships. If you remember from the morning, what protein language models typically do is take a protein sequence and embed each residue into an intermediate representation, the residue embeddings. We could, again as in the structural outlier method, average everything and get a fixed-length vector per protein, but we found that for detecting remote homologies this averaging is not the best thing to do. So instead, what Lorenzo is doing is to develop a method where, if you have two proteins you want to compare and the per-residue embeddings of each of them, you construct a per-residue embedding distance matrix. You would expect that residues that sit in the same sequence and structural context, and so could be related — they don't need to be exactly the same amino acid, just different amino acids that ended up in the same kind of context through evolution — would either be close to each other in this embedding space or have correlated embeddings. So you can compute this per-residue distance matrix, which can then be leveraged to score the similarity between the two proteins. Lorenzo took this matrix, tweaked it a bit, added some normalization to the distances, and came up with EBA, the embedding-based alignment method, to score how likely two proteins are to be similar. And here are examples of the power of the method.
So on the left we have a comparison between proteins that have very low sequence similarity, below 30% sequence identity, but the same structure, the same fold. And you can see here that there's a clear diagonal in the matrix that corresponds to a very high score, that is, a very low distance in the embedding space. But if you have proteins where the sequence similarity is equally low, still lower than 30%, but the structures are completely different, then you don't see these diagonals. Here in the middle you have a case with a shared motif, actually a repetition of the same motif, where the proteins still have lower than 30% sequence identity, and you see the two diagonals, showing that the two proteins overlap really well. Then, when benchmarked for label transfer — so, given a target protein, deciding whether it belongs to the same fold family as another protein — we could see that EBA is very accurate and performs much better than MMseqs2, which has really low performance here. Although EBA is a sequence-based method, it really competes with the state-of-the-art structure-based methods, especially with the normalization that Lorenzo implemented; with just a simple distance in the embedding space, performance is much lower. You can also see that EBA is slightly better than Foldseek, which is a structure-based method. So we are confident that by running EBA over the pairs of proteins where we didn't find any similarity with MMseqs2, we are going to find many more remote homology relationships than we had before, and so we can improve on the landscape that we constructed.
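The per-residue embedding comparison behind EBA can be illustrated with a toy sketch. Real embeddings come from a protein language model, and EBA's actual scoring and normalization are more sophisticated; here the embeddings are plain lists, the similarity is cosine, and the "alignment" is just the best-scoring full diagonal.

```python
# Toy sketch of an embedding-based comparison (not the actual EBA scoring):
# build a residue-vs-residue similarity matrix for two proteins and look for
# the high-scoring diagonal that homologous regions produce.
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v))) or 1.0
    return num / den

def similarity_matrix(emb_a, emb_b):
    """emb_a, emb_b: per-residue embedding lists for two proteins."""
    return [[cosine(ra, rb) for rb in emb_b] for ra in emb_a]

def best_diagonal_score(matrix):
    """Crude stand-in for an alignment score: the best average similarity
    over any full diagonal of the (possibly rectangular) matrix."""
    n, m = len(matrix), len(matrix[0])
    best = float("-inf")
    for offset in range(-(n - 1), m):
        vals = [matrix[i][i + offset] for i in range(n) if 0 <= i + offset < m]
        if vals:
            best = max(best, sum(vals) / len(vals))
    return best
```

Comparing a protein against itself yields a perfect main diagonal, while unrelated embeddings give no diagonal at all, which mirrors the dot-plot-like matrices shown on the slides.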
So with this, I'm finished, and I hope I could convince you that now, thanks to the deep learning revolution, we are closer than ever to exploring the uncharted territory of the protein universe, and that, thanks to the huge success of the AlphaFold method, we now have structural information, at least high-accuracy structural information, for many of those hypothetical, unknown proteins that may actually correspond to new biology, and we can leverage that to learn more about what they do. Our network approach really helps us to more easily pinpoint those cases and prioritize them for experiments. And with it, also, that automated annotation is a really tricky task, and it requires a combination of methods and approaches. So I want to thank you. I want to thank Patricia for inviting me to present our work, and you for listening, and of course the entire Schwede group for helping me with this work and hosting me, and our collaborators, who contributed a lot to making this work so impactful. Thank you very much.