So welcome, everybody, to the BioExcel webinar series. BioExcel is a European Centre of Excellence for Computational Biomolecular Research. Today we have a webinar on PDB-Dev, a prototype system for archiving integrative structures, and the presenters today are Brinda Vallat from Rutgers University and Benjamin Webb from the University of California, San Francisco. I will host this webinar, and together with me is Julien Sindt from the University of Edinburgh.

Now I want to introduce the presenters. Brinda Vallat is an assistant research professor at the Institute for Quantitative Biomedicine at Rutgers. She is a computational chemist, and her research interests are in the modeling and analysis of macromolecular structures. Her aim is to build tools for data standardization, validation, and archiving, and since 2015 she has been working with Helen Berman and Andrej Sali to create the infrastructure for archiving integrative structures of biological macromolecules. The second presenter today is Benjamin Webb from the University of California, San Francisco. He is a scientific software engineer in the group of Andrej Sali and the lead developer of the Integrative Modeling Platform; here you can see the web page of the platform, which is used to elucidate the structures of large macromolecular assemblies. I will now give the word to Brinda, who will start the presentation; then we will move to Ben, then back to Brinda, and at the end we will have the question session.

Hello everyone, thank you very much, Alessandra and Julien, for giving us this opportunity to participate in the BioExcel webinar series and talk about our work on the development of the PDB-Dev prototype system for archiving integrative structures. What is integrative structure determination? Integrative structure determination is an innovative approach to modeling the structures of biological systems based on data produced by multiple experimental and theoretical methods. It is typically used for modeling the structures of complex macromolecular assemblies that are not amenable to traditional methods of structure determination such as X-ray crystallography, NMR spectroscopy, and electron microscopy. Integrative modeling involves combining data from multiple experimental methods such as X-ray crystallography, electron microscopy, NMR spectroscopy, chemical cross-linking, small-angle scattering, Förster resonance energy transfer, and so on. The experimental data are gathered and converted into spatial restraints, which are combined with known starting structures of the molecular components using modeling algorithms to build the structures of complex macromolecular assemblies.

Before I go into the details of archiving integrative structures, I would like to give some background on the Protein Data Bank and how the project of archiving integrative structures began. The Protein Data Bank is the single global repository for experimentally determined structures of biological macromolecules and their complexes. It was established in 1971 with seven structures and is the first open-access digital resource in the biological sciences. It currently contains over 160,000 macromolecular structures. It is managed by the Worldwide Protein Data Bank organization, which consists of the RCSB Protein Data Bank in the US, Protein Data Bank Japan, Protein Data Bank in Europe, and BioMagResBank.
The PDB archives structures of macromolecules determined using X-ray crystallography, NMR spectroscopy, and three-dimensional electron microscopy. The structures archived in the PDB have atomic coordinates and metadata. The PDB also archives certain structures that are determined using multiple methods. These studies usually use one of the traditional methods, such as X-ray, NMR, or 3DEM, in combination with other methods such as neutron diffraction or small-angle scattering, and these multi-method modeling studies result in structures at atomic resolution. Here is a plot of the multi-method structures in the PDB; you can see that these structures have been increasing over the past decade.

So the next question is, can the PDB archive integrative structures? There are some technical challenges to archiving integrative structures in the PDB at present. These challenges arise from the following features of integrative modeling. Integrative modeling involves restraints from a multitude of experiments such as chemical cross-linking, small-angle scattering, Förster resonance energy transfer, and so on. Integrative models are multi-scale: they are not necessarily atomic and can consist of both atomic and coarse-grained representations. Integrative models can also describe constitutionally and conformationally diverse assemblies, and they can use starting structural models that come from experimental or computational methods. For these reasons, the current PDB infrastructure is unable to handle integrative structures.

To address the challenges involved in archiving integrative structures, the Worldwide Protein Data Bank set up a task force consisting of researchers from the different fields contributing to integrative structural biology. The first meeting of the task force took place in October 2014 at the EBI in Cambridge. Here is the summary of recommendations from the task force, published in 2015. The task force recommended that structural models, data, metadata, and workflows need to be archived; a flexible model representation needs to be adopted; methods need to be developed for assessing model uncertainty; a federated network of archives needs to be created for structural models and experimental data; and publication standards need to be established for integrative modeling studies. This talk will focus on the implementation of the first two recommendations: archiving structural models, data, metadata, and workflows, and adopting a flexible model representation.

With this background, we can state our objectives as follows. Our goals are to build a system for archiving integrative structural models based on the recommendations of the wwPDB Integrative/Hybrid Methods task force, and to develop the tools required for archiving integrative structures. While building the archiving system, we adhere to the FAIR guiding principles of scientific data management, so that archived integrative structures are findable, accessible, interoperable with other data resources, and reusable for future applications.

What are the requirements for archiving integrative structures? The first requirement is the creation of data standards. Data standards are the primary requirement for archiving; they enable automated data processing as well as dissemination of data in a standard form for downstream applications.
Data standards encompass definitions for the data that need to be archived, such as the chemistry of macromolecules and small molecules and the definitions for multi-scale model coordinates. They also need to support the information required for model validation, such as the spatial restraints used as input data and the starting structural models used, and they need to include descriptions of metadata such as authors, software, citations, samples, experiments, workflows, and so on. The second requirement is the archiving system. We need to build the infrastructure required for data collection, processing, archiving, and distribution. Data collection involves collecting the model coordinates along with the associated spatial restraints, the starting structural models, and the information regarding the modeling protocols. Data processing involves any additional curation that may be carried out, model validation, and visualization of the structural models. Data distribution involves creating mechanisms for web services, data download, data set discovery, and search and report tools. Together, these constitute the data pipeline for the archiving system.

So let's look at the first requirement, the creation of data standards. I would like to introduce the PDBx/mmCIF data standards used by the PDB to archive structures of macromolecules determined using X-ray crystallography, NMR spectroscopy, and electron microscopy. PDBx/mmCIF specifies the standards for representing macromolecular structures. It is designed to be extensible in order to handle evolving methods in structural biology. How is PDBx/mmCIF different from the widely used and well-known PDB format? Let us look at an example to understand the differences. The figures here show how atomic coordinate data are represented in the PDB format and in mmCIF. The PDB format has implicit definitions and is limited by fixed columns that cannot handle large structures; for example, large complexes with more than 99,999 atoms cannot be represented in the PDB format. On the other hand, PDBx/mmCIF definitions are explicitly defined in the PDBx/mmCIF data dictionary. The mmCIF data files explicitly include the names of the data items and do not depend on columns organized with fixed widths. This makes PDBx/mmCIF flexible and self-describing. PDBx/mmCIF can handle large structures and thus overcomes the limitations of the PDB format.

So we decided to extend the PDBx/mmCIF data standards for integrative modeling. Standard definitions for the chemistry of polymeric macromolecules and small molecules, atomic model coordinates, and supporting metadata such as source organisms, samples, authors, citations, and so on are taken from the PDBx/mmCIF master dictionary, and the new definitions required for integrative modeling are created in the IHM extension dictionary. What are the different features of the new IHM extension dictionary? The dictionary has added definitions for multi-scale, multi-state, ordered ensembles. What does this mean? Ensembles are sets of models that correspond to multiple solutions. As I mentioned earlier, multi-scale refers to models that have both atomic and coarse-grained representations. Multi-state models correspond to models in multiple conformational or constitutional states, such as bound and unbound states or open and closed states. Ordered models correspond to models that belong to a time-ordered or other path, such as models of macromolecules participating in a biochemical pathway.
In addition, we can have any combination of these four features. For instance, we can have two multi-scale states, each described by an ensemble, or an ensemble of pairs of multi-scale states. So the IHM dictionary provides a flexible model representation that can handle multi-scale, multi-state, ordered ensembles. Another important aspect of the dictionary is that it provides definitions for starting structural models and for the spatial restraints from different experiments used as input in integrative modeling; these are important for model validation. The dictionary provides definitions for experimental and computational models of components that may be used as starting models, including template details for comparative models as well as definitions for the coordinates of the starting models. The dictionary also contains specifications for describing the spatial restraints derived from experimental data such as chemical cross-linking, 2D electron microscopy class averages, 3D electron microscopy maps, small-angle scattering profiles, and so on. In addition, the dictionary has provisions to capture the provenance of starting models and spatial restraints. If the starting models or input experimental data are archived in an external repository, the corresponding accession codes can be provided; if not, a digital object identifier can be included for the input data from which the restraints are derived. For example, if 3DEM maps are used, the maps can be archived in the EMDB repository and the corresponding accession code can be included along with the integrative model. So the IHM dictionary provides the definitions required to handle starting models and experimentally derived spatial restraints.

The dictionary also provides a generic set of definitions to handle the steps involved in the modeling workflow. For example, an integrative modeling study might involve the following steps: gathering experimental data and starting models, translating the experimental data into spatial restraints, configurational sampling to obtain an ensemble of solutions, and then analyzing and validating the ensemble. Generic definitions for capturing these steps in the modeling protocol are described in the IHM dictionary.

To summarize the data standards in the IHM dictionary: the details of the molecular components such as proteins, nucleic acids, and small molecules, and the corresponding references to sequence and chemical information databases, are taken from the PDBx/mmCIF dictionary. The definitions for the starting component models, along with references to models that may be available from structural model repositories, are provided in the IHM dictionary. Similarly, the definitions for spatial restraints derived from experiments, along with references to data residing in experimental data repositories, are also provided in the IHM extension dictionary. Together, these are applied to build an integrative model, which can be any combination of multi-scale, multi-state, ordered ensembles. The flexible model representation provided in the IHM dictionary allows complex integrative models to be described in a standard way. The IHM dictionary is open source and available from GitHub.

Now we move on to the next requirement, which is the creation of an archiving system. We have developed PDB-Dev, a prototype system for archiving integrative structures. The data standards for PDB-Dev are provided by the IHM dictionary.
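Before moving on, to give a flavor of what these extensions look like in an actual data file, here is a minimal, hand-written fragment showing how a coarse-grained bead might be recorded with the IHM dictionary's sphere category. The item names follow the public ihm-dictionary as best I understand it, and the values (one bead per ten-residue segment of chain A, belonging to model 1) are invented purely for illustration:

    loop_
    _ihm_sphere_obj_site.id
    _ihm_sphere_obj_site.entity_id
    _ihm_sphere_obj_site.seq_id_begin
    _ihm_sphere_obj_site.seq_id_end
    _ihm_sphere_obj_site.asym_id
    _ihm_sphere_obj_site.Cartn_x
    _ihm_sphere_obj_site.Cartn_y
    _ihm_sphere_obj_site.Cartn_z
    _ihm_sphere_obj_site.object_radius
    _ihm_sphere_obj_site.rmsf
    _ihm_sphere_obj_site.model_id
    1 1 1  10 A 120.5 88.2 64.7 8.0 . 1
    2 1 11 20 A 131.9 95.4 70.1 8.0 . 1

Atomic regions of the same model are written to the standard atom_site category, so a single multi-scale model can freely mix atomic and coarse-grained parts.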
If you recall this earlier slide, we identified that the archiving system requires tools for data collection, processing, archiving, and data distribution. We will now look at each of these aspects.

PDB-Dev data collection. As a user interested in depositing an integrative model in PDB-Dev, what information would you need in order to submit the structure? The requirements for submission are: atomic or coarse-grained model coordinates; polymeric sequences of the macromolecules and chemical identifiers for small molecules; details of the model representation, structural assemblies, and ensembles; the starting models used and how they map to the deposited models; the spatial restraints used as input; accession codes or digital object identifiers pointing to the experimental data and structural models in other resources; and details of the modeling protocol. Structures can be submitted at this link, which is also available from the PDB-Dev website, and an mmCIF file compliant with the data standards in the IHM dictionary is required.

What are the current processing procedures in PDB-Dev? The current procedures include ensuring that the data are complete, including the representation of the molecular system, the model coordinates, the starting structural models, the spatial restraints, and details of the modeling protocol. We check for compliance with the IHM dictionary, we visualize the structures, and then we issue an accession code. We also allow structures to be kept on hold before publication.

Visualization of integrative structures. There are two applications that enable visualization of integrative structures: the ChimeraX desktop application and the Mol* web application. Both applications support the visualization of multi-scale structures with atomic and coarse-grained representations. The image shown here corresponds to the integrative structure of the 552-subunit nuclear pore complex from budding yeast, archived in PDB-Dev and visualized using ChimeraX. This model was obtained using spatial restraints derived from 3DEM maps, chemical cross-linking data, and small-angle scattering profiles.

PDB-Dev data distribution. We have a newly redesigned, dynamic, responsive PDB-Dev website. We provide new functionality to search for and retrieve archived structures in PDB-Dev using keywords. There are also individual pages for each entry that provide information about the structure, authors, citation, software, and input data types, with links to data in other repositories. Model coordinates can be downloaded from the website, and models can be visualized using Mol*.

This slide shows a snapshot of the structures archived in PDB-Dev. The structures range from small and medium-sized complexes, such as the nucleotide excision repair complex, the mitochondrial iron-sulfur cluster complex, and the 16S rRNA complex with a methyltransferase, to very large complexes such as the nuclear pore complex and an RNF RING-domain complex with the nucleosome. This slide shows the PDB-Dev status as of May 2020. There are 39 released structures in PDB-Dev at present, with an additional 11 entries that have been processed and are kept on hold pending publication. The plot on the left shows the statistics for the different modeling software used. Several structures have been obtained using the Integrative Modeling Platform, Rosetta, and HADDOCK, and additional structures have been obtained with software such as TADbit, FPS, Xplor-NIH, PatchDock, iSPOT, BioEn, DMD, and others.
The plot on the right shows the statistics for the types of experimental data used to derive the input spatial restraints for integrative modeling. Chemical cross-linking is predominantly used, followed by electron microscopy, small-angle scattering, NMR spectroscopy, Förster resonance energy transfer, electron paramagnetic resonance spectroscopy, mutagenesis, and so on.

I would now like to give a brief tour of the PDB-Dev website. Please bear with me for a moment while I move to the web browser. This is the current PDB-Dev website. Once you reach the website, you can do a keyword search or browse all the current entries. For example, if you search for 3DEM, you get a list of all the entries that use 3DEM data in the modeling. You can refine the results, or you can display different fields in the table, such as citation, release date, software, and so on. If you go back to the homepage, you can also browse the list of all the structures; there are currently 39 entries. Let's scroll: you can look at everything in one shot and get an idea of what's there. Let's pick an entry here, the nuclear pore complex. If you go to the entry page, you can get information about the structure itself, the kinds of input data that were used, and links to data in other resources. You can download the structure or view the 3D structure in Mol*. Let's give Mol* a minute to load the structure. This is the three-dimensional structure of the nuclear pore complex visualized using Mol*. You can take a snapshot of the structure, and Mol* provides various tools that can be used to analyze the structure further. Going back to the homepage, there is a link to deposit new structures; user registration and login credentials are required to submit structures to PDB-Dev. Let's go back to the presentation.

With regard to software tools, we have developed the Python IHM software library to support the IHM dictionary. This library has been developed by my co-presenter Benjamin Webb, and it provides a mechanism to programmatically generate mmCIF files compliant with the IHM dictionary. We will now hand the presentation over to Ben, who will describe the Python IHM software library. Over to Ben.

Okay, thanks Brinda. Thanks Alessandra. So yes, as Brinda said, I'm going to talk a little bit about the software, which we call Python IHM, that we use to actually generate these models for deposition in PDB-Dev. And as Alessandra said at the start of the talk, I'm a scientific software engineer; I work for Andrej Sali at the University of California, San Francisco. So just to recap this slide that Brinda showed earlier, our objective in Andrej's lab is to generate these integrative structures, and we developed a piece of software that we call the Integrative Modeling Platform, or IMP, that does this. It takes these various kinds of experimental, physical, and statistical data sources and combines them with structural information about the individual components to build a model that is as consistent as possible with all of that data. We designed the IMP software as a Python and C++ toolbox: a set of small parts that fit together in various different ways, because integrative modeling is a very broad field. There are lots of different problems that we try to solve. The complexes vary dramatically in size and, as Brinda mentioned, there are a lot of different ways in which we can represent them.
There are lots of different sources of experimental information that we use and combine in different ways, and there are even some differences in how we actually do the sampling and scoring of candidate solutions. That is why we developed this software as a sort of mix and match, where we put these parts together to build different protocols. That software has been developed and used for several years to generate a wide variety of these integrative structures, and the code is all open source on GitHub.

So our background is in modeling, but now we have to put on a different hat: we have to think of ourselves as depositors. We need to figure out how we are going to take our models and produce them in the format that works with the deposition system that Brinda talked about. So we wanted software that we could tie into our own software so that we could programmatically generate these IHM-compliant mmCIF files. We didn't want to be generating these things manually; we wanted them to come out automatically as part of our modeling protocol. And as a bonus side effect, we also wanted to be able to take those mmCIF files and read them back in, for example to use them as inputs for further modeling, or to edit them to add additional information, and so on. So we wanted a piece of software that could both read and write these mmCIF files containing IHM data.

Traditionally, we have generated PDB files, but we can't just take a PDB file and convert it, because, as Brinda said, the IHM dictionary is a superset of what is in a PDB file. That would get us the coordinates, which is only a very small part of the problem: we wouldn't have any of the experimental information that we used, the modeling protocol, or the additional metadata that is very useful for someone who then wants to take that model and refine it or understand where it came from. So we have to think about how this ties into the modeling procedure itself. We need to take the actual coordinates and merge them with all the other input information we have, to generate a complete representation of an integrative model in an mmCIF file.

So the Python IHM software grew out of the software that we developed as part of IMP, and we have now put it in a standalone form so that people other than us can use it for their own modeling. It is designed to take the same data model that Brinda talked about, represent it as a hierarchy of Python classes, and then provide a mechanism to write that hierarchy out as an mmCIF file, or to take an mmCIF file, read it in, and generate the Python classes from the file. The intention is that it plugs into modeling or visualization software, acting as a translation layer between the internal representation in that software, whether that is our own software IMP or other modeling software such as HADDOCK, or it can be used in a standalone fashion, where you take your input files, run a little Python IHM script, and generate the outputs that way. The data model in Python IHM is similar to that in the dictionary itself: the hierarchy of Python classes is not exactly a one-to-one mapping, but it corresponds very closely to the IHM dictionary and the PDBx dictionary.
So the Python classes roughly correspond to the tables in the mmCIF file that you saw earlier, and we end up with a hierarchy of Python classes where, at the top, there is what we call a System class that represents the entire modeling system and points to various other classes. I won't go through all the classes here; they are all documented on the Python IHM website. For example, a typical system will have one or more models; you may have an ensemble of multiple models. Those models in turn contain structural information in the form of atoms or coarse-grained spheres, which refer back to information about the chains and the primary sequences of those coordinates in the AsymUnit and Entity classes. Additionally, there are other classes that tie in there; two examples given here are the Dataset class, where you can list all of the input or output data associated with your system, and the Protocol class, where you can give some information about how the modeling was done.

On the right here is a fragment of an mmCIF file, in a syntax similar to what Brinda showed you earlier for the atom_site table. These are the two tables that describe the chains and the entities in a PDBx/mmCIF file: the struct_asym table essentially lists the chains in the file, and the entity table lists information about the primary sequence of those chains. You'll see that the tables are linked by numeric or string identifiers: the table on the right shows an entity with an ID of 1, and the struct_asym table on the left has a chain with an ID of A that refers to that entity with ID 1. The equivalent Python code is shown on the left, where we create a System object and populate it: we define the sequence of one of the entities and then define the A chain (a minimal sketch along these lines appears below). The primary difference is that we use Python references, or pointers, throughout, so you don't have to manually keep track of the identifiers that link these tables together in mmCIF, which can become a housekeeping issue if you try to generate the file by hand. The Python IHM library takes care of all of that for you simply by the nature of references between Python objects.

One concern, and part of the decision we made when we designed the library, was to keep it lightweight. If you're not careful, you end up duplicating in the Python IHM classes the entire data structure that already exists in your modeling program. So we try to avoid duplicating the data: for example, the atomic or coarse-grained structural information isn't stored directly in the Python IHM classes; they act as facades, or wrappers if you like, that delegate that information to the native modeling software objects where possible, so we are not generating an entire separate copy of the model and using twice the memory.

A few features of the library. Part of the reason for not having a perfect one-to-one mapping from tables to Python objects is that the API of the library stays fairly stable. When the dictionary itself changes — how the tables relate to each other, new information being added, and so on — the Python IHM API doesn't change, so we don't have to change the way in which we write these Python scripts.
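To make that concrete, here is a minimal, self-contained sketch of such a deposition script using the public python-ihm API (ihm.System, ihm.Entity, ihm.AsymUnit, ihm.dumper.write); the sequence, descriptions, and file name are invented purely for illustration:

    import ihm
    import ihm.dumper

    # Top-level object representing the entire modeling study
    system = ihm.System(title='Example integrative modeling study')

    # An Entity describes a unique polymer sequence (a short, made-up one here)
    entity = ihm.Entity('MELSKDPQRST', description='Subunit A')
    system.entities.append(entity)

    # An AsymUnit (chain) is one instance of that entity in the assembly;
    # python-ihm assigns the mmCIF identifiers (e.g. entity 1, chain A) on output
    asym = ihm.AsymUnit(entity, details='Subunit A, copy 1')
    system.asym_units.append(asym)

    # Write out an IHM-compliant mmCIF file
    with open('example.cif', 'w') as fh:
        ihm.dumper.write(fh, [system])

Reading works the same way in reverse: ihm.reader.read takes a file handle and returns the corresponding System objects. And because a script like this talks to the stable API rather than to the mmCIF tables directly, it keeps working as the dictionary evolves.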
You could just rerun the script and it will generate a new version of the file that corresponds to the new dictionary. It supports the IHM extension that Brinda mentioned earlier; there is also a FRET dictionary extension covering FRET data, and that is supported by the Python IHM library as well. It of course supports the subset of PDBx needed for the structural information that is also in standard PDB files. It is extensible, so in principle you could add support for other mmCIF dictionaries if you wanted to, or even add custom categories; in IMP, for example, we have a handful of small data items that keep track of information that is only of interest to us as modelers, as opposed to depositors. The mmCIF reader is reasonably fast because it is C-accelerated; it's not the fastest mmCIF reader in the world, but it is relatively fast. And because we have this stable API sitting on top of the data representation, we are not limited to the mmCIF format itself: if you have another file format that conforms to the same data model, we can support that. For example, we support BinaryCIF, which carries all the same data as mmCIF but in a binary file. There is also some basic support for reading the dictionaries themselves, so you can read in an mmCIF file and validate it against those dictionaries.

As I mentioned earlier, we use the library in three different ways. One way is as part of an existing modeling software package, so that as the software takes your input information and generates models, Python IHM sits in there somewhere and generates mmCIF files. Secondly, you can use it standalone: if you have your models and your other data, you can run them through Python IHM as a Python script and generate mmCIF files. And thirdly, you can use it in an input mode, where you take an mmCIF file, read it in, and create that hierarchy of Python objects. So let's look at each of these three in turn.

First, we can use it as part of an existing modeling software platform. We have it tied into our package, IMP, so when we generate models with IMP, IMP's internal data model is mapped to the Python IHM data model. It ends up generating that top-level System object in Python IHM that contains all the information about the modeling. The idea is that it doesn't generate the mmCIF file directly; it generates that top-level object, which you can then further manipulate. And there are a couple of examples linked here, two publications from the last couple of years where we used this software, if you want to look at those later.

I just want to make a point here as an aside. The full story is that, for modeling, we are not usually using the raw experimental data; we usually use data that has been pre-processed in some fashion. For example, if we are using cross-linking data, we typically get a CSV file from our experimental collaborators that just lists the cross-linked residues. That is not the raw experimental data: we don't have the spectra, which is what is really of interest if you want to reproduce the modeling. The IMP software doesn't care about that; it only cares about the CSV file of processed data. But when we come to deposition, we really want to put the original experimental data in. So we don't want the CSV file; we want the accession code of the cross-linking data, which may be in a database like PRIDE or MassIVE, for example.
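As an illustration of what that looks like in python-ihm, one way to record such a dataset is to point it at its database entry. This is only a sketch: the classes used (ihm.location.PRIDELocation, ihm.dataset.CXMSDataset) are, to the best of my knowledge, part of the public API, and the accession code is a made-up placeholder:

    import ihm
    import ihm.location
    import ihm.dataset

    system = ihm.System()

    # Point to the raw mass-spectrometry data in PRIDE rather than to the
    # processed CSV file that was actually fed to the modeling software
    # (PXD000000 is a placeholder, not a real accession)
    loc = ihm.location.PRIDELocation('PXD000000')
    xl_dataset = ihm.dataset.CXMSDataset(loc)
    system.orphan_datasets.append(xl_dataset)

In a real deposition the dataset would then be referenced from the corresponding cross-link restraint, so the link from the deposited model back to the raw data is preserved in the archive.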
And of course there is other metadata that we want to put in the file that simply wasn't available at the time we did the modeling: the publication describing the modeling, for example, isn't available until after the modeling is done, so it's obviously not there. That is why we have this mechanism of using the Python IHM API to take the System object that we generate as part of our modeling procedure and tweak and adjust it, adding information before we actually write out the mmCIF file. To help a little, the library does have some support for extracting metadata, such as PDB headers, from input files automatically. So rather than just noting that a particular PDB file was used, it can look in the headers and find out what the actual identifier was in the PDB database, which is more useful for someone trying to reproduce your work.

The second way in which we can use Python IHM is standalone. If you have previously done some modeling and generated some sort of coordinate file, and you also have the other information you used, you just write a little Python script that gathers all of that information together, runs it through Python IHM, and generates the mmCIF file after the fact. There is another example here of a system we published fairly recently where we did that, and it is also in the Python IHM documentation.

And finally, we can use Python IHM in an input mode. It has full support for reading in mmCIF files and populating the data model. You might want to do that to visualize the models or to use them as input for further modeling. ChimeraX, which Brinda showed in one of her slides earlier, actually uses Python IHM internally: when you read an mmCIF file with ChimeraX, it runs it through Python IHM, extracts that hierarchy of Python objects from the file, and then converts those into ChimeraX's own internal representation of the structure so that it can be displayed.

So just to conclude before I hand back to Brinda: the Python IHM software is all open source and you can get it directly from GitHub, or you can install it with pip, since it is on PyPI; that gets you the latest release. It has a very permissive license, so there should be no issue combining it with a typical software package, open source or not. It has no awkward dependencies: it is pure Python, with a little optional C for the faster reader, so you only need the Python standard library. It works with both Python 2 and Python 3. And it is designed to be used not just with our software but with any modeling package; as I mentioned, there is at least some support in HADDOCK for using it. And of course we welcome contributions from anybody who wants to help develop the software. With that, I'll pass back to Brinda to conclude.

Thank you, Ben. I just want to quickly touch on what we are currently working on. We are working on developing methods for validating integrative structures, and we are also building the infrastructure for automated data harvesting. We need automated tools for data harvesting because users have models and data in many different formats.
So we are creating an automated system for harvesting diverse data.

Future perspectives. Our goal is to build all the tools required for archiving integrative structures. Once the mechanisms for collecting, processing, validating, and archiving integrative structures are fully established through PDB-Dev, these structures will be moved into the PDB archive and the tools will be integrated with the existing PDB tools. Another long-term goal is the creation of a federated network of interoperating resources that contribute to integrative structural biology. This goal poses additional challenges, such as the development of a common set of data standards and of tools for federated data exchange. In 2019, a workshop on federating structural models and data was held during the annual Biophysical Society meeting in Baltimore. The workshop focused on addressing the challenges involved in creating a federation of integrative structural biology resources, and its outcomes have been published in a recent paper. The creation of a global network of interoperating data resources will enable the growth of scientific research and make it possible for the structural biology community to tackle very large structure-determination challenges.

I'd like to thank our team at Rutgers, including my mentors, Helen and John, who have provided invaluable guidance and support throughout the project. I'd like to thank my colleague Cathy Lawson for advising us during the development of the PDB-Dev website, and Maryam Fayazi, a graduate student at Rutgers who implemented the front end of the PDB-Dev website. Among our collaborators at UCSF in Andrej Sali's group, Ben, my co-presenter today, has developed the Python IHM library and has also helped immensely with the development of the IHM dictionary, and Sai and Ignacia are working on developing methods for model validation. I'd like to thank Carl Kesselman, Hongsuda, and Sherbin at USC for collaborating with us on building the automated data harvesting system. I'd like to thank all the wwPDB directors for supporting this project, especially Stephen Burley, the director of the RCSB PDB, for his continued encouragement and support. I thank the ChimeraX team, Tom Ferrin and Tom Goddard, from UCSF. Mol* is a collaborative project between RCSB PDB and PDBe; I'd like to thank the principal developer of Mol*, Alex Rose, at the UCSD division of the RCSB PDB. I'm grateful to all my colleagues at RCSB PDB for their support. I thank the PDB-Dev depositors for providing examples and working with us patiently while we build the archiving system, and all the members of the wwPDB Integrative/Hybrid Methods task force for their recommendations and continued guidance. Funding: I'm grateful to the United States National Science Foundation for funding this project. Thank you all for your kind attention. Please visit the PDB-Dev website to explore or deposit integrative structures.

Now we move on to the Q&A session. I'm happy to answer any questions.

Thank you, Brinda and Ben, for the very interesting talk, and thank you both for getting up as early as you had to in order to give us Europeans the chance to hear what you've been working on. For questions, if you just write your questions in the questions tab, we can get through a couple of them, but we probably won't keep everyone too long. Our first question is from Alexandre Bonvin.
Alexandre, I've unmuted your mic if you'd like to ask the question yourself; otherwise I'll read it out for you.

Hi Brinda and Ben, nice to listen to you today. So it's a question I've probably asked several times. I know the PDB people, for example, have a kind of editor for mmCIF files that they use internally. Are there any plans to provide some kind of friendly interface for users to fill information into these integrative modeling mmCIF files? Not all users will be able to handle a Python API.

Yes. Let me go back to this slide where I said we're building the infrastructure for automated data harvesting. That is exactly what we're doing. We will have a system for testing, probably before the end of summer, and we should have it out for friendly users who can test it. It will have an interface for providing all the data that is required, and additional metadata as well. That is the goal of that project.

Okay, so it's not automated in a software workflow, but it's more an interactive way for the user to input data? Yes, it will be an interface where users can come and enter everything they want, click a button, and finally generate an mmCIF file that can be submitted to PDB-Dev. Okay, very good, thanks. Thank you.

Thank you for that answer. The next question we have is from Carol Burka. I have unmuted your microphone, Carol; if you would like to ask your question, please go right ahead, otherwise I'll read it for you.

Okay, thank you. It was a really nice presentation. May I ask about the ordered states which were shown on one of the first slides? What is their definition and where can I find it? I'm trying to look through the GitHub and I don't know where exactly I should look for them.

You're talking about the multi-scale, multi-state, ordered ensembles? Yes, the ordered states. The ordered states are actually defined as a graph of connected models, with the models being the nodes and the pathway, or ordering information, being the edges. There is documentation on the GitHub site: if you look at the dictionary documentation and go into it, there are notes on how to define the ordered states. You could also look at the dictionary itself, but that will be more complicated; the documentation tells you where to look. Okay, thank you. You're welcome.

Thanks for that. The next question we have is from Dikos Sumpasis. Dikos, I have unmuted your microphone; please feel free to ask your question if you want, otherwise I'll read it for you.

Can I speak now? Yes, you're welcome to speak. Well, I have a trivial problem for this excellent software development: I use a lot of old PDB files in my modeling work. Can I convert them to the new mmCIF format, which I hate as a theoretician, using your Python IHM? I'm sure Ben will answer, but let's say I want to combine various protein segments from old proteins which have been in the PDB for years but are good enough for my purpose. Can I then combine them using your system, even though they are in the old format?

They are in the old format — yes, I understand your question. Yes, sure. There are actually a couple of different tools for doing that. If you are actually carrying out integrative modeling, in which case you need to handle a lot more information about the integrative modeling itself, then Python IHM is the tool to use, and it can definitely handle all of that.
It can take a PDB file, and you can create all the Python classes if you want to add any of the integrative modeling information, or just use the PDB file and convert it to mmCIF. But if it is not integrative modeling — if it is just a regular PDB file — there are also software tools provided on the PDB website that can convert a PDB file to plain PDBx/mmCIF.

Okay. Can I integrate theoretical information? For example, concerning the hydration, which nobody can handle nowadays, but I can handle it. Can I take a model from your database and just input my theoretical prediction for the waters there?

Theoretical predictions for waters, you mean? Yes, for the hydration of the system, because most of these systems are essentially in vacuum, but I work in a wet environment. Right. So the dictionary does allow for water molecules — the solvent — and you can of course add the coordinates for waters as long as you follow the definitions in the dictionary, but then — No, they are not giving enough water information; I can predict what they don't see in the experiments. Right, that's because we archive what the users provide, so if the users did not model the water, then we don't have that information. I model the hydration theoretically, so can I input it to the system? Can you input it? Right now, no. If you added it theoretically and then wanted to submit it back to PDB-Dev, right now we don't support that. Okay, thanks a lot. Yeah, right now we don't support that. Okay, thank you.

Thank you for that. Alexandre, who asked a question previously, has also pointed to a couple of tools available online; I'm currently copying them into the chat. Otherwise, I think that is all for questions today, so I'd like to take this opportunity to thank our speakers again for taking the time to tell us about their very interesting work, and also to make the audience aware of the upcoming BioExcel webinars, one on the HADDOCK 2.4 server, which is taking place on the 11th of June, and we have a couple of webinars lined up for autumn as well. You'll hear more about them either during the HADDOCK webinar or if you go to bioexcel.eu; there is a tab for webinars, and all of our upcoming webinars are listed there. Thank you very much for coming, everyone, and we hope you all have a good day. Thank you. Thank you.