So thank you, and thanks to everyone who is attending. I'm going to try to go through a lot of things with you; I hope it won't be too much, and we'll have a pause. The timing of the pause is slightly different from what is written on the web form, because the first part after the introduction is difficult to cut into two. So I will finish this very big part, the detailed description of the Cellosaurus, at 4:20, then we'll have 20 minutes of pause, and then continue with how you can use and search the Cellosaurus, future developments, the conclusion and of course your questions. Okay, so let's dive in. The subject of today is cell lines, and everyone working in labs, in academia and in industry, is using cell lines. Cell lines are ubiquitous: there are hundreds of thousands of labs using them, and we will see this in more detail. Some cell lines are acquired from cell collections; some are transferred between labs, which can cause issues that we will also discuss when we come back to where people get their cell lines. And of course there are many varieties of cell lines. I won't turn this course into the biology of cell lines, but people have heard of cell lines like HeLa, which is a cancer cell line, and there are many other types: cell lines transformed by viruses, embryonic stem cells, pluripotent stem cells, hybridomas to produce monoclonal antibodies. So there is a whole menagerie of cell lines, used depending on what people are doing in different labs. People should think widely about cell lines and not only about cancer cell lines, which of course are widely used but are not the only type that exists.
In terms of resources around cell lines, there are many cell line catalogs (we'll come back to that), at least one ontology and many specialized databases. We'll cover in the next five to ten minutes what exists in terms of ecosystem around cell lines, in terms of databases and websites. But until the Cellosaurus was established, there was no single resource where this information was collected, from which you could start and go on to a lot of other places to gather all the information you want on a given cell line. Now, why the Cellosaurus? In fact, no single decision was taken to build it. In the beginning, before it was even named the Cellosaurus, the need came from inside neXtProt, which is a human protein knowledge platform. We wanted to annotate as precisely as possible experiments carried out on human proteins or on the orthologs of those human proteins in other species. So we wanted to say which tissues they came from, which species and, as many experiments are done in cell lines, which cell line. At that point we naively thought that we would use an existing controlled vocabulary or ontology and just annotate our experiments by saying this was done in cell line XYZ, using the accession number of that ontology. But there was no resource which was comprehensive enough and contained all the cell lines we needed. At that time we didn't need hundreds of thousands of them; we needed only hundreds of cell lines. But even though this number was quite low, there was no resource which listed the cell lines we wanted to indicate when annotating those protein experiments. So we created what was meant to be a small cell line thesaurus, and therefore named it Cellosaurus, a cell line thesaurus.
But more and more people became interested in it and asked if it was possible to add more cell lines and also, which is much more important and is in fact what allows this course to exist, to broaden the scope of what was annotated inside the Cellosaurus. What the Cellosaurus is now, compared to what it was before, is totally different. I can show this in a semi-graphical way. It's of course unreadable on your screen, but you see here the size of a cell line entry, with something like 15 lines of text, in the first release in 2012, and the number of lines of information for the same cell line entry in release 37, which is three releases ago. I didn't update this slide because it would just add more lines and the text would become even smaller. But you see the principle: not only did the number of cell lines increase over time, as we'll see later, but what is captured in the Cellosaurus also increased quite dramatically. What you can find in the Cellosaurus is in fact the biggest part of this course. Now, as I was saying before, we were looking for an ontology or a database on cell lines, and specifically for an ontology. At that time there were two ontologies specific to cell lines. One, which still exists, is the Cell Line Ontology, which currently has about 39,000 terms describing 30,000 cell lines. So there is an amount of redundancy, because the same cell line can be represented by more than one term, but it has high quality in defining cell types and cell categories. Compared to all the cell lines that exist and need to be captured in a database like the Cellosaurus, however, it is not complete enough. And it's an ontology: it's very useful for classifying cell lines, but it doesn't contain much of the information on cell lines which is now captured in the Cellosaurus. The second ontology has disappeared.
It was developed by an Indian company called Molecular Connections, it had about 500 cell lines, and they stopped their efforts in 2013 or 2014. Now, there are some ontologies which capture information on cell lines. The best known is BTO, the BRENDA Tissue Ontology, which has information on almost 2,400 cell lines: the cell lines used in experiments annotated in BRENDA. BRENDA is a database on enzymes which has annotated many papers on those enzymes, and when a cell line is used in those papers, they create an entry for it. It used to have a lot of duplicates and erroneous entries, but we now have a very good collaboration: it's basically a feedback loop where they send us the entries they are creating and we send back any errors or things to be corrected in BTO. Then EBI has the Experimental Factor Ontology, which covers cell lines used in the context of some EBI databases or in ENCODE. It's not very big: 1,300 cell lines. I won't really discuss them, but I put here two other ontologies which contain information on cell lines: BCGO, and MeSH, which is a huge ontology of medical subjects but in terms of cell lines only has 25. Okay, so that's it for ontologies and cell lines. But we have to start at the beginning: who is distributing cell lines? This will tell us where you can find information on them. One of the places where people acquire cell lines is what are called cell collections. Those are institutions that distribute from a few hundred up to about 40,000, for the Coriell Institute, different cell lines. Most of those cell collections distribute not only cell lines but also microorganisms, bacteria, fungi, and some even distribute vectors, plasmids and so on. So they are generally big entities, and they distribute biological samples all over the world.
Most of them are non-specialized, meaning they distribute all kinds of cell lines from a lot of different organisms. But some specialize in one category of cell line, like the two listed here, EBiSC and WiCell, which only distribute stem cells, and some specialize in an organism: the DGRC distributes Drosophila cell lines, the TCB distributes tick cell lines and so on. A few are specific to a country, but this is quite rare; most of them ship cell lines globally. I'm not going to go through the list, but you see here about 40 names, from a lot of different countries, USA, Germany, Italy, Japan, all over the world, of entities which distribute cell lines. The best known is ATCC, the American Type Culture Collection, which is really a huge entity distributing, as I said before, not only cell lines but also microorganisms. A lot of people have heard of ATCC, but they're not the only one. These cell collections are generally, though not always, set up as non-profit entities; some became very big, but they are basically non-profit organizations distributing cell lines all over the world. Then there are companies which have been established and which either create cell lines themselves or distribute cell lines established by different laboratories. Here you have a list of companies: ABM, AddexBio, and you can recognize very well-known companies like PerkinElmer, Millipore, Fuji and so on. There is also a huge number of other companies, at the end of the list here, which only distribute a few cell lines. Some are even created around a single cell line: these are startups where a scientist has created a cell line which they feel is very important and tries to sell it through a company.
There is a problem with some companies; not those listed above, but a number of what I call rogue or pirate companies, which distribute cell lines that they have acquired illegally. This is quite a problem. People could say, well, it's okay, I'm getting my cell line much cheaper than from other companies or from cell collections. Yes, but they don't do any QC, and what you get may not be what you think you're getting. These companies are also giving a bad name to the cell collections, because people think they got an ATCC cell line when in fact they didn't get it from ATCC but from one of those companies, and what they got is probably not what they wanted. Those companies are unfortunately located in one or two countries where it is quite difficult for legal entities to prosecute them. I'm not going to list names, but I guess most of you are intelligent enough to guess where those companies could be located. It's not obvious to spot them, because they generally take American-sounding names and have websites registered in the US or in Europe. So be careful: if you suddenly see a company which sells thousands of cell lines and it's not one of the big known companies or cell collections, chances are it is one of those pirate companies. Now, I would say the biggest way of distributing cell lines is among academic labs. We all do this. We have a lab in Geneva, and when we wanted to have some cell line, we first asked our colleagues: do you happen to have HeLa? Do you happen to have MCF-7? Can you give us an aliquot? People do that, which is normal, sharing cells between labs. And some labs which have established a line in fact distribute it through their technology transfer office or a biomedical research facility.
But most often this is done in an ad hoc way, and a lot of people are distributing cell lines which they got from a cell collection or from colleagues without really checking if they can do so or have the right to do so. Most of the time they don't, and most of the time people close an eye to this practice. So you have people distributing cell lines which get distributed again and again and again, and you get these chains of cell lines which have gone all over the world from one lab to another. It can be a recipe for disaster, as we'll see later, because what people are using may not be what they think they are using. Now let's look at what those cell line collections offer in terms of information. There are those 50 major entities which were listed two or three slides ago. The problem with those cell collections is not in how they distribute cell lines: they're very good at shipping them, and you get what you want. But they are not really, I would say, versed in good practice in bioinformatics. That means that their catalog, their website, can change from one day to the next. There is no follow-up: a cell line can disappear because they decided not to distribute it anymore, and you won't find it anymore and you won't know why. So something like five years ago I started drafting five minimal requirements for cell collections. They should have one web page for each cell line, instead of sometimes grouping 10 cell lines on one web page. It would be good that the URL they use for that page is based on the catalog number instead of a description which can change over time. And they should make people aware when they have new cell lines or cell lines that they no longer distribute.
And one thing which is very important for human cell lines, and which I'm not going to explain just now because we will go deep into it later: they should make what we call the STR profile available. For the moment let's keep this as just terminology; we will come back to it. As for the current situation, look at this color code from an Excel spreadsheet: green is compliant, orange is more or less compliant and red is not compliant. You can see that the first requirement, having an individual page for each cell line, is quite okay except for some cell collections. As you go on to the further requirements, it is really a problem. And for the fourth one, which is to tell people that they have discontinued a cell line, none of them is really doing it well; they generally hide very well that they stopped distributing a cell line. Now, you can get information on cell lines from cell line collections, but there are also some cell line databases. In addition to the Cellosaurus as a universal resource, there is a database established in Genoa, Italy, called the Cell Line Data Base, CLDB. It's really a high quality database. When I started the Cellosaurus I first thought that I didn't need to do anything, because CLDB was capturing data on cell lines quite nicely. It has one problem: it has not been updated since 2017, and even before that they had not been adding cell lines for a long, long time. So it only has, as you see, almost 7,000 entries describing a little more than 5,400 different cell lines. That's tiny compared to the world of cell lines. I've discussed this with the people who developed it, and it's a classical problem: they had a European grant in the early 2010s, the grant was over, and so they couldn't really continue to develop the database. But there are a number of specialized resources.
One which is really quite good is the Human Pluripotent Stem Cell Registry, which caters for human ESC and iPSC lines. It's very high quality, and it's a registry, so it's based on voluntary registration: people creating new cell lines go to the site and create entries, which is quite nice. They already have quite a number of registered cell lines. The problem is that a lot of people create cell lines and don't bother registering them. Happily, some journals are forcing people who create those types of cell lines to register them. There is also a portal on stem cells in Japan called SKIP. There is a database of cell lines which are used for human leukocyte antigen, or HLA, typing, so that's a very specialized cell line collection. And unfortunately there are a lot of dead resources. There used to be a cell line database on prostate cancer, one on fish, one on insects, and all of those are either completely dead, so you don't see them on the web anymore, or they're still there but have not been updated since they were first put on the web. So there are not a lot of databases specific to cell lines. But you do find a lot of information on cell lines in experimental portals or repositories, and specifically for cancer cell lines. One of the big projects is the Cancer Cell Line Encyclopedia, CCLE, which is now called the Cancer Dependency Map, DepMap, from the Broad Institute. In CCLE and DepMap you find a huge number, now something like 1,500 I think, of cancer cell lines which have been exome sequenced, for which gene fusions and expression by RNA-seq have been annotated, and on which a lot of omics experiments have been carried out. All of the information is of course freely available and downloadable from DepMap.
There is a resource on drug sensitivity in cancer; another resource on cancer cell lines in Texas, at MD Anderson; and the Sanger Institute has a Cancer Cell Line Project, which is a bit the same type of work as is being carried out at the Broad Institute in the DepMap project, but it has some cell lines which are not in DepMap and vice versa. For iPSCs there is one initiative (sorry, my screen went one slide too far): the Human iPSC Initiative, which develops and captures a lot of information on some iPSC lines. In addition to those specific experimental portals which cater for cell lines, a number of integrative portals have been created which integrate omics data from other collections. You have CellMiner, Cell Model Passports, also at the Sanger Institute, the Colorectal Cancer Atlas, and an integrated resource of human cell lines for identification. As you see, they cover varying numbers of cell lines. For the first one I didn't put a number because it's the NCI-60, a set of 60 cell lines. The biggest one is Cell Model Passports, with almost 2,000. Then there is PharmacoDB, which mines pharmacogenomic data sets. What those resources are doing is mining experimental repositories like ArrayExpress, GEO and others for information on cell lines. Most of them again are on cancer cell lines. You also find a lot of information on cell lines in resources which are not cell line specific, like ArrayExpress and GEO: the European Genome-phenome Archive, BioSample, ChEMBL, COSMIC; in C-Clip you have genetic elements; then the FCS-free database, IARC TP53. I'm just going through them, I'm not going to describe them. You can of course get those slides after the talk, so you can go back, look at those databases, Google them and find them.
A lot of them you probably know, like MetaboLights for metabolomics, PRIDE for proteomics and so on. So there you have information on cell lines, but most of these resources are not specific to cell lines, with the exception of a few like the last one, TOKU-E, which has transfection information for cell lines. So I guess you can understand that this is a very fragmented and heterogeneous landscape of resources around cell lines, which leads us to the Cellosaurus. I will first present the Cellosaurus in one slide and then go into more detail on what is in it. As you would guess, it's a knowledge resource on cell lines; otherwise, why would I have been speaking about them for the last 20 minutes? It now has almost 140,000 entries. Its scope is all the different types of cell lines I mentioned before, immortalized or naturally immortal, but it has to be a line. It does not cover primary cells, because those are not cell lines, and it does not cover plant cell lines, which are a very specific set and are not within the scope of the Cellosaurus. So you could see it as covering what are called animal cell lines, from insects to vertebrates, but not plants. It has about 50 different types of information items, which we'll see later, and a lot of literature references and cross-references. It's accessible on the web, as we'll also see, and you can download it in three different formats. Part of the content is loaded into Wikidata; I will describe this after the pause, when I explain how to search in the Cellosaurus. It's part of the Resource Identification Initiative; I will explain what that is because it's quite important. And it's also part of the International Cell Line Authentication Committee; again, I will come back to that. I already told you about the Human Pluripotent Stem Cell Registry. So there is a lot of collaboration with many other resources and collections.
It is what is called an ELIXIR Core Data Resource. ELIXIR, which is a European infrastructure for bioinformatics resources, selects every year a number of resources that it terms Core Data Resources, which are important worldwide in terms of bioinformatics. Last year the Cellosaurus was selected as a Core Data Resource. It's also an IRDiRC Recognized Resource. IRDiRC, which maybe you know less, is an international rare disease organization, and they give a stamp of approval to databases which provide information useful for people working on rare diseases. And of course a lot of cell lines in the Cellosaurus are derived from patients, from fibroblasts or other tissues of patients suffering from a rare disease, whether a rare genetic disease or a rare cancer. OK. Now the history of the Cellosaurus, again in one slide. I already told you that the Cellosaurus started as a controlled vocabulary for cell lines for neXtProt. The first time it was called Cellosaurus and became available, first on the neXtProt FTP site, was in February 2012, so exactly 10 years ago. In fact I suddenly realized that I missed the birthday by a few days; I think it was on the 10th of February, the 10th birthday. At that time it had cross-references to 21 resources, 1,000 literature references and 8 information fields. And then it grew. What was important is that in 2015 it became available on the Expasy server, where you have probably used it: there are other places where you can get it, but the main one is Expasy. And then in 2016 it became the cell line resource for the Resource Identification Initiative, which I will explain.
Still in 2016, it continued to grow: STR profiles, again something I'm going to explain, were introduced, and it became a member of the ICLAC committee, which I will explain. Still in 2016 came a distribution in XML, which is quite important for people wanting to parse and extract information from the Cellosaurus. A paper on the Cellosaurus was published in 2018, and a tool called CLASTR, which I will describe, was developed by Thibault Robin in 2019. In 2021 it became an ELIXIR Core Data Resource and an IRDiRC Recognized Resource. I should also add that at the end of 2021 it became funded by the Swiss Institute of Bioinformatics, which now allocates the equivalent of one developer position; this will allow us to develop quite a number of things in terms of software around the Cellosaurus. Now, how do we get information into the Cellosaurus? Even if I haven't yet described what you find in it, I can already tell you that there are different pipelines, four of them in fact. The first is extracting data from cell line collections which have product pages. I already told you they're not really well organized in terms of bioinformatics, so their pages are generally free text. Some of them are structured, so it's possible to write scripts to extract information. Most of the time, when I contact them, they send me an Excel spreadsheet. So yes, it's possible to get the information and extract it from cell collections, but it's a hassle. The second is extracting data from external bioinformatics resources that cater for specific data relevant to cell lines, like sequence variations, HLA typing and others. That's generally easier once you have defined what you want, because those resources are generally much better organized than cell line collections and you can download files that already exist.
If you take COSMIC, it's possible to download the COSMIC sequence variation file, or the DepMap somatic mutation file and so on, and extract whatever information you need. But one of the main points of entry is in fact manual curation from publications. About 23,000 publications are cited in the Cellosaurus, and every publication which is cited has in fact been curated. It doesn't mean the publication has been read completely: if there is only one small section on cell lines in a publication discussing other things, the rest of the publication was not read, but everything which was on the cell lines, yes. All of them are saved as PDFs so that we can go back and get some information: apart from three or four publications, all of those 23,000 publications have their PDF stored locally. And one thing which is quite nice is that we get more and more submissions of information on cell lines. I saw, as we were waiting for the course to start, that we have somebody here, Petar from Croatia (I hope I'm pronouncing your name correctly), and I saw that you have an abstract on the generation of knockout melanoma cell lines. It would be nice to get some information, even pre-publication, on those knockout melanoma cell lines. So we get more and more submissions, that's nice, and if there are people here who have information, please do not hesitate to submit it. Now, still on manual curation: where do we find the papers? That's not so complex. It's either papers cited in cell line distributors' documentation, PubMed and Google searches, or Google Scholar alerts. And for those of you who are bioinformaticians, you may know that PubMed has something called LitSuggest, which is a nice AI-driven system where you give examples of articles you're interested in.
It basically takes a machine learning approach, and every week gives you a list of publications which seem relevant to the criteria you gave it. It works well. Then of course we retrieve the full text; as I said before, we retrieved more than 99% of the papers that are cited and stored them locally. And then of course there is the reading part. In a lot of cases it also means contacting authors to clarify points which are not clear in the publication or to ask for additional information. It is quite difficult to get a response: I pester people up to three or four times and then stop, and even after three emails I generally do not get over 50% response. OK, let's dive into what you have in the Cellosaurus, and we start with names and identifiers. Every database needs to have names and identifiers, and this leads to a problem with cell line names. So we'll see names, synonyms and misspellings; we'll see accession numbers; and here I will explain what I already mentioned twice, the Resource Identification Initiative, and something called RRIDs. So, names. The big problem is that short cell line names are a disaster, which doesn't stop people from using them. Here you have 10 names which are used by 37 different cell lines, and this is in fact only the tip of the iceberg: a great many cell lines share either a name or a synonym with another cell line. It's really problematic, because if you just see the name in a paper without knowing the context, and sometimes even if you know the context, it can be difficult to know which cell line is being described. The other problem is that when people propose longer names, researchers will sometimes abbreviate them. For example, you have a cell line called Hep-G2/C3A, which is OK. And then of course everyone starts calling it C3A, which is not OK because it's too short. It gets confused with other cell lines.
It gets confused with a gene name and so on. You could say that it would be nice to have nomenclature rules. Two have been proposed: one for insect cell lines in the 1970s, which is not really followed, and one for pluripotent stem cell lines in 2018, which is starting to be followed. But mostly people do what they want, and it's a disaster. And in addition to short names, there are a lot of misspellings in the literature. Now, what do we capture inside the Cellosaurus? As the name of the cell line we capture the one assigned by its creator. Generally that's the one which was first published, except in some rare cases where that name was never used again and another name was used instead. Also, if the name is problematic, if it's really too short, we try to use a synonym which is a bit longer as the recommended name for that cell line. Casing is important: you can have two cell lines whose names are spelled with the same letters, like Bob and BOB, but they're considered different names because the casing is different. Because, as I was telling you, some cell lines share exactly the same name with others, what we're obliged to do in the Cellosaurus is to postfix identical names with a small textual description. Here you see two cell lines which are both called AML14. One is a human leukemia cell line and the other one is a mouse liver cell line. They don't have any synonyms which would allow them to be distinguished, so we have to postfix the name with a small description, which allows people to immediately see that this one is AML14 in the context of human leukemia and the other one is from mouse liver. We also try to capture as many synonyms as possible, and here too casing and punctuation are important. We capture abbreviations, but also expansions of the name.
So here you have HeLa, MCF-7 and MCF7. An expansion explains why a cell line is called what it is: MCF-7 is Michigan Cancer Foundation 7, HeLa stands for Henrietta Lacks cells, and SU-DHL-4 is Stanford University Diffuse Histiocytic Lymphoma 4. Here you see many different variations on SU-DHL-4, even going as far as SU-4 or DHL-4 and so on. All these are captured as synonyms, and misspellings are captured also. We record a misspelling and we record whether it's specific to a publication or a database entry: here, this misspelling is in this publication, and this misspelling is in COSMIC, with the accession number of the COSMIC entry. In some cases we made the mistake in the Cellosaurus ourselves, so we indicate how we misspelled the cell line name and from which release to which release it was misspelled. Continuing on names and identifiers: every entry has a unique, stable identifier, an accession string in the format CVCL followed by four alphanumeric characters, each a letter or a number. Some people ask: why CVCL? It's because it used to be a CV for cell lines, a controlled vocabulary for cell lines. Of course it's much more than that now, but at the time it was logical to have CVCL. Sometimes you need to merge two or more entries when you see they are the same cell line. It happens that, looking at publications, you don't see that two cell lines are identical because they're using different names. And then someone tells you, or you find a publication which says: oh, by the way, the cell line that we named such and such was in fact named something else in other publications. In this case you have secondary accession numbers. And it happens, quite rarely, that an entry is deleted, if it's not a cell line or it basically never existed.
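To make the accession format concrete, here is a minimal Python sketch that checks whether a string looks like a Cellosaurus accession and resolves a secondary accession to its primary one. It assumes accessions are written with an underscore after the prefix, as in CVCL_0030 (HeLa). The function names and the lookup table are hypothetical, and the secondary accession in the table is invented purely for illustration.

```python
import re

# Assumed shape: the prefix "CVCL_" followed by exactly four characters,
# each an uppercase letter or a digit.
ACCESSION_PATTERN = re.compile(r"CVCL_[0-9A-Z]{4}")

# Hypothetical merge table: secondary accession -> primary accession.
# "CVCL_ZZZ1" is an invented value, not a real Cellosaurus entry.
SECONDARY_TO_PRIMARY = {"CVCL_ZZZ1": "CVCL_0030"}


def is_valid_accession(text: str) -> bool:
    """Return True if the whole string has the assumed accession shape."""
    return ACCESSION_PATTERN.fullmatch(text) is not None


def resolve_accession(accession: str) -> str:
    """Map a secondary accession to its primary one; primaries pass through."""
    if not is_valid_accession(accession):
        raise ValueError(f"not a well-formed accession: {accession!r}")
    return SECONDARY_TO_PRIMARY.get(accession, accession)
```

Keeping merged entries reachable through a lookup like this is one simple way to honor the stability of identifiers: an accession that was cited before a merge still leads to the surviving entry.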
Now, I promised to explain to you what the Resource Identification Initiative is. It was introduced, I would say, eight or nine years ago: the concept of having research resource identifiers. The idea is that each resource which is used experimentally, like antibodies, cell lines, organisms, or even things like software tools, has a unique, stable identifier, which is called an RRID. And if people start citing these in their papers, then others can reproduce experiments much more easily. So there are identifiers like this for antibodies, for organisms, for strains, and for cell lines it is the Cellosaurus which has been selected by the Resource Identification Initiative as the resource for RRIDs. What you see in papers is that when people cite those RRIDs, they prefix them with RRID, followed by the accession number in the database which is used as the resource for that identifier. So for example, when it's an antibody, it uses the Antibody Registry, so it's prefixed by AB, and for a cell line it's CVCL, the Cellosaurus accession number. And you can search and display all those RRIDs at the portal, which is on the SciCrunch site.
So here is an extract, not of an abstract, sorry, but of the methods part of a paper in the journal eLife, which correctly indicates which cell lines were used. It says the RAW 264.7 mouse macrophage cell line, gives the RRID, obtained from ATCC, and so on. So it explains which cell lines they have used and gives the catalog number or not, but at least the place where they got it, like ATCC, or a person, like this one which was provided by Joan Massagué from Memorial Sloan Kettering. And you have, each time, the accession number from the Cellosaurus.
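As a small illustration of the citation convention described above, here is a hedged Python sketch that pulls RRIDs out of a methods paragraph and keeps those that point to cell lines. The regular expression is an assumption about how RRIDs are typically written (`RRID:` followed by a registry prefix, an underscore and an identifier); the example sentence and both identifiers in it are placeholders, not a quote from any paper.

```python
import re

# Assumed citation shape: "RRID:" + registry prefix + "_" + identifier,
# e.g. a CVCL_ accession for cell lines or an AB_ number for antibodies.
RRID_RE = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9]+)")


def extract_rrids(text: str) -> list[str]:
    """Return every RRID found in the text, in order of appearance."""
    return RRID_RE.findall(text)


def cell_line_rrids(text: str) -> list[str]:
    """Keep only the RRIDs that use the cell line (CVCL_) prefix."""
    return [rrid for rrid in extract_rrids(text) if rrid.startswith("CVCL_")]


# Placeholder sentence with made-up identifiers, for illustration only.
METHODS = (
    "Cell line X (RRID:CVCL_0000) was obtained from a cell line collection; "
    "staining used antibody Y (RRID:AB_0000000)."
)

print(cell_line_rrids(METHODS))
```

The same approach extends to other registries by filtering on a different prefix.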
And of course, if you were looking at monoclonal antibodies, you would have the RRIDs for monoclonal antibodies and so on. So that's quite useful, but it's useful only if people are using it, of course. If there are RRIDs and people are not using them, it's basically failing. I'm optimistic that it's working, because you see here that we started being recognized as the resource for RRIDs for cell lines in late 2016, and the number of papers which cite RRIDs for cell lines, meaning Cellosaurus accession numbers, is increasing every year. It's now almost 2,000. It was 1,900-something in 2021, and I guess in 2022 it will be more than 2,000 papers. It's tiny compared to the number of papers which are using cell lines, but the trend is going in the right direction, which is nice to see.
Now, it's important to say where a cell line comes from. What we mean by this is which species it comes from; if it's a non-human cell line, which breed or subspecies; if it's a human cell line, which population; and for human cell lines we also sometimes have information on the genomic ancestry, which I will explain, as well as HLA typing, the sex of the donor, the age at sampling and the anatomical origin. Okay, let's go into all of those.
Species of origin is easy. We indicate it using the NCBI tax ID. So if you have a human cell line, you have the tax ID of Homo sapiens. Now, you have some cell lines which are hybrids, one big category being hybridomas, which can have more than one species of origin, because they're hybrids of more than one cell and of course the cells can come from more than one species. We have about 900 entries which come from two species, and even a handful, nine entries, which have three species. In this case, you have as many OX lines as you have species. In a few cases, we don't know which species the cell line comes from.
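To show what "as many OX lines as you have species" means in practice, here is a minimal Python sketch that collects the species of origin from an entry. It assumes an OX line has the layout `OX   NCBI_TaxID=<id>; ! <name>`; that layout is an assumption about the text format rather than something shown in the talk, and the toy hybridoma entry below is invented.

```python
import re

# Assumed line layout: "OX   NCBI_TaxID=10090; ! Mus musculus"
OX_RE = re.compile(r"^OX\s+NCBI_TaxID=(\d+);\s*!\s*(.+)$")


def species_of_origin(entry_text: str) -> list[tuple[int, str]]:
    """Return one (tax_id, name) pair per OX line of an entry."""
    species = []
    for line in entry_text.splitlines():
        match = OX_RE.match(line)
        if match:
            species.append((int(match.group(1)), match.group(2).strip()))
    return species


# Invented two-species hybridoma entry, for illustration only.
TOY_ENTRY = """\
ID   Example hybridoma
OX   NCBI_TaxID=10090; ! Mus musculus
OX   NCBI_TaxID=10116; ! Rattus norvegicus
CA   Hybridoma
"""

origins = species_of_origin(TOY_ENTRY)
print(len(origins), origins)
```

A hybrid cell line is then simply an entry for which this list has more than one element.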
When the species is unknown, we use the tax ID for "unidentified", which is defined at NCBI for when people don't know what species they're using. So for species, there are not really any issues.
Breed and subspecies: this is difficult to standardize, so it's free-text information. It's sometimes difficult to know which breed of animal or insect, or which substrain or subspecies, has been used. It's not always indicated. Currently we have about 820 different values for almost 20,000 animal cell lines. Here I give you five examples of what I mean: one for mouse, C57BL/6; one for dog; one for bovine, a cross of Hereford with Limousin; a zebrafish which glows in UV light, GloFish; and the famous Drosophila strain, Oregon-R. There is no database for really capturing this, so this, as I said, is free text.
Now, population information. It's a minefield, because people sometimes speak of population, and some cell line collections or databases speak of ethnicity. Of course, nobody uses the word race, but if you go to old papers, this was listed: you had lists of cell lines saying which race each was from, white, black and so on. And at first I thought that in Cellosaurus we should not really note down which population a cell line comes from. But in fact there is this bias that a lot of research, or medical research, is done on what people now call Caucasian, European populations, which means that there is not enough data on other populations. Thus it's useful to note down which cell lines come from different types of populations, so that people have access to models which are not the standard WASP, white, Anglo-American type of population. So this is now captured in Cellosaurus with a number of terms which are used by a lot of different databases, like Caucasian and so on. But it's quite difficult, because those definitions are really vague. And of course, when people have a mixed background from a lot of different populations, it becomes a nightmare.
But still you have information like this, "African, Somali" or "Caucasian, Swiss" and so on, which is put in the cell line entry if it's recorded. We don't hunt specifically for this information. If people have provided it, we put it, but otherwise we're not going to ask for it. Now about 35% of human cell line entries have this information. I don't think it's going to grow a lot more, because in a lot of cases this is not indicated, and that's fine. You already have 35% with this population information.
There is a group in the US, in Puerto Rico and in California, led by Julie Dutil, which has analyzed cancer cell lines to look at the ancestry of each cell line in terms of seven populations. They took 1,400 cell lines with exome data and compared them with the 1000 Genomes Project, where the donor population is known, because by definition the population those people came from was recorded. From this, they consider seven ancestral populations, then they run an admixture computation and give you this type of information: for the HeLa cell line, it's 64.74% African, 0.77% Native American and so on. So you have the genome ancestry of this cell line based on the 1000 Genomes Project.
Now, we also capture, if it's available, information on HLA typing. I'm not going to go into a lot of detail, because this is very specific, and people who work on it know that we have this information. It is shown here: HLA typing with HLA class I or class II and so on, which is quite complex for people who don't know what it is, but people who work on HLA typing know exactly what I mean by this.
Now, one thing which is of course very important is to define which sex a cell line comes from.
So we have a controlled vocabulary with only five terms: female; male; mixed sex, when more than one animal was used to create a cell line; sex ambiguous, for donors with conditions such as an extra X chromosome or an extra Y chromosome, and all of the other sex chromosome anomalies; and, because for a lot of cell lines we don't know, sex unspecified. In some cases we may learn later which sex the cell line comes from, but for some of them we will never know. And of course hybrid cell lines and hybridomas are not annotated with this information, because it would be quite complex: as a hybrid comes from two different cells, the sex of a hybrid cell line is a little bit problematic in terms of definition.
Age at sampling: it is quite useful to know at which age the cell line was derived. It can be a precise value in days, weeks, months or years, or a range, like someone between 50 and 55 years, and it can be a developmental stage. In some cases, especially for animals like insects, it will say fifth instar larval stage, or lamb for sheep and so on. So for the moment it's text, and if it's not known it's "age unspecified". Again, hybrid cell lines and hybridomas are not annotated. About 90% of the entries contain information on age at sampling. And just for your knowledge, you can use it as an anecdote: the oldest donor registered inside Cellosaurus for a human cell line was 114 years old when the cell line was created. It's a series of cell lines from Sardinian centenarians, and the oldest one was 114 years old.
Now, the last thing about where cell lines come from is anatomical information. There are two comment fields, called "Derived from sampling site" and "Derived from metastatic site".
It used to be like this, with just text, like "caudal fin" or "bone, mandible", with information on organ and tissue but not on the cell type, and no ontology links. Now the first thing is to retrofit this textual information in all of the entries where we can. That is now about 70 to 80% done, which will be in the next release. The next step, which I will describe later, is to link this textual information to the Uberon anatomy ontology and also to the Cell Ontology, not the Cell Line Ontology, CLO, but CL, which describes the cell type. So, to take the example of this "bone, jaw, mandible" term, it will be linked to the Uberon term for mandible, and if the cell line comes from an osteoblast, for example, it will also be linked in the cell type ontology, CL, to the term for osteoblast. Okay, so this is a work in progress.
Now, something very important is cell line integrity, and this is the issue of cell line contamination or misidentification. I will speak about ICLAC, about the annotation of problematic cell lines, and tell you what an STR profile is and what you can do with it in Cellosaurus. Cell line contamination is a huge problem. Many studies have used the wrong cells. These are titles of articles; it's not me who is saying this. They are titles of articles which say things like the costs of using unauthenticated cell lines, it's costly, it's the dirty little secret of cancer research, and so on.
People estimate that about 30% of the cancer cell lines used in the literature are contaminated, which means that people are using a cell line, but it's not the cell line they think they are using. Sometimes it's not a big deal and sometimes it can be catastrophic. What I mean by this is: let's say you're using a cell line just to express a protein. It doesn't matter if the cell line comes from breast or prostate, as long as you manage to express or overexpress your protein. If you are using it as a factory to do something, it doesn't matter. But if you're using a cell line because you're studying prostate cancer, and you're actually using a breast cancer cell line, then you're in deep trouble. So it can go from not really important to your papers having to be invalidated completely.
A few definitions of what we mean by contaminated. You can have a contaminated cell line because a foreign cell line has taken over your culture: you had a breast cancer cell line, and suddenly a prostate cancer cell line, which you have somewhere else in the lab, started growing in your Petri dish, and you basically lost your breast cancer cell line to the aggressive growth of this other cell line. Or you can have contamination by microorganisms, whether it's mycoplasma or fungi. This I'm not going to describe, because it is another problem. It doesn't change which cell line you're using compared with what you think you're using. It just makes your life a bit difficult, because you need to clean up your cell line and suppress this contamination. It's a problem for people growing cell lines, but it's not a problem in terms of papers and results.
Now, a misidentified cell line is when you think you have a cell line from one species or sex, but it's not the right one: you think you got your cell line from a dog, but it's not from a dog, it's from a cat or something like that. And in some cases you have what we call misclassified cell lines, where it is the tissue or even the disease that is wrong: you
thought you had a cell from one tissue, and it's from another one. Here you have a joke which was inside one of the tips articles 10 years ago: yes, when people suddenly see that they have a Y chromosome in what should be a HeLa cell, which is a female cell line, they are in deep trouble. This is something which happens quite a lot, and there are papers reporting it. When it's found out, people generally write articles about it, and here you see three different examples: a bladder cancer cell line which is not a bladder cancer cell line but is HeLa; somebody who thought they had a cell line from one monkey, an aethiops, but it's not from that monkey; and someone who thought they had a cell line from one cancer, and it was another cancer. Three different types of problems, all of which can be really bad.
So a few years ago, well, 10 years ago in fact, at almost the same time as Cellosaurus started, the International Cell Line Authentication Committee, ICLAC, was founded. It was founded by people working in cell line collections who were involved in trying to clean up this problem, to really promote the work of finding those errors in cell lines, make them more visible, and fight against this problem of problematic cell lines. So they created a register of misidentified cell lines, and each candidate for this register is carefully appraised by the committee. They go back and ask people to go back to the freezer, to a 20-year-old or sometimes even 30-year-old sample, and reanalyze frozen samples to see what happened. It's basically detective work, and one of the people doing this is Amanda Capes-Davis, whose moniker on Twitter is "cell detective". If you look up cell detective, you'll see that there's a lot of information about this. She is one of the people who have been involved in ICLAC since the beginning. Now, a lot of the new candidates are identified through the process of curating cell line information in Cellosaurus, but that
doesn't change the process: once those candidates go to ICLAC, they need to be studied. So you have the ICLAC website, where you can find those problematic cell lines, and in Cellosaurus you have a comment field called "Problematic cell line". Here are examples: contaminated, shown to be a HeLa derivative; misidentified, originally thought to be of human origin but found to be from pig; misclassified, originally thought to originate from a neuroblastoma but is from an Ewing sarcoma; and so on. You also have this information for cell lines which are derived from a problematic cell line. Sometimes people take a problematic cell line, they don't know it's problematic, and they derive a new cell line from it. But if the parent is already contaminated, it continues to be contaminated, so contamination goes from parent to child, and you need to indicate it, like here: contaminated, the grandparent cell line KB has been shown to be a HeLa derivative. All of the problematic cell lines are linked, in the registration comment, to ICLAC if there is an entry there, with the number in the ICLAC register. There are a little over one thousand cell lines which are noted as problematic, and you can browse them through the main menu of Cellosaurus on Expasy: there is "Browse problematic". When you go to one of those cell lines, there is red text which tells you that it is problematic, so you cannot miss that it's a problematic cell line.
Now, what can we do to stop this from happening? What people recognized almost 20 years ago is that there are loci in the human genome which are highly polymorphic. They are small repeats of DNA which can be sequenced quite easily and which vary a lot across populations, or even between individuals of the same family. Of course, if people are from the same family there will be similarity, but it drifts and you see changes from one person to another. Those loci are used in forensic identification and paternity testing, and in fact
they can be used to ensure the quality and integrity of human cell lines. So in 2011 a first standard was published, which said that you could use eight of those loci on the human genome plus one; I will explain the plus one afterwards. And last year there was a revision of this ANSI standard, with 13 loci plus one. You can buy kits to test your cell lines. Generally those kits have 18 loci, because that's what is used by the FBI and for paternity testing, so people are all using the same type of test. Those loci have very barbaric names, which are positions inside the human genome, like on chromosome 13, and sometimes the name is a gene name or that of a nearby gene: you can recognize CSF1 and TPO and so on, or von Willebrand factor. One of them is special, because it's supposed to tell whether the cell line is from a female or a male: it's a gene found on the X chromosome and on the Y chromosome, so you see X only for a female, and X and Y for a male. Unfortunately it doesn't work well, because the Y chromosome can be lost in a lot of cell lines. And people have now developed STR markers for mouse and dog cell lines.
So what can people do? They can do it themselves or send it to a company. They give the cell line to the company, the company extracts the DNA, runs the kit with primers for those different loci, and gives you the results: which alleles are found, meaning the number of repeats for each allele, for example 15 and 16 at a given locus, and so on. You can have one or two, of course, because people have two chromosomes, but if it's a cancer cell line you can have even more, because there can be a lot of chromosomal duplications. But generally that's what you get with a normal cell line: either one peak or two peaks, depending on whether the person is heterozygous at that locus. Once you have them, you can use them to compare with other cell lines. The problem is that a lot of companies keep their database of STR profiles private, so what happens
is that in Cellosaurus we capture this information if it's available in publications or in cell line collections. We now have almost 8,000 cell lines which have this information. Most of them are human, because mouse and dog STR profiling is only starting, but this will increase, especially for mouse. We indicate where we got this information from. It can be one source or a lot of different sources: here we got the STR profile of this cell line from seven or eight different places, publications and cell line databases. And we annotate conflicts between different sources. What I mean is this: here you have a cell line with five different sources, four cell line collections and the COSMIC Cell Lines Project, which have all done the STR profile of this cell line. They agree on seven of the loci, but for amelogenin one says X,Y and the others say X only. Again, as I told you, it's because cell lines often lose the Y chromosome. And here RCB, a Japanese cell line collection, thinks that there are three peaks at this position, while the three others think there are two peaks. It can be an error, or it can be just a slight variation because the cell line has been grown for longer, for example here in the Japanese collection, while the other collections have grown it less, and this one has acquired a new mutation. Sometimes an allele can also be lost. So sometimes you have a small variation, and sometimes it can be errors. What can you do with this? We'll see later, after the pause, how you can search with those STR profiles using a tool which is called CLASTR.
Let's change subject and go on to the different properties of cell lines that you can find. Here I list six different categories of information, which I will describe. I'm just looking at the time to see where I am before the pause, and maybe I should slow down a little bit so as not to do things too fast. Now, in terms of those cell line properties, the first one is easy to explain: it's the category. We define 14
categories of cell lines: transformed, cancer, embryonic stem cell, hybridoma; you recognize a number of different types of cell line. Each cell line is currently assigned to only one category, which can be a bit of an issue, because you have cell lines which are both transformed and telomerase immortalized. So this may change in the future, but currently it's easy: a cell line is assigned to one of those 14 categories. In terms of statistics, you can see that the biggest set of cell lines is transformed, followed by cancer, embryonic stem cell and induced pluripotent stem cell. This last one is growing quite nicely. It will overtake embryonic stem cells very quickly, and it will even go above cancer cell lines in a number of years, because everyone is creating induced pluripotent stem cells. Now, why is transformed cell line such a big category? Because you can take blood, take the B lymphocytes and use EBV to immortalize them. It's a nice way of immortalizing cells from patients, and people who are working on genetic diseases are creating a lot of those cell lines. At Coriell, the institute I was mentioning, which has the biggest cell line collection, out of the 40,000 cell lines I would say 30,000 are EBV-transformed. So that's the reason why you have so many transformed cell lines.
Now, characteristics and biotechnology. This is currently free text, a little bit of text describing the cell line, like "used for the study of drug delivery" and so on. It's only taken from publications or from cell line collections. It's not meant to be a long text, just one or two lines telling you something about the cell line. Sometimes it's about the biotechnological use of the cell line, like MRC-5, which is used for the production of a number of diploid cell vaccines. If you were to go to the Vero cell line, you would find information, for example, on the fact that it's used for growing SARS-CoV-2 to produce a vaccine.
Now, another field is doubling time. This is captured mostly from publications, sometimes from cell line collections, and it's basically
capturing what people say is the doubling time of the cell line. Sometimes there is additional information, like the fact that it was recorded at the seventh passage. Sometimes the medium is important for the doubling time, or also the temperature, like here, where it's 21 days of doubling time at 4 degrees Celsius and 84 days at 1 degree Celsius. About 7,000 cell lines contain such information. A lot of people don't record this when they publish information on a cell line.
Now, karyotypic information. This is a recent free-text comment line to capture information on the karyotype of a cell line. We don't want to capture all of the karyotypic information, but some noteworthy features. So you have things like "has lost chromosome Y", or like this one, which has a hyperpentaploid karyotype, with almost twice as many chromosomes as its parental cell line, and so on. So it's information which can be useful in terms of the karyotype.
Microsatellite instability is more important for colorectal cancer cell lines, but also for other types of cancer. This uses a controlled vocabulary, because a cell line can be called stable, instable low (MSI-low) or instable high (MSI-high). So you have this information and where it was obtained from. There are only 1,400 entries with such information, because it mainly concerns colorectal and a few other cancer cell lines, but not all types of cancer.
Okay, let's go to another set of fields, which has to do with whether the cell line is engineered: whether it's transformed, whether it's transfected or transduced, whether there are knockouts in the cell line, whether it's selected for resistance, and other types of engineering. As I said before, you can transform a cell line in a variety of ways. I told you about using EBV, but people also use SV40, so viruses, and you can use chemical carcinogens, or radiation, UV or cobalt radiation and so on, to immortalize a cell. We try to quickly indicate which
transformation method was used, using a number of databases, because things like radiation are not found in the same database as chemicals and viruses. So you have things like ChEBI for chemicals, which says that this chemical, which is a mutagenic chemical, has been used; or it can be the tax ID of a species, like Epstein-Barr virus; or an entry in the NCI Thesaurus for radiation and so on. So there are three different databases for the different types of methods. And this is used both for artificially transformed cell lines and for cancer cell lines that have arisen through viral carcinogenesis.
Now, transfection, either using a chemical carrier or a viral vector. The field is called "transfected", even though the cell line can be either transfected or transduced, and again you have a database, an accession and a description. Here you have an entry for a cell line which has been transfected with KRAS, but with a mutation in KRAS. Here it's transfected with a mouse gene. Here it's transfected with a gene with codon usage optimized for mammalian expression. For the moment we're using "transfected" for both transfection and transduction, so if you have an idea of a more appropriate term, we would welcome a term which could replace "transfected" and make it more obvious that both types of things can be done.
Now, another thing is knockout. You can knock out a gene, and again I'm not going to go into a lot of detail, but you have the method, CRISPR, TALEN, KO mouse, and which gene has been knocked out inside the cell line. You can also select for resistance. Again, it's the same type of format: a database, like ChEBI for drugs, or DrugBank for big drugs like monoclonal antibodies, because you have a number of cancer cell lines which have been selected because they're resistant to monoclonal antibodies used as anti-cancer drugs. And sometimes it can even be resistance against a toxin which is a protein, so it's selected for resistance to a UniProt entry; we're using a UniProt accession because
it's a toxin. And it can even be radiation also. So we have, as you see, about 13,000 cell lines, with 235 different compounds which are used to select the cell lines. Also, and we will describe this later, you can have cell lines where one position in a sequence has been edited, especially using CRISPR/Cas9. This we indicate in the sequence variation field, where something can be either edited or even corrected. If there was a mutation, for example in a cell line from a patient, you can have this mutation corrected using CRISPR/Cas9 or a zinc finger nuclease and so on. I will come back to this when we speak about sequence variation.
Cell line groups and panels: I will go quickly, it's simple. The idea of groups is that we group together some cell lines so that they can be retrieved easily without doing complex queries. We have groups for a lot of taxonomic groups, like amphibian, bird, cetacean, crustacean, fish, insect and so on. So if you want all fish cell lines, you don't need to think, oh, what is fish in the NCBI taxonomy, and try to do a very complex query, which you currently cannot do anywhere, I will come back to that. You can just ask for the group of fish cell lines. And you have other groups which are more specific to the use of a cell line, like a new group which was introduced a year ago for cell lines which are used in the context of SARS-CoV-2 research, and even some very specific categories, like cell lines which have been flown in space, or those which are from species or breeds which are endangered, and so on. Endangered species is going to be something more and more useful, because people are more and more making iPSCs from species which are threatened with extinction. Here, for example, is a cell line which was established from a monkey of an endangered species which lived in San Diego Zoo and died. And you can browse inside the website by cell line group.
And the same thing goes for cell line panels, which are groups of cell lines that have been defined by different scientific institutions, like the NCI-60 cancer cell line panel from the National Cancer Institute. We have identified almost 110 cell line panels, and again you can query them: it's "browse by cell line panels", and then you get an alphabetical list of the 110; here it's only the beginning of the alphabetical order. And you can say, oh, I want all of the cell lines of the 90+ collection, a collection of cell lines from people above 90 years old, or the CEPH, the center for the study of human polymorphism in Paris, Amish pedigree cell line collection, or the Canadian Alzheimer disease kindred sub-collection and so on.
Now let's go to something quite important, which is disease information. I will stop just after that and do the pause, even though we will still have one or two things to describe in terms of contents, but I'm sure you're all tired, I'm tired, and I need some water also. So we'll stop after those nine slides here: annotating disease with NCIt and Orphanet, what the NCIt Cellosaurus value set is, and how we annotate sequence variation.
Okay, so disease. I told you already that a lot of cell lines come from diseased individuals. To annotate which disease the donor of the cell line was suffering from, we use either NCIt or Orphanet, and if a term exists in both, we use both inside the cell line entry. People ask, why did we use NCIt? Because to our knowledge it's the only disease ontology which caters not only for human disease but also for non-human disease, and we wanted to have terms for dog, cat, rat, whatever you want, for cancer or other types of disease. And the NCIt team is extremely helpful and responsive in creating new terms when a term is missing. They have created almost 500 terms for the Cellosaurus, just because those terms were not yet in NCIt. And Orphanet, because it's
basically for human rare diseases, and it's the standard of annotation for human rare diseases now. So we annotate to both of them. Of course, a lot of cell lines do not have Orphanet terms, either because they're not human cell lines, so they cannot have one, or because it's not a rare disease. If you have a cell line from a common cancer, you won't have an Orphanet term, because lung cancer, unfortunately, is not a rare disease. So you will have Orphanet only for rare cancers in human. Half of the Cellosaurus entries are annotated with at least one disease term, which is normal, because we have so many cancer cell lines and so many rare disease cell lines. We use over 2,000 NCIt terms and about 1,200 Orphanet terms in Cellosaurus.
Now, we currently don't annotate cell lines arising from carriers of a genetic disease with disease terms. What I mean by this: if somebody is suffering from cystic fibrosis, yes, the cell line will be annotated with cystic fibrosis in Orphanet and in NCIt. But if it's a cell line from a parent of that patient, who has one copy of the gene with a defect but no sign of the disease, we don't currently annotate the disease term. This may change, and we may completely change the way we annotate disease, by annotating whether somebody is a carrier or something like that. Also, if a cell line originates from non-cancerous tissue in a cancer patient, we don't annotate it with the disease term, but that's normal. If you have a lung cancer patient and you take his or her skin cells, there's no reason to annotate them as lung cancer, because the skin cells are not touched by it, except of course if there is a metastatic tumor. And cell lines which are edited to correct the disease-causing mutation are still annotated with the disease term, even though they have lost the disease. This may change too.
Now, in terms of what I was telling you about the NCI disease terminology, what happens is that when NCI collaborates with a group, they create what
they call a value set. They call it the Cellosaurus disease terminology; that's not quite correct, because it's not the terminology of the Cellosaurus, it's the subset of NCIt which is used in Cellosaurus. But it means that from the NCI you can download all the NCIt terms which are used in the Cellosaurus.
Now an important thing, and we still have four slides before the pause: sequence variations, whether it's a genetic disease-causing mutation, an important oncogenic somatic mutation, a gene deletion, an amplification or a gene fusion. We try to capture these using the HGVS nomenclature, and when a variation exists in ClinVar, which is the case for about 50% of the variants we describe, we link back to the ClinVar entry. Here you can see a mutation in the gene B2M with the HGNC cross-reference and the mutation description; it's homozygous, it's from a paper, and here there is no cross-reference to ClinVar because this mutation is not in ClinVar. This one, for example, in TP53, has a ClinVar link (sorry, I did not mean to click on it), and you basically have the link to ClinVar. Here you see a gene fusion between two genes (oops, I should not click).
So basically you have different types of sequence variation which can be annotated inside Cellosaurus. They belong to four categories: amplification, deletion, fusion and mutation, mutation being the biggest set of variants annotated currently. Here are examples for each of them. A gene amplification: here it's in synuclein, which is amplified in Parkinson disease patients. A deletion: a quite common one, in a CDK-related cell cycle gene which is quite often deleted in cancer cell lines. A gene fusion: one of the most well-known, BCR-ABL, the Philadelphia chromosome gene fusion. And a mutation, a normal, I would say standard, mutation: here it's one base pair, but it can be many base pairs, with the zygosity indicated. Now about 20% of the entries contain at least one sequence variation; it's about 6,500 variations on 1,200 genes. Most of them, 90%, are on human cell lines
because that is what people are of course most interested in. And it's a work in progress, because we have not yet caught up with retrofitting cancer cell lines with their oncogenic mutations.
Now I will try to go quickly through some things and more slowly through things which are more important. The first thing here will be fast because it is on hybridomas. I think a lot of you are using monoclonal antibodies, but you're not really working with the hybridomas producing them: you buy the monoclonal antibody, not the cell producing it. What we annotate is the isotype of the antibody which is produced by the hybridoma, and its target. The isotype is information like the type of immunoglobulin and the type of light chain, like here IgG1 kappa, IgG2a, IgM lambda. The target, which is of course the important part, is cross-referenced to either ChEBI or UniProt if the molecular entity is well defined: here, for example, the hybridoma produces a monoclonal antibody recognizing this human protein, with a UniProt accession, or you have a ChEBI accession if it's a chemical. Or it can be free text when the molecular target of the monoclonal antibody is not defined. So either structured or free text, and now there are about 7,000 hybridomas with target information.
Okay, here I will go quickly through a number of miscellaneous fields, some more important than others. Registration is to indicate in which type of official register the cell line has been entered, like the ICLAC register of cell lines which are contaminated. Maybe some of you work on patents, and you know that if you patent a biological object, whether it's a bacterium or a cell line, you need to deposit it into an international depositary authority like ATCC in the US; there are different authorities in different countries, and they give you a registration number. For registration, we have in
Switzerland, for embryonic stem cells, a registry from the Swiss Confederation of the cell lines which can be used in research, and so on; you see different types of registries.
Another field is discontinuation, and that's important: it tells you that a cell line is no longer distributed by a cell line collection or company. We try to indicate this, and where you have the cross-references to the cell line collections, you can see ones which are marked as discontinued, so you see that the cell line used to be sold by ATCC or by ICLC but is not anymore. Now, as I said, companies and cell line collections don't provide this information themselves, so it's really a nightmare: we have to email them asking, can you confirm that you are really not distributing this cell line anymore? And sometimes the answer is, you know, we're still distributing it but we forgot it from our catalogue, or yes, we have not been distributing it for the last two years, and so on. So I think Cellosaurus is in fact currently the only place in the world where you can find this information, because nobody else is keeping track of discontinued cell lines. As you see there is a lot of discontinuation information; it doesn't mean 8,400 discontinued cell lines, because, as you see, one cell line can have been discontinued by two cell line collections and not by others.
Anecdotes: as the name says, it's anecdotes, and we have too many things to see today, so I won't go through them; you can look at this in the presentation. Basically, sometimes when reading papers I stumble on things which are anecdotal, and I feel the urge to report them because I find them interesting. It would otherwise be too dry to report only things which are scientifically important; sometimes the history of science or the sociology of science is interesting.
Caution covers all kinds of issues which do not directly affect the integrity of the cell line, otherwise it would be a problematic cell line. It's in the caution field that you would find the example from the question which was asked: TP53 mutation not in the same
place. I forgot to give this example: a TP53 mutation indicated incorrectly. Here we know that it was incorrect, but in some cases we would just say that there is a conflict between two publications. Here it's known that this publication had an incorrect mutation. Or here, the cell line was indicated to be from a three-year-old female patient in one place and from a nine-month-old in table 1 of the same paper, so it's not the same age depending on the part of the paper. We try to get these resolved, but sometimes they stay. And for this one, we don't know if the cell line name SD4 is a misspelling of SO4, with the same type of properties and description, but these are all papers from the 1990s and no group is working on these cell lines anymore.
The omics field indicates what type of omics experiments were carried out on a cell line, like proteomics, transcriptomics or SNP arrays; a lot of different types of experiments can be carried out, and complete genome sequencing of the cell line can also be the case. And finally the entry history tells you when the entry was created, when it was last updated, and the number of times it was updated. Some cell lines are updated at almost every release, because I keep updating them, and some much less. Here (I didn't bring this up to date since release 37) you see that some cell lines were only updated once or twice, but many were updated four to seven times since they were created, and then you have a long tail of cell lines which are updated almost every few months.
Now, important information: parent-child relationships and sister (autologous) cell line relationships. The parent-child relationship is easy to explain: when a cell line is derived from another cell line, it's a child of that cell line. We have a hierarchy statement which is called "derived from". So, for example, the cell line MCF-7 TaxR is derived from the cell line MCF-7; easy to understand. And yes, hybrid cell lines can have more than one parent; they can have two, three or four parents. So a sizeable part of the Cellosaurus
entries describe cell lines derived from another cell line. People have derived thousands of different cell lines from MCF-7 and from others; from HEK293 there are probably more than 2,000 derived cell lines. You can see here, in the entry of a cell line, all of its children, and when you go to one of these derived cell lines, it of course has a parent but it can itself have children, so you can have more than one generation. Here is a nice graph, made using Wikidata, of MCF-7, which has a number of child cell lines which themselves have children and so on.
Now, autologous cell lines are cell lines derived from the same individual. This happens when people derive more than one cell line from different tissues, or from the same person at different times, for example a cancer cell line before chemotherapy and another after chemotherapy, and so on. So you have cell lines like that which are from the same patient or the same animal, and this is indicated: in the entry you have the list of cell lines which are derived from the same donor. That means that FTC-133 is from the same donor as FTC-236 and FTC-238. About 10% of the entries have one or more sister cell lines.
Finally, cross-references and web links (I am not quite finished; there is just one thing after this). We have cross-references to a lot of resources, and that answers the question of whether we cross-reference to this cell bank or that cell line database: yes, of course. Today we reached 99 different resources which are cross-linked from Cellosaurus, and you have about 400,000 cross-references inside the Cellosaurus, so about three on average per entry, which is quite a lot. They can be to cell line collections, cell line databases, biological sample resources, everything you want which has information on that cell line. Cell line collections of course make up a major part, but it can be a lot of different things, and you have them classified by topic in the web view. So to answer
again the question "where can I buy my cell line?": well, you look at the cell line collections, and you see this cell line is sold by ATCC, by BCRC, which is in Taiwan, and by BCRJ, which is in Brazil. And you have cross-references to cell line databases and resources, anatomy and cell type resources, biological sample resources, chemistry resources, gene expression, polymorphism and sequence databases, and so on. Web links are just links to pages on sites which are not full resources, or which have very few entries, or which have inconsistent URLs. So it can be a page on the website of a lab which is using a cell line and describing it; just a web page.
Now I go really to the last part of the content, the references, which are of course quite important. As I said, we try to include the references that we have used to annotate the cell line, so also the ones which describe the establishment and characterization of the cell line, and others which have information that was useful to build the entry. But it's not an attempt to capture all papers which make use of a cell line; if we did this, HeLa alone would have up to 1 million papers, which would be completely crazy. So these are papers which are interesting because they contain information which contributes to the entry you're looking at. There are about 23,500 different publications cited, as I said before, and of all types. The great majority are articles. Patents are in fact not something to forget: a lot of hybridomas and some other engineered cell lines are found in patents. And there are book chapters, meeting abstracts, books, etc., and theses, which are not frequent. But the good thing about a thesis is that people have space to describe what they have used: often you will have a paper with three lines on the cell line, and the thesis from the PhD student has ten pages describing the cell line much more in depth. And fortunately more and more theses are online. You can have four types of identifier:
you can have a PubMed ID, a DOI, a patent number, or a number we created ourselves for everything which doesn't have a PubMed ID, a DOI or a patent number; there are not a lot of those. So you see most publications are found in PubMed, 92%, and 81% have both a PubMed ID and a DOI. You have some papers which are not in PubMed and have a DOI; those are mostly papers on fish and insect cell lines, because the literature on fish and insects was not really captured by PubMed, which was a medical database, so for a long time they didn't really care about fish and insects. The same holds for plants, but as we don't cover plant cell lines we don't have this issue. And only a few publications do not have any of these identifiers, because they are old papers or abstracts.
Number of citations per year: you see there was a climb, then there was a decrease, and now again a climb. It's not an artifact of the curation process; there really was a decrease, because people were no longer publishing papers describing new cancer cell lines or things like that, and since then there has been a rise in papers describing iPSCs, so it's a new bump. The fact that this curve is jagged is probably an artifact of curation, but the fact that it's growing again is because of iPSCs.
Okay, how to make use of Cellosaurus. You can go on the web. I'm not going to do a full demo now, except if we have time after, but basically the most important thing on the page is the search part, and I will explain a few things about the search, how it is deficient now, and what we're going to do to make it better thanks to the position we now have. You just Google Cellosaurus and you find it. As I said at the beginning of the talk, it has been online since 2015, and it's now used by 3,000 people per day (3,000 sessions), and we reached 10 million page views about a week or two ago, so that's quite a lot. It's increasing, as you see. You see the dip of
Christmas quite well: at Christmas and New Year people stop working. You see a smaller dip at spring break; sociologically it's interesting. And you see this dip here (I guess everyone sees my cursor now? yes): this dip is the beginning of the COVID lockdown. Suddenly people left the lab and were not yet organized to work from home, so you had a big dip over a few days when confinement started in most countries.
Now, I'm going to skip this, because I can maybe come back later to show an entry with its different parts, and you can go yourself and look at an entry. I will concentrate first on text search, because that's quite important. We have a full-text search option, meaning that you type in something and it searches for it. For those who are technically inclined, it's using Apache Lucene, which is a text search engine. Of course you can search for a cell line name, but you can also search for everything else which is inside the database: catalogue numbers, disease names, species, authors, titles, PubMed IDs, sequence variations, transfected genes and so on. Everything is indexed, so you have the search bar and you just type in.
But text search is great and it's also a great way of not finding the right thing, or of finding too many things. What I mean is that it's text, so you should use as precise a term as possible, and that precise term can often be an accession number. If you're looking for a species, it's better to use the taxonomic identifier of that species rather than typing its name. I give you an example: if you type dog, yes, you will find some cell lines, but if you type 9615, you find all of the cell lines which have the tax ID of dog. Now you could say, but why can't I type dog? Well, you may for example pick up an article whose title says "study of cat and dog cell lines" and which, because it's speaking of dog and cat, is in fact
referenced inside a cat cell line entry, and so you also get this cat cell line in your dog search. Same thing if you want SV40-transformed cell lines: you had better search for the tax ID of SV40 rather than "SV40", because otherwise you will pick up every article title which contains SV40. If you want a knockout for CD4, don't look for CD4, use HGNC:1678; or if you're looking for an antibody against CD4, the protein, look for P01730. Same thing for cisplatin. Cystic fibrosis: here I search for cystic fibrosis and I get 102 hits (yes, I'm recording, okay), so you have here only the cell lines which are from cystic fibrosis patients. If you want to see those with a given deletion, you can type the HGVS description of the deletion; you could also type in the ClinVar number, but here the deletion description is generally precise enough to get what you really want. So, in a nutshell: try to use the most precise terms that you can.
But it's not perfect, and our plan is to do two things, which I will describe briefly now and in more detail later if I have time. One is to still have a full-text index, but to be able, like you can do in PubMed (but people generally don't), to tag what you want to search and say in which field you want to search it. So for CD4, for example, you could say I want to look for it only inside the transfected field, so it will look for the name CD4 only there and not inside the title of an article; or if you want to look for cystic fibrosis, search only inside the disease name and not elsewhere, and so on. And the second thing is to build what we call a SPARQL endpoint; I will describe this a bit later by showing an example of what it means using Wikidata.
Okay, another tool which you can use, and which has nothing to do with what we just saw, is a similarity search tool for STR profiles. This was developed by Thibault Robin, who was a PhD
student and who developed a program which takes the STR profile that you provide and searches for similar profiles among those we have in Cellosaurus. It works on human, mouse and dog STR markers, the three species for which STR markers exist. It accepts all the different STR markers which people have used over the years, and it offers a number of different algorithms; I won't go into them. If you're not a specialist of STR profiles, just don't change anything and use the defaults. It's a bit like BLAST: if you're not a specialist, don't touch the default parameters, which are already preset for people who are not experts. If you're an expert, do change everything you want. You can run it on one profile or in batch mode, using an input file with several cell lines, and for those who are more inclined to do it programmatically, you can use it through a RESTful API; it's really fast. So you basically enter your values, either with a file or by typing them in, you search, and it comes back with a list of cell lines and their percent match. Those which are in red are cell lines which are known to be contaminated. Here you see a lot of cell lines at 100% or 99%, because they are all either HT-29-derived cell lines or cell lines which have been contaminated by HT-29. You have the list of the hits, with the differences between each of them and the profile you entered, and you can click on the red ones and it tells you it's a problematic cell line, and so on.
Now I will describe, in four to five minutes, a project we did, which will also lead to what I was telling you about the SPARQL endpoint and making very precise searches. Most of you have probably heard of Wikipedia; maybe not all of you have heard of Wikidata. Wikidata is basically a type of counterpart of Wikipedia, not in text but in database mode. Instead of having a text saying whatever you want, that Lake Léman is the biggest lake in western Europe and so on, you would have statements saying Lac Léman is a
lake, with a given surface. So basically it's a database of all kinds of information, of everything you can think of, which is based on what are called triples. A triple is basically a subject, a verb and an object: Lac Léman is a lake; its surface is 350 square kilometres (I don't know what it really is, this is probably wrong, but anyway). In Wikidata people are trying to enter all kinds of information, including biological data. So five years ago we initiated a project to develop a bot, a software tool, which populates Wikidata with cell line concepts, each concept being equivalent to a Cellosaurus entry, meaning a cell line. There was a student, Lélia, who developed this, and now there is a bioinformatician from Brazil, Tiago Lubiana, who is updating this bot. What we do is try to enter into Wikidata a little bit of the information on cell lines, and now inside Wikidata you have as many cell line concepts as there are cell lines in Cellosaurus.
Now, not everything which is in Cellosaurus can be modeled in Wikidata. For example, we cannot put STR profiles; we can put diseases, but we cannot put mutations yet, because not all the mutations are there, and so on. So we have a number of information items; I'm not going to go through all of them, but of course name and synonyms, accession, species, sex, category, disease and so on, and cross-references to CLO and hPSCreg, and PubMed cross-references. Now that it is in, you can look at it inside Wikidata, which is not very useful in itself, because what you find is much less than in Cellosaurus. If you go to this entry, XP100 TMA, it tells you it's a cell line (yes, great; in French, in German, it tells you it is a cell line), it is a finite cell line, it is autologous, so it has a sister cell line, it has two papers, it's established from a medical condition which is xeroderma pigmentosum group C,
from Homo sapiens, male; it can be found in CLO with this accession number, it's in Cellosaurus with this accession, and so on. So you could say, why go there, there is much less than in Cellosaurus? Yes, but it's linked to all of the other data in Wikidata, which means you can do what we call federated queries across different sets of data, and you can use the hierarchy of the different ontologies which are used inside Cellosaurus. For example, if I go back to xeroderma pigmentosum group C: yes, you can search inside Cellosaurus for xeroderma pigmentosum group C, but you cannot search in one go for every disease which affects the skin, like xeroderma, or for every disease which involves sensitivity to light. In Wikidata you can do this, because the hierarchy is built into all of the objects you enter. So you can write mini programs, called SPARQL queries, where you say: I want cell lines, which are objects which have a Cellosaurus ID, which are linked to a disease, and this disease is linked to something called Q6013981. Of course you cannot know what that is, but if you query it you will see that it means carbohydrate metabolism disorder. And the disease must be annotated with an Orphanet ID. So what I'm doing here is searching for all cell lines which are associated with a disease which has an Orphanet identifier and which belongs to the category of carbohydrate metabolism disorders. Currently I cannot do this from the Cellosaurus, because if you type carbohydrate metabolism disorder you won't find anything: the cell lines are annotated with the specific disorders themselves. But if you do this in Wikidata you get the cell lines, and you can see this one is pyruvate carboxylase deficiency, this one is galactosemia, and so on. So it's a nice way, currently, to do this type of very complex search even though we don't have this capability in Cellosaurus. The good news is that in about a year we will be able to do this from the Cellosaurus too, but currently you can
do this through Wikidata.
Now, downloading the Cellosaurus. You can download it by FTP in three formats: OBO, structured text and XML. The OBO format: if you don't know what OBO is, don't even bother about it; it's for people who are really building ontologies and use this format, or the OWL format, which can be converted from OBO. We don't distribute OWL, but people can convert from OBO to OWL. The OBO file doesn't contain all of the data, because it cannot contain things like STR profiles, age, full references and so on. So basically the two main formats are text and XML. And here again I'm getting into more complex things: the fact is that the exact URLs of the cross-references are not inside the file; you have the cross-references, but you have to build the URLs yourself if you want to instantiate them. Basically what I'm trying to get at is: if you're a programmer, use the XML version; if you want to look at an entry, use the text version. XML is the one which allows you to parse everything and store everything into a database. You can download it; here I was describing the XML version.
There are additional files which are distributed with it, which are more or less obvious from their names: the release notes, the list of deleted accession numbers, the frequently asked questions and so on. There are also the abstracts for all those publications which have neither a DOI nor a PubMed ID, so for which you cannot get the abstract unless you go to the original publication; we OCRed some of those abstracts, so they're available. You can download all of this, and of course it's free, as I said at the beginning. You can also download each individual entry in structured text: when you're looking at an entry, at the top of the entry it says "save the text version", and you get just that entry in text mode, with the entry itself and the publications, with the full record for the
publications.
Now, also for those who are more programmatically inclined, you can get all of those files on GitHub. If you are a GitHub user you know what it is; if not, you probably don't need it. It doesn't contain the Cellosaurus XML file, which is too big for GitHub, because GitHub is meant to trace the evolution of software tools, not really to hold a database. So archiving the database is done somewhere else, with a tool called Yareta, which is the institutional repository of Geneva University. You have there all of the releases of the Cellosaurus which have been archived, from release 2 to now, but from release 2 to 32 you don't have the Cellosaurus XML file. So you can get any release and download it.
Now, the last two things. First, what can be added? A lot of things can be added. What we're thinking about is to cover patient-derived xenografts, because it's currently only cell lines and not patient-derived xenografts. Another thing which would be useful, and even more useful now that people are studying more and more all kinds of viruses for the next pandemic or the next epidemic, is which cell lines are susceptible to infection by which viruses, because people want to use cell lines to study different types of viruses. For example, when work started on SARS-CoV-2 there was a mad rush to test hundreds of cell lines to see which could be used, and what people used was the existing information on cell lines which were susceptible to SARS-CoV or other coronaviruses, thinking that maybe they could also be used, which is quite logical. So experimental information on the susceptibility of cell lines to infection by viruses would be nice to capture, but this would probably need a specific grant and one person really working full time on it. So for the moment this is not something planned directly for 2022; maybe 2023 if we get a grant for it.
Information on biosafety level would be useful for industry. Again, to go back to the question which was asked earlier, that's maybe something which can be added as a teaser for industry to pay for a license to help develop the Cellosaurus: we add the biosafety level, because they really want to know the biosafety level of the cell lines. Now, structuring information: as I said already, all of the tissue, organ and cell type information should be linked to Uberon and the Cell Ontology, and the age information should be linked to a developmental ontology, so that you can search for all cell lines from a teenager or younger, or all cell lines for different types of developmental stage. We also want to restructure the information on cell line availability, because currently there are only the cross-references to cell line collections and the discontinuation field, and also some information on lots of cell lines. And maybe also, to go back to the question which was saying "but if it's not in a collection, how do I find where it is?", if someone tells us "we are distributing this cell line, you can tell people to mail us", we could add the email address of people who are willing to redistribute their cell lines, because some people do, and for the moment we cannot indicate it.
Tools: a lot of things will be done, because we now have a software developer who can work on Cellosaurus and improve the website. We want to have forms for users to enter preliminary new cell line entries, so to submit them. I told you about the SPARQL endpoint for advanced search and federated queries, but there will also be advanced text search and, for developers, a Turtle (Terse RDF Triple Language) version, which allows people to use Cellosaurus in the context of what is called the semantic web; an API to query or download the database, so that you can do very specific queries in a programmatic way and retrieve exactly what you want; and an alerting system for newly discovered contamination or misidentification, which again is something that industry needs to have. So all of these are developments which can be done, I mean now
that we have someone on board to do this.
Which leads me to the people who work on Cellosaurus. Elisabeth Gasteiger is one of the software developers for Swiss-Prot, and for the UniProt group in general, and was one of the first persons to develop ExPASy when it started in 1993. At every release of the Cellosaurus she writes or changes the scripts which make it possible for you to access it on ExPASy. Thibault developed the CLASTR tool during his PhD studies. Alain Gateau, who retired just a few months ago, was a software developer in our group, the CALIPHO group, and he developed the tools that translate the Cellosaurus from text format to OBO and XML and that check that the syntax is okay. I mentioned already Lélia and Tiago for the Wikidata bot. And Pierre-André Michel has been in the group for a long time already, but he started working on the Cellosaurus a couple of months ago, and he's starting the long road towards the development of all the software tools I was telling you about. I mentioned already the former secretary general of ICLAC; she just stopped being secretary general this year because it was too much work, but she is still a detective, a cell line detective, trying to hunt down all problematic cases for Cellosaurus. And then there are all of the people who are answering my questions or who are submitting information to Cellosaurus.
You can ask questions now of course, but also afterwards, using the Cellosaurus email address. You can also follow Cellosaurus on Twitter: we post a lot of tidbits of information on what is new in Cellosaurus. It's not spam; basically everything in that feed is useful information for people using Cellosaurus, and it's not a lot of tweets, so you're not going to be spammed.
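The point made in the search part of the talk, that typing "dog" can pull in a cat cell line and miss real dog cell lines while the taxonomy identifier 9615 does not, can be illustrated with a small toy sketch. This is only an illustration under stated assumptions: the three records, their accessions and the field names are invented for the example, and the matching is a plain substring test, not the way the Cellosaurus search is actually implemented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    """Toy cell line record; all field names are hypothetical."""
    accession: str
    name: str
    species_taxid: str
    reference_titles: List[str] = field(default_factory=list)


# Invented example records (not real Cellosaurus entries).
ENTRIES = [
    Entry("TOY_0001", "Toy dog kidney line", "9615",
          ["Establishment of a canine kidney cell line"]),
    Entry("TOY_0002", "Toy cat lung line", "9685",
          ["Study of cat and dog cell lines"]),
    Entry("TOY_0003", "Toy canine fibroblast line", "9615",
          ["Characterization of canine fibroblasts"]),
]


def full_text_search(entries, term):
    """Return accessions of entries whose concatenated text contains the term."""
    term = term.lower()
    hits = []
    for e in entries:
        text = " ".join(
            [e.accession, e.name, e.species_taxid, *e.reference_titles]
        ).lower()
        if term in text:
            hits.append(e.accession)
    return hits


def field_search(entries, field_name, value):
    """Return accessions of entries where one named field equals the value."""
    return [e.accession for e in entries if getattr(e, field_name) == value]


if __name__ == "__main__":
    # "dog" matches the cat entry through a paper title and misses TOY_0003.
    print(full_text_search(ENTRIES, "dog"))
    # The taxonomy identifier, restricted to the species field, is precise.
    print(field_search(ENTRIES, "species_taxid", "9615"))
```

In this toy data the free-text term returns one dog line plus the cat line, while the identifier returns exactly the two dog lines, which is the behaviour that field-tagged search is meant to give.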