Hello students. As we get close to finishing this course, you have already had a glimpse of the variety of high-throughput technologies available. For example, we talked about microarrays and label-free biosensors, and you have had some exposure to next-generation sequencing and mass spectrometry. All of these are contributing towards big data generation, and many times you might be thinking that unless you have access to a lab with all these high-end instruments, you cannot do your projects or your research in this area. That is partially true: if you want to generate new data on a kind of sample which nobody has tried yet, then you have to run your sample and analyze the data. But what is very important here is that nowadays there are many public repositories available, and these repositories provide the raw data files to the entire scientific community. That is one of the big shifts worldwide: all the journals are now making it mandatory to upload the raw data files, and after publication the data becomes available in public repositories. So you can access a lot of NGS data, mass spectrometry data and microarray data, all in raw format. You do not always need to perform your own experiments and generate data with your own equipment. You can start with the public repositories, obtain the data files, process them in a uniform manner, decide on the hypothesis you want to test, and see whether these datasets are able to answer that question or not.
So I must say that the availability of various public repositories and omics datasets has really shifted gears for the entire omics community. There is still a set of people doing wet-lab experiments and generating data, but there is now also a set of scientists doing core bioinformatics and proteogenomic analysis, putting things together and utilizing these datasets in a very uniform manner. So the question arises: what are the resources from which you can obtain these datasets? This information is very important, because once you know the places from which you can obtain datasets, you can start doing many of these interesting analyses, processing and visualizing the data yourself. So in today's lecture and hands-on session, Mr. Biswas, a research scholar in my proteomics lab at IIT Bombay, will take you through different portals from which these data can be extracted for further analysis. So let us have today's lecture and demo session.

First, we all know that even 10 or 20 years ago the amount of data generated was nowhere near as huge as what is being generated today. In 1953, Watson and Crick were the first to propose the double-helix model of DNA based on X-ray data, and this can be taken as a first milestone of data generation. After that, in 1955, the sequence of the first protein to be analyzed, bovine insulin, was published, followed in 1970 by the first alignment algorithm, the Needleman-Wunsch algorithm. In 1975, a major breakthrough happened when Microsoft Corporation was founded by Bill Gates and Paul Allen. Further, in 1988, the National Center for Biotechnology Information, the NCBI with which we are all familiar, was established. In 1997, the genome of E. coli was published, and after that came the biggest breakthrough of all, the publication of the human genome project around 2004. After the publication of the human genome project we never looked back: the amount of data generated was huge, and we are still generating huge amounts of data.

So now let us talk about the different databases that are available. To give a very simple example, a database is a collection of data; that can be proteomics data, genomics data or metabolomics data, and if we talk about other fields, there are databases available in astronomy, ecology and the cosmic sciences as well. So what is the main role of these databases? The most important roles are the availability of biological data, the systematization of the data, and the analysis of the curated biological data. As we know, with the advancement of technology and experimental strategies, omics experiments and omics platforms have developed in such a way that the amount of data we are generating is huge. On that basis, databases have long been segregated into different forms and levels according to data type, data source and database design. On the basis of data type there are a number of databases, like genomic databases; microarray databases, where you will get raw and pre-processed microarray files; pathway databases, for example KEGG and Reactome; and disease databases such as OMIM. So now, what are the principal requirements of a public database? First is data quality: the data should be curated and of high quality.
Next is supporting data: database users will need to examine the primary experimental data, either in the database itself or by following cross-references back to a network-accessible laboratory database. Third is deep annotation: supporting and ancillary information should be attached to each basic data object in the database. Next is timeliness, which means the data you are putting into a database should be available on an internet-accessible server within a few days of publication or submission. And finally, integration: each data object in the database should be cross-referenced to representations of the same or related biological entities in other databases. The amount of data we are generating now is huge, and a term used very frequently for such data is "big data". So where does the term big data come from? The term has been in use since the 1990s and is credited to John Mashey. Big data usually means datasets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process within a tolerable elapsed time. Big data comes with four V's: volume, variety, velocity and veracity. So now let us come back to the databases that are available for proteomics and genomics. Let us take an example: a very popular database is PRIDE, the PRoteomics IDEntifications database, which is a public, user-populated proteomics data repository. The repository contains data generated by mass-spectrometry-based proteomics experiments, which includes raw spectral data, peptide and protein identifications with associated statistics, and even the parameter files used for generating or processing the data. PRIDE supports the submission of data generated from many platforms in a specific data format known as PRIDE XML. So let us come to another database, PeptideAtlas.
The long-term goal of the PeptideAtlas project is the full annotation of eukaryotic genomes through thorough validation of expressed proteins. Several related databases are available on the website, like SRMAtlas and PASSEL, which contain datasets from different targeted mass-spectrometry approaches; PhosphoPep, which, as the name itself suggests, contains phosphoproteomics data; and UniPep. These databases contain different levels of proteomics data and information which can be accessed and downloaded. Let me show you a glimpse of ProteomeXchange, which is a consortium bringing together different repositories, including PeptideAtlas, MassIVE, PRIDE, jPOST and Panorama. ProteomeXchange will help you download raw and processed data, so let me give you a small hands-on on ProteomeXchange. Let us explore ProteomeXchange, and I will show you how you can use it for downloading datasets from different proteomics experiments and how you can use those datasets in your further work. First, search for ProteomeXchange in Google and click the first result. As you can see, the ProteomeXchange Consortium webpage opens; you can read about the consortium here and understand how ProteomeXchange is a hub where the different datasets and repositories available worldwide are interconnected to give you better access for downloading data. In ProteomeXchange there are three tabs: the first is for the public data, the second is for data submission, and the third is subscribe. The data submission tab is important when you want to submit data generated from your own proteomics experiments.
But for now we will choose the public data tab for accessing the data. We will click the access data tab, and a search page opens where you need to enter keywords for searching datasets. Here we can use the advanced option and search on the basis of title or dataset, or even on the basis of instrument: if you want to download only experimental datasets acquired on a Thermo Fusion, or you want data from a Q-TOF, you can mention the respective term here and do an advanced search. Let me show you, taking just one example, how to download a dataset. If you come to the search results, you will find that each dataset has a dataset identifier, which contains a unique PRIDE ID. This PRIDE ID is unique for each experiment, and if you click it, it will redirect you to the page for downloading the data. Apart from this, in the title column you will find a short title for the dataset and the repository name, whether it is a PRIDE, jPOST or MassIVE repository. Next is the species, so you can also filter by species if you want data only from Homo sapiens; then the instrument, the publication, and the lab the data is from, with the lab head's name given here. Let us choose the third one and try to download this dataset. After clicking, you will be redirected to the PRIDE interface, and you will find that PXD014971 is the unique PRIDE ID for this dataset. Several pieces of information are available about the dataset, like the announcement date, the description of the experiment, the spectra list, the modification list and so on. But if you come to the publication list and click there, the page will redirect you to the publication page at NCBI, and you can check directly which paper it is and what the materials and methods, results and experimental evidence of the paper are.
You can relate this with the dataset available here, and it will help you to download the datasets. As we can see, this is a Panorama Public dataset, which means this dataset is available in Panorama. So let us try to get a dataset from PRIDE. We can use this one; this is another dataset whose unique PRIDE ID is PXD013955. Here also you can see all the details and the publication list, and there are two links available: one is the dataset FTP location and the other is the PRIDE project URL. You can use either of these links, and it will redirect you to the directory from which you can download the dataset. For downloading this dataset, you first need to know the naming conventions and which files are available. Most of the time you will get this information from the publication's supplementary files, so you need to download the supplementary files, check the sample table or sample information present there, and on that basis decide which files you need to download. To download a file, just click on it and it will start downloading, as simple as that. But when the file sizes are huge and you need to download the complete dataset, clicking and downloading each file individually will be a little tough; for that you can use a Linux operating system and fetch the files from the FTP location in bulk. As you can see, downloading a complete experimental dataset for a publication is easy, and you can even go for multiple datasets, download them, integrate them and do further analysis. Let us go to another database, ArrayExpress.
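As a small sketch of the bulk-download step just described: the public PRIDE FTP archive is organized by year and month (an assumption to verify for your dataset; the project page shows the exact FTP path), so a short script, rather than manual clicking, can fetch many files at once. The accession date and file names below are illustrative placeholders, not taken from the dataset shown in the demo.

```python
import os
import urllib.request

def pride_file_url(accession, year, month, filename):
    # Assumed layout of the public PRIDE FTP archive:
    # .../pride/data/archive/<YYYY>/<MM>/<PXD accession>/<file>
    return ("https://ftp.pride.ebi.ac.uk/pride/data/archive/"
            f"{year}/{month:02d}/{accession}/{filename}")

def download_all(urls, dest_dir="pride_data"):
    # Fetch each file once, skipping files already on disk.
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        dest = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(dest):
            urllib.request.urlretrieve(url, dest)

# Hypothetical file list, as would be taken from the publication's
# supplementary sample table:
files = ["sample_01.raw", "sample_02.raw"]
urls = [pride_file_url("PXD013955", 2019, 7, f) for f in files]
print(urls[0])
```

Calling `download_all(urls)` would then fetch everything in one go; the same loop works equally well on a Linux machine with `wget`, which is what is usually meant by using Linux for bulk downloads.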
ArrayExpress is a functional genomics data store where high-throughput functional genomics experiments deposit their data, and you can download this data and use it for your own analysis. Let us check how you can download data from ArrayExpress. First type ArrayExpress into Google and click the first search result. This is the ArrayExpress web page; it is a functional genomics archive, but here you will find data from both genomics and proteomics. Most of the data available here are from RNA-seq and microarray experiments, and some even from mass spectrometry. Let us try one example of how to download data from ArrayExpress. I am typing "glioma" and searching for it in ArrayExpress. After the search is over, we can see there are multiple datasets available: a total of 1168 experiments match the search for glioma. 1168 is a huge number and we cannot download all the files, which means we need to apply some filters. For this we select the filter option on the search results, and as you can see there are multiple filters we can apply. Let us choose Homo sapiens for the organism. Then for the experiment type let us choose protein assays, further choose mass spectrometry assay, and filter the data. After filtering the complete data with these filters, we can see that there is only one dataset present which has proteomics data, since this repository is mainly based on genomics data. So let us try something with genomics: we can choose RNA assay, select all technologies, and filter the data. After the filtering is over, you can see that 671 experiments were found, with multiple datasets and experiments available.
Now if we click on "assays", it will sort the results by the number of assays. You can select which experiment and dataset you want, and you can download the dataset by just clicking the accession number. After clicking the accession, you will be redirected to another page which gives the details of the experiment: the title of the experiment, followed by contacts, a description of the experiment, the different samples, and which files are available. The most important part is which files are available here. The first is the investigation description. The second is the sample and data relationship file; this is the most important and crucial file that we need to download from ArrayExpress, because it is a kind of metadata file which gives you the complete information on how samples map to data files. Next is the raw data: if you want to download the raw data (as you can see, there are 152 files), you need to click the raw data link and it will take you to a page where you can download the complete raw data. So you can see how ArrayExpress can also be used for downloading a dataset. Let us move to another publicly available database which is very informative, and which in proteomics we should know about to a large extent: the Human Protein Atlas, HPA. It is a Swedish program, started in 2003, with the aim of mapping all the human proteins in cells, tissues and organs using an integration of various omics technologies, including antibody-based imaging, mass-spectrometry-based proteomics, transcriptomics and systems biology. All the data in this knowledge resource is open access, so that scientists both in academia and industry can freely use the data for exploration of the human proteome.
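The sample-and-data-relationship (SDRF) file mentioned above is a plain tab-separated table with one row per sample/file pair, so it can be inspected with a few lines of code before deciding which raw files to fetch. The following is a minimal sketch on a mock table; the column names are illustrative, and real SDRF headers vary between experiments.

```python
import csv
import io

# Mock SDRF content; a real file would be read with open("E-XXXX.sdrf.txt").
sdrf_text = (
    "Source Name\tCharacteristics[organism]\tArray Data File\n"
    "sample1\tHomo sapiens\ts1.CEL\n"
    "sample2\tMus musculus\ts2.CEL\n"
    "sample3\tHomo sapiens\ts3.CEL\n"
)

# Parse the tab-separated table into one dict per sample row.
rows = list(csv.DictReader(io.StringIO(sdrf_text), delimiter="\t"))

# Keep only the data files belonging to human samples.
human_files = [r["Array Data File"] for r in rows
               if r["Characteristics[organism]"] == "Homo sapiens"]
print(human_files)  # ['s1.CEL', 's3.CEL']
```

Filtering the metadata first, and only then downloading the matching raw files, is exactly the workflow described in the demo: the SDRF tells you which of the 152 raw files you actually need.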
The Human Protein Atlas has been broadly classified into three major atlases. The first is the Tissue Atlas, which contains information on the expression profiles of human genes at both the mRNA and protein level. The protein expression data from 44 normal human tissue types is derived from antibody-based protein profiling using immunohistochemistry. Next is the Cell Atlas, which provides high-resolution insights into the spatio-temporal distribution of proteins within human cells. The protein localization data is derived from antibody-based profiling by immunofluorescence confocal microscopy, using a panel of 64 cell lines that represent various cell populations in different organs and tissues of the human body. So let us try to explore the Human Protein Atlas. As we now understand, the Human Protein Atlas has a Cell Atlas, a Tissue Atlas and a Pathology Atlas; let us try searching for one protein across these atlases and see the huge amount of information we can get about a protein. Search for the Human Protein Atlas in Google and click the first result; you will find a search dialogue box on the first page, and you can also read each of the tabs, Tissue Atlas, Cell Atlas and Pathology, to get more information. Let us try one example: insulin. As you can see, when I search for insulin, all the genes that match insulin are listed here. There are multiple columns present which give you lots of information, like the gene description, the protein classes, and what information is available in the Tissue, Cell and Pathology atlases. Now let us incorporate some more information: if you click the advanced option, you will find that there are multiple tabs available here.
For instance, I can check the evidence level of the proteins: whether a protein has HPA evidence, whether it is present in UniProt (that is, has UniProt evidence), and whether it has any MS evidence. By ticking these options and clicking apply, you can see that all this information has been incorporated into the columns. Now if I choose the first one, INSR, the insulin receptor, it has evidence at the protein level in both HPA and UniProt, and it also has MS evidence; and if you just put your cursor on the green box, it will show you what the evidence level is. For example, for MS evidence it shows "evidence at protein level". Now let us click on INSR, the insulin receptor, and explore what different information is available for this particular protein. The first thing we can see is the general information for this protein: the gene name, the protein class and even the localization. Next, the Human Protein Atlas information is given: RNA tissue category, protein evidence and protein expression. The protein evidence says that evidence is present at the protein level. Another thing I want to point out is that the final annotated protein expression given here is estimated on the basis of antibody data, RNA-seq data and available protein and gene characterization data. If we move on, we find there is a reliability score present here, and it says "uncertain". What does this mean? On the basis of the reliability score, proteins are divided into four categories: first enhanced, second supported, third approved and fourth uncertain.
Enhanced means that one or several antibodies with non-overlapping epitopes targeting the same gene have been validated using an orthogonal or independent antibody validation method. Supported means consistency with RNA-seq data or protein characterization data, in combination with a similar staining pattern if independent antibodies are available. Approved means consistency with RNA-seq data in combination with inconsistency with, or lack of, protein or gene characterization data; or, alternatively, consistency with protein or gene characterization data in combination with inconsistency with RNA-seq data. Finally, uncertain means inconsistency with, or lack of, RNA-seq and protein characterization data, in combination with a dissimilar staining pattern; in that case they give the reliability score as uncertain. If we move down, we find the RNA expression profile and the protein expression score. As we can see, this profile gives a great deal of information on the expression level of a particular gene at both the RNA and the protein level. Likewise, if we go further down, we find the RNA expression profile and the protein expression profile given for this protein in different tissues, and to get more information we can click each of these tabs. Next, the protein expression overview and the RNA expression overview give a complete view of the expression of the protein, in terms of both protein and RNA, across the 44 different tissues. And finally, if we come down further, we get more information regarding the gene, such as the protein browser. The protein browser displays the antigen location on the target protein and the features of the target protein.
The tabs at the top of the protein view section can be used to switch between the different splice variants to which an antigen has been mapped. So as you can see, there is a lot of information available even in the single Tissue Atlas tab of HPA. Now let us move to the next one, the Cell Atlas. The Cell Atlas gives more information regarding the localization of the protein. As you can see in the first tab, the predicted localization of this protein is intracellular and membrane; apart from that, it also gives the main location of the protein, which is in the vesicles, as approved with the help of the indirect immunofluorescence microscopy images. The same information is also present for the mouse cell line, where the protein is found to be located in the Golgi apparatus. If you explore like this, you will find much more information regarding the protein in the Cell Atlas. The third one is the Pathology Atlas. The Pathology Atlas, as the name itself suggests, is based on different diseases, and the status of this protein in different diseases in terms of RNA expression is given here. If we click renal cancer, we find lots of information available about the disease: the number of patients, how many are alive and how many are dead, the sex ratio of the patients, and so on. Even the survival-rate curve is available, along with patient information in terms of age and survival status.
One of the important features of the Human Protein Atlas is that most of the information is self-explanatory if you just put your cursor over the "i" (information) icon present next to each item: if you want to know about a particular part, just hover over it and it will give you an explanation. So the Human Protein Atlas contains a great deal of information on even a single protein. Now, we have talked a lot about different data and databases, and there are many more databases you can explore that contain different kinds of information. By now you know that there is a huge amount of data available in public repositories and databases which can be extracted and used for analysis. There are many big research groups and large funded programs, like the Human Protein Atlas, The Cancer Genome Atlas (TCGA), and various labs from the Broad Institute of MIT and Harvard, that are sharing their entire raw data in different databases. Also, as I mentioned, all the journals are now making it mandatory to provide the raw data files, so you have access to a large number of very good quality datasets. All you have to do is extract the data and perform different types of analysis. If you know which kind of analysis is appropriate for which type of dataset, then you do not have to rely on generating your own data all the time. Even while sitting in a small college somewhere in a remote part of India, or anywhere in the world, you can start your own experiments on your own computer, and you can start coming up with very fascinating hypotheses which can be truly transformational in nature, because now you are working on the raw data and looking at hypotheses which nobody has tested.
Think about The Cancer Genome Atlas: they provide datasets from thousands of cancer patients, along with a lot of clinical data and patient follow-up, such as which drugs were given to these patients, what their responses were, and which patients showed recurrence of the tumor. So there are many questions one could start looking at: what was the effect of a given drug on one subtype of the cancer population; what percentage of patients showed tumor recurrence, which subtype they belonged to, and what kinds of genes were expressed in them; and can you associate any of that with patient survival or tumor recurrence? Many interesting things can be done; these are actual, real research projects on your own platform, your own laptop, your own system, which do not require generation of new datasets. I hope this information regarding the availability of databases and resources, as well as the different tools available for analysis, is going to be very practical and useful for your own research.
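To make the survival questions above concrete, here is a minimal sketch of the Kaplan-Meier estimator, the standard method behind the survival curves shown in resources like TCGA and the HPA Pathology Atlas. The toy cohort below is entirely made up for illustration; a real analysis would use the downloaded clinical follow-up tables.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.
    times  : follow-up time for each patient (e.g. months)
    events : 1 if death was observed, 0 if the patient was censored
    Returns [(t, S(t))] at each time where at least one death occurred."""
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)            # everyone who exits follow-up at t
    at_risk = len(times)
    s, curve = 1.0, []
    for t in sorted(leaving):
        if deaths[t]:
            s *= 1 - deaths[t] / at_risk   # survival drops only at deaths
            curve.append((t, s))
        at_risk -= leaving[t]              # censored patients leave silently
    return curve

# Toy cohort: four patients; the third is censored (lost to follow-up).
curve = kaplan_meier([2, 3, 3, 5], [1, 1, 0, 1])
print(curve)
```

With this cohort the survival estimate drops to 0.75 after the death at month 2, to 0.5 after month 3 (the censored patient at month 3 reduces the at-risk count but not the curve), and to 0 after the last death; comparing such curves between, say, drug-treated subgroups is exactly the kind of question the clinical follow-up data lets you ask.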
Another important point about obtaining these datasets is that you can always do meta-analysis: you can start comparing datasets obtained from an Indian population versus a Caucasian population, or within the Indian population, say the eastern region versus the western or northern region. You can start doing demography-based analysis as well, which is otherwise not possible; many investigators only look for samples from their own region, but you can go across continents and start looking at the data in a very comprehensive way. You can also start integrating datasets: you need not look only at proteomics data from different regions; why not start integrating proteomics, metabolomics and transcriptomics data along with genomic information to really get the systems-level picture, which is otherwise not possible from individual investigators? So a lot of exciting things can be done in this day and age of computationally driven science without having access to the instrumentation and technology. I hope the understanding you have gained from this course will not limit you to relying on instruments and generating your own data, but rather that you will start looking at data in a bigger context, performing meta-analyses and making biological sense of the interesting questions you always wanted to address. I hope you will be able to use some of these repositories and databases in your own research. Thank you.