Thank you very much. First of all, we would like to thank you for attending this talk, because we know that all the sessions are very interesting at the same time, so it is difficult to choose. My name is Sara Nieto, and my colleague is Pilar de Teodoro; we both work for the European Space Agency at the facility in Madrid. Today we are going to talk about how we work with the Postgres database in the context of astronomy, in a very exciting project: the Euclid mission.

This is the overview of the presentation. First we will go into the context of the European Space Agency, then we will present the application that runs on top of Postgres, which is the Euclid archive. We will present the different tests that we have done on the database side, and finally we will present the conclusions that we got from this work.

First, an introduction to the European Space Agency. ESA is an intergovernmental organisation; it has 22 member states so far, and Canada is an associated member. These are the different ESA establishments and facilities across Europe. We currently work at the facility in Spain, near Madrid, which is ESAC; it hosts the science operations centres and the archives for the different ESA missions.

Just to put some figures on it: last year ESA managed a budget of more than 5,000 million euros, distributed across the different projects, Earth observation missions, launchers, and many other projects that you probably already know, like Rosetta and many other very exciting missions. But today we are going to focus on the Postgres database work that we have performed in the context of the Euclid mission.

Euclid is a spacecraft that will be launched at the end of 2020, and its main goal is to understand the nature of dark energy and dark matter, which make up 95% of the universe, where only 5% is considered ordinary matter. It is a very exciting mission. It carries two different instruments: an optical instrument, which is basically a camera taking images from space, and a near-infrared instrument that will take images as well as spectra. So we have different kinds of data coming from the Euclid mission.

Just to give some details: it has a 1.2-metre telescope, and the spacecraft will be put into a stable orbit to avoid any kind of perturbation. ESA is of course responsible for the mission, but it is a joint collaboration between ESA and what is called the Euclid Consortium, in which the US is also participating. It has more than 2,000 members and more than 130 institutes participating; it is a huge collaboration.
So it requires a massive coordination effort across different data centres. Let's go into the production of the data. Euclid will produce an amazing volume of data: we are talking about petabytes of information, up to 26 petabytes per year. It will also produce a catalogue, which is basically a collection of the objects seen in the sky; the first master version will be up to 10 billion rows, collecting different metadata such as the brightness of the stars, the position of the stars, and many other metadata that can be interesting for doing science.

As I said, at ESA we host the archives of the different ESA missions. The archives are basically the applications that store the data, but also publish it to the user community and to the public, and provide the proper interfaces to do science with this data.

In the context of Euclid, this is basically the architecture. We have two different subsystems. One is the purple one, which is devoted to the processing: all the products that are produced inside the different data centres are registered against this archive for the processing. Then, according to the different public releases, they are transferred from this purple system to the blue one, which is the science archive system; that is the one hosted at ESA and the one whose implementation we are currently working on. The purple one, the data processing system, is developed by one of the science data centres inside the Euclid Consortium, located in Holland, and as I said the scientific archive system is ESA's responsibility and is developed at ESA in Madrid.

Going back to the data releases: Euclid will have three data releases, and the first one will come 26 months after launch. We will have at least 10 terabytes of data in the catalogue and up to five petabytes of images that we have to store and serve to the public. So we have a very big challenge to address inside the Euclid archive. Now Pilar will explain the details of the Postgres testing that we have performed in the context of the archive.

Hello. The idea is that in our system the Euclid mission is split in two: as Sara said, we have the processing system and the archive system. Even though we are working on the same mission, the development teams are different: the people who developed the processing system are in Holland, and we at ESAC are in Madrid, so the teams are different, but we can also use different databases.

The idea and the topic of this presentation is how to transfer the data between Oracle and Postgres, because later we will see (and maybe for another PGCon) whether we finally convert this Oracle database to another database provider, probably Postgres. But the idea of this presentation is to talk about the metadata transfer system: how are we going to pass the data from Oracle to Postgres?
Why Oracle as the source database? Well, in Holland they were using an astronomy tool called Astro-WISE that was based on Oracle, so it was easy to make the prototype in Oracle. On the other hand, at ESAC in the archives team we are used to working with Postgres; most of our databases are Postgres, and we have the knowledge in the team. This is also a database we can rely on for the future, because, as you can imagine, every mission and every project has different phases: development, operations, and then the post-operations part that serves the data, as Sara explained, with releases first to the consortium, then to the public, and then a legacy archive. The budget is very high at the beginning and keeps growing, but then it drops very fast, so for the legacy phase our budget will be low, and Postgres will do a good job for us to continue with.

So different tests have been done. We have used Oracle GoldenGate, foreign data wrappers (using materialized views and also CREATE TABLE AS SELECT), and finally files that are stored in the Postgres repository, as Sara will explain later. But first, let's have a look at our data, because it is not very simple.

This is a reduced part of the Euclid common data model that is running today; here we are only showing what is seen by the optical telescope, the visible-instrument images. You can see that Oracle is being used as an object-oriented database. I forgot to mention that even though this is developed in Holland, everything will run in Spain: from next year we will be the masters of everything. Right now we are a replica of them, using Oracle Data Guard for this, but then we will be the masters in Spain.

So you can see the complexity here, and we have to simplify it to come to the archive. This is our model for the archive system, which responds to the needs we have and is also similar to other archives, so we have experience with it.

So, how do we transfer this data from one database provider to another? GoldenGate: how many people here have used GoldenGate? Very good. And what is your experience, do you like it? Good, okay.

Just one slide to explain the hardware that was used for the tests. Our budget for development is limited right now, so for these tests we just used one node with 128 gigabytes of RAM. There are more databases on it, but we configured one Oracle 11g instance with 20 gigabytes for the SGA, and Postgres, the latest version, with 20 gigabytes for shared_buffers. We use SSD disks and 16 CPUs.
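For reference, the Postgres side of a test node like the one just described could be set up along these lines; only the 20 GB of shared buffers comes from the talk, and the remaining values are illustrative assumptions, not the speakers' actual configuration:

```sql
-- Illustrative configuration sketch only; apart from shared_buffers = 20GB,
-- every value here is an assumption rather than the real test setup.
ALTER SYSTEM SET shared_buffers = '20GB';         -- comparable to the 20 GB SGA on the Oracle side
ALTER SYSTEM SET effective_cache_size = '64GB';   -- assumption: roughly half of the 128 GB of RAM
ALTER SYSTEM SET random_page_cost = 1.1;          -- assumption: SSD storage
-- shared_buffers only takes effect after a server restart.
```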
Coming to Oracle GoldenGate: for those who don't know it, GoldenGate is a very powerful tool that Oracle bought in 2009, and the idea was to replace Oracle Streams. I don't know if anyone here has worked with Oracle Streams, but those who have know it is a pain. The idea is that GoldenGate is much easier to use, and it is really good. It can be used with different databases, not only Oracle, and it can do unidirectional or bidirectional data replication; for our purposes the idea is of course to move from Oracle to Postgres, in one direction only.

The idea with GoldenGate is that we have a source database, Oracle in our case, and a target, Postgres. We define an extract process that listens to the changes, the DML and DDL changes, done in the Oracle database. These are written to trail files and sent to the target destination, and then another process listening on another port, called the replicat, applies them to the target database. This is pretty simple, and it is straightforward once your configuration is done. The problem is that initial configuration; once it is done it is pretty straightforward, it keeps the trail files, you can reprocess, and so on. It is very powerful.

We have some slides that can serve as a tutorial for Oracle GoldenGate, apart from the numbers we are going to see, but as we only have 50 minutes we will put them in the final slides, which will be uploaded I think next week. So in case you are interested we can show you later.

The idea with GoldenGate, then, was to get some numbers. For example, we ingested a table of 700 megabytes with 6.7 million rows that keeps the pixel-level information of the Euclid camera. It is not very big, but it shows what we have. Ingesting this data through a database link from another Oracle database takes about five minutes, but when we pass it to Postgres using GoldenGate it only takes 25 seconds for transfer and apply. So this is pretty fast: it comes to about 28 megabytes per second.

So it is good, but what is the problem with Oracle GoldenGate? Pricing. This is a joke in Spanish: "Oracle" read the other way around is "el caro", which means "the expensive one". So it is clear what happens with Oracle.

It has a lot of pros, because it is almost real-time replication, and the support from Oracle, we know, is good. If someone asked which database you would like to use with no constraints, okay, Oracle, but maybe you cannot afford it, because you have to look at the long term. So the cons would be the pricing, and another important point is that GoldenGate does not replicate views, because if you make any update through a view it actually goes through the table behind it. It can replicate materialized views, but that would mean that on the DPS side we have to ask the developers of the other system to create those materialized views for us, and to update and maintain them accordingly. So there are workarounds, but none is straightforward.
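Very roughly, the extract and replicat processes described above are driven by parameter files of the following shape. This is only a sketch: the process names, schemas, trail path and the target connection are placeholders, and the exact options for a heterogeneous Oracle-to-Postgres setup depend on the GoldenGate version, so treat it as an outline rather than a working configuration.

```
-- Extract parameter file on the Oracle side (all names are placeholders)
EXTRACT ext1
USERID ogg_user, PASSWORD ********
EXTTRAIL ./dirdat/ea
TABLE EUCLID.*;

-- Replicat parameter file on the Postgres side (DSN and mapping are placeholders)
REPLICAT rep1
TARGETDB euclid_pg_dsn
MAP EUCLID.*, TARGET public.*;
```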
Foreign data wrappers: I am sure most people here have worked with foreign data wrappers, am I right? How many people have worked with them? Yeah, more than with GoldenGate, so that is clear.

Foreign data wrappers are used to connect different database providers: Postgres to Postgres, Oracle to Postgres, MySQL to Postgres, MongoDB to Postgres, CSV files, and so on. In the archives team we have experience doing it with MySQL, with Oracle, and with Postgres to Postgres. It is very powerful, and the good thing is that we do not have to take all the information from the source database: what the foreign data wrapper does is connect to the server and pull the data from it. Also, if we do not apply any function or anything on top, the execution plan used is the Oracle one, which is very good for us because the query is resolved in Oracle. Just to acknowledge it: the Oracle wrapper is written by Laurenz Albe, and the version we were using for these tests is 1.5.1, with Postgres 9.6.2, as I said, and Oracle 11g.

Another very important point for us, from the development point of view, is that we have to make a mapping between the tables in one system and the other. As we saw, the DPS system is very complicated, so in the end we asked them to create some views, to check how this would work, and those views match our system in the SAS, the scientific archive system. So we have foreign tables reading from the origin that match our tables, although the information can also be split across several tables. This is very powerful for us.

One thing we like very much from 9.5 onwards is IMPORT FOREIGN SCHEMA, because before you had to declare each table explicitly, making the mapping with the data types and everything; with IMPORT FOREIGN SCHEMA and the LIMIT TO clause, out of roughly 1,000 tables it is very easy to limit the import to only some of them and ingest them into the schema we need.

Then, once you have the foreign data wrapper in place: only if your table is very small will you query it directly against the source every time. The normal thing is that you create a foreign table and then materialize it, to be able to index it, add more columns, or add functions to columns, or you use CREATE TABLE AS SELECT, which is what existed before materialized views, the lazy-update approach; I am sure you have used it as well. The good thing about creating a table instead of using a materialized view is that we can use triggers on the table.

One important point of this approach is the refreshes. What is the problem with the refreshes? If we refresh a materialized view we take a lock on it, so we will not be able to access it while it is being refreshed. That is why REFRESH MATERIALIZED VIEW CONCURRENTLY was introduced. It is very good because the data stays accessible, and the way the refresh is done is by transferring all the data to a temporary table and comparing the result with the original one. What is the problem? First of all, it needs a unique index to make those comparisons, but it also takes longer, and of course, if a vacuum is not done after the concurrent refresh, since another temporary table has been created, the size of the database will grow: instead of having a 70-gigabyte database you can end up with 700 gigabytes and not know why, so this is something that has to be kept in mind.
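As a rough illustration of this workflow, the setup looks something like the following. All identifiers (the server, schema, table and column names) are hypothetical placeholders invented for the example, not the actual Euclid schema, and it assumes the oracle_fdw extension is installed:

```sql
-- Hypothetical example: connect to the Oracle DPS database and materialize one view.
CREATE EXTENSION oracle_fdw;

CREATE SERVER dps_oracle FOREIGN DATA WRAPPER oracle_fdw
    OPTIONS (dbserver '//dps-host:1521/DPS');

CREATE USER MAPPING FOR CURRENT_USER SERVER dps_oracle
    OPTIONS (user 'euclid_ro', password 'secret');

-- Since PostgreSQL 9.5: pull in only the tables/views we need
-- instead of declaring every foreign table and data type by hand.
CREATE SCHEMA staging;
IMPORT FOREIGN SCHEMA "DPSVIEWS"
    LIMIT TO (vis_frame)
    FROM SERVER dps_oracle INTO staging;

-- Materialize locally so the data can be indexed and extended.
CREATE MATERIALIZED VIEW vis_frame_mv AS
    SELECT * FROM staging.vis_frame;

-- A unique index is required later for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX ON vis_frame_mv (object_id);
```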
On the other hand we have CREATE TABLE AS SELECT, where you apply the changes yourself, with some logic behind it, manually, but it works.

Now for some numbers. This is the data type mapping done by the Oracle foreign data wrapper; we are lucky, because we do not have many problems with the data types that we use, but this can also be important. So let's see some of the tests we have done.

First of all, as I said, it is very important to know what we will have in the source database: whether it is a table or a view that is resolved there. If we have a table, for example, we create the foreign table using the server connection, pointing to a schema and a table name; then we can create the table in Postgres and do an insert, because this can be done in separate steps, right? Doing this insert, the same ingest we did with GoldenGate takes longer here, about 50 seconds. This test was run several times, but this is just one number.

The other way is to create the table from scratch every time we need it, and we can also add other kinds of columns, for example a sequence for an id or whatever; it takes more or less the same time. The good thing is that if we insert more data, say another 6.5 million rows, the time remains constant, so it is also important to know that it does not grow over time just because our table is bigger. If we do it with a materialized view it takes a little bit less time. The thing is that here you cannot see the execution plan very well, but it is resolved at the source; you will see it in the slides if you download them.

Then, for the refresh of the materialized view, we create the index and compare the plain refresh with the concurrent refresh, and we really see a difference here; if we keep adding rows, the difference becomes even more evident. So for large tables it is not really recommended, although it is a very powerful feature.

If instead we use a view at the origin: our views are quite complicated, we use CTEs, common table expressions, to connect the different tables, and as we have so many tables that inherit from one another the CTEs are very powerful for connecting them. With views we see that the time is really much longer than with a table at the origin, and we can see in the execution plan how this is resolved on the Oracle side.

It is also very important to note that a SELECT COUNT on the origin database, in Oracle, takes only 10 seconds for this table, but if we run it through the foreign data wrapper directly, to measure the transfer latency, it is roughly three times slower. Of course this is synchronous access, so depending on your requirements it can still work very well.
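The variants compared in these tests can be sketched as follows, continuing the hypothetical names from the previous example; the timings quoted above obviously depend on the real data, hardware and network:

```sql
-- Variant 1: a plain local table filled from the foreign table with INSERT ... SELECT.
CREATE TABLE vis_frame_copy AS
    SELECT * FROM staging.vis_frame WITH NO DATA;   -- copy the structure only
INSERT INTO vis_frame_copy SELECT * FROM staging.vis_frame;

-- Variant 2: recreate the table each time, optionally adding local columns
-- (here a synthetic id), which also makes it possible to attach triggers.
DROP TABLE IF EXISTS vis_frame_ctas;
CREATE TABLE vis_frame_ctas AS
    SELECT row_number() OVER () AS local_id, t.*
    FROM   staging.vis_frame t;

-- Variant 3: the materialized view, refreshed either plainly (readers are blocked
-- while it runs) or concurrently (readers keep access, but it needs the unique
-- index created earlier, takes longer, and should be vacuumed afterwards).
REFRESH MATERIALIZED VIEW vis_frame_mv;
REFRESH MATERIALIZED VIEW CONCURRENTLY vis_frame_mv;
VACUUM ANALYZE vis_frame_mv;
```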
Then we can consider these lazy updates: every time we want to make a release in Oracle, we put a flag, let's say release number 1.2, and then we pass that to Postgres through a control table that just says where we are (a small sketch of this idea is shown further below). This would be very powerful as well, but we can also take the idea of file ingestion instead.

So, just the conclusions from these tests. If you can afford an asynchronous transfer between Oracle and Postgres, the easiest way is to use materialized views, because they are really easy to implement, or GoldenGate, because once the configuration is done it works really well. The fastest is GoldenGate, then the INSERT ... SELECT, more than the materialized views really. If you want no ad-hoc refresh scripts and nothing to touch afterwards, that is GoldenGate, keeping the network latencies in mind. And always stay in the relational model: even if we have to press the other team to create materialized views and maintain them, we need tables or materialized views on the source side to pass the data, rather than reading complex views from Postgres against Oracle, because the joins are done in the origin database and the explain plan ends up doing a lot of work there. In the meanwhile, we will continue using the Oracle foreign data wrapper, because there is a chance that we change Oracle to another database provider, and we are pushing hard for that to be Postgres, but we have to see how we are going to handle these roughly 1,000 tables of the object-oriented model to be converted. Okay, so let Sara continue with the ingestion system.

Thank you. As Pilar has explained, given that the implementation on the origin side is quite complex, because we have a very large data model there and many, many tables involved in the implementation on the Oracle side, we also explored the option of feeding the scientific archive system directly from files. This is the work that we have done: we have a mechanism that allows us to ingest the XML metadata coming from the different data centres into PostgreSQL, and we have done different tests, run on a PostgreSQL single instance and on Postgres-XL, given the amount of data that we are going to manage.

This is the high-level description of the ingestion mechanism. We have the XML metadata files coming from the data centres; the first step is the binding of these XML files, then we have the mapping, and then the ingestion into the database. That is basically the process.

As Pilar explained, this is the data model at the origin; as I said, this is only the part that corresponds to the visible instrument of the spacecraft, so there is another part corresponding to the other instrument, and so on, and it contains a lot of information about the processing. That is why it is more complex than the scientific archive system. Okay, and this is our model.
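Going back to the release-flag ("lazy update") transfer that Pilar sketched, and which comes up again in the questions at the end, a minimal illustration could look like this. The control table, the release column and the other identifiers are invented for the example and continue the hypothetical names used earlier:

```sql
-- Hypothetical incremental transfer driven by a release flag set on the Oracle side.
-- The WHERE clause is pushed down to Oracle, so only the flagged rows are pulled.
INSERT INTO vis_frame_copy
SELECT *
FROM   staging.vis_frame          -- foreign table, resolved on the Oracle side
WHERE  release = '1.2';           -- only the rows flagged for this data release

-- Record how far we have transferred in a small control table.
UPDATE transfer_control
SET    last_release   = '1.2',
       transferred_at = now();
```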
We have produced the mapping: as Pilar already explained, in the case of the mapping between Oracle and Postgres it is done in a different way, while in the case of the file ingestion mechanism we have to feed our system with a specific configuration file that produces this mapping. What kind of information is in this mapping? Basically how to map the entities, how to map the fields, and how to map the data types.

We have produced some figures for the ingestion. We have ingested up to 50,000 files into the PostgreSQL 9.6 single instance, and it is very fast, in fact. What we have realized is that most of the time, the most demanding part of the ingestion, is the reading and parsing of the file, given that each file is quite large and we have to parse it, then filter it, and then ingest it into the Postgres database. For the single instance we measured 56 milliseconds for a single sequential scan of a table of 200,000 rows. So, as I said, the most demanding and time-consuming part of the process is the unmarshalling, which is basically the parsing and reading against the schema; the insertion is basically zero compared to the other steps.

We have done the same tests with a Postgres-XL 9.5 configuration. This is the first cluster we installed: one global transaction manager, one coordinator and three data nodes, with the same capacity on the three nodes. One thing we figured out with the schema is that, in order to import the same schema we have on the Postgres single instance into Postgres-XL, we have to make some changes to it: basically we have to decide the criteria for splitting the tables across the different nodes. That is why it is not straightforward to move a more or less complex data model onto Postgres-XL; it is not a direct process, and you have to look into the specific queries you are going to run in order to speed up not only the queries but also the ingestion.

These are some figures. From the ingestion point of view we did not get a very large improvement, but what we have seen, of course, is that it is about three times faster on the queries. Regarding Postgres-XL, we have tested different configurations so far, and we have found that we need to find the balance between the number of nodes and the volume of information that you want to distribute across them, because this type of architecture introduces overhead. So we have to keep that in mind and run different tests according to your specific needs.

Apart from the tests with the relational model, we have done the same test with a single table, and of course we obtained very good results with Postgres-XL: for a single table of 16 million rows, a select over everything takes only 88 milliseconds in Postgres-XL, compared to 540 milliseconds on the single instance, so it is much faster.

So these are the final conclusions about the ingestion testing with Postgres-XL: the schema loading and unmarshalling (I am talking about the ingestion process) is the most demanding phase, and Postgres-XL improves the query performance by a factor of three, but as I said it requires specific modifications to the schema, so you cannot import it as-is into Postgres-XL.
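The schema change being described is essentially choosing a distribution strategy per table. A hypothetical Postgres-XL definition of a catalogue-like table (columns and names invented for the example) would look roughly like this:

```sql
-- Hypothetical Postgres-XL table: the distribution key must be chosen from the
-- queries you actually run, which is why the single-instance schema cannot be
-- imported as-is.
CREATE TABLE catalogue_objects (
    object_id  bigint PRIMARY KEY,
    ra         double precision,   -- right ascension
    "dec"      double precision,   -- declination
    mag        real                -- brightness / magnitude
)
DISTRIBUTE BY HASH (object_id);    -- spread the rows across the data nodes

-- Small lookup tables can instead be copied to every node:
-- CREATE TABLE filters (...) DISTRIBUTE BY REPLICATION;
```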
So what is next? We have to scale out this Postgres-XL infrastructure and keep increasing the nodes and the coordinators in order to find the best balance, and we will test PostgreSQL partitioning and also another Postgres extension, PG-Strom, which runs on GPUs. These are the next steps we need to follow.

And finally, to conclude, we want to show you the application on top of this database infrastructure, the application that exposes the data scientifically to the user. First of all, this application of course allows scientific access to the data, especially for Euclid. This is how the archive looks: this is the main home page, and from here you can do metadata parametric searches. We provide a specific interface that talks directly to the database. It is important to say that this blank text area here is mainly for ADQL, the Astronomical Data Query Language; behind it there is a translator, let's say, that translates the specific astronomical functions into SQL. For that we use specific Postgres extensions such as pgSphere and Q3C.

This is the list of queries that the user has already run, so the user can rerun queries, download the results, share the results with the rest of the community, and so on. It also allows you to display the results online, in the form of a table, let's say. As we explained before, the user can look at specific stars, for instance in a given region of the sky and with a specific magnitude: for example, give me the stars brighter than a certain magnitude. You can retrieve the results here, so you do not have to download them onto your local machine.

And of course, as I said, Euclid will also produce maps, and this is the online visualization that we will have for the Euclid maps. It will allow you to explore the maps dynamically, to overlay the specific stars that fall in a region of the sky, and to see specific metadata on the sky dynamically. The idea is basically to expose all the data in a very easy way to the user, but also to make it possible to do science with it, so it is really amazing.

And this is basically the overview of the technology that we are using: we rely on Java, and we use the Google Web Toolkit for the web application; of course we use Postgres, and Virtual Observatory interfaces that allow us to interoperate our application with other interfaces in the astronomy context; and finally there is the layer we rely on for authentication purposes.
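To make the ADQL-to-SQL translation mentioned above a little more concrete: an ADQL cone search (of the CONTAINS(POINT(...), CIRCLE(...)) = 1 kind) ends up as something like the following Q3C query. Table and column names are placeholders continuing the earlier example, and the coordinates are arbitrary:

```sql
-- Hypothetical catalogue table with a Q3C index for positional queries.
CREATE EXTENSION q3c;
CREATE INDEX ON catalogue_objects (q3c_ang2ipix(ra, "dec"));

-- All objects within 0.1 degrees of (RA, Dec) = (266.4, -29.0) brighter than magnitude 18.
SELECT object_id, ra, "dec", mag
FROM   catalogue_objects
WHERE  q3c_radial_query(ra, "dec", 266.4, -29.0, 0.1)
  AND  mag < 18;
```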
So thank you so much for attending. If you have any questions, we are happy to answer.

[Audience question.] Yes, this is another very good question. In principle the idea is to bring the science close to the data: the infrastructure plans to provide specific cloud storage and facilities to allow users to run their own scripts close to the data, and that basically means close to the database.

[Question about how the transfer process is triggered.] We do not use triggers so far. That is also why we are doing these tests with Postgres-XL, and we want to look at other possibilities as well, like Citus Data, because so far our database is only 300 gigabytes and it has to grow a lot, so we are really at the beginning.

[Question about whether this is the metadata.] Yes, this is metadata, and it will grow to much more than that, because across the different data releases the data is going to be reprocessed. So it is not only the data that comes from the spacecraft; the reprocessed data should also be made available online.

[Question comparing Oracle and Postgres performance.] The thing is that there are different data models and different implementations, so it would mean comparing different things. Currently we are also doing some benchmarks, because there is a request to study possibilities other than Oracle by the end of this year, and the first step is to convert the data model, which is not trivial, because the Oracle one is object-oriented; it is not straightforward to do it for Postgres, because there are a lot of references that we have to translate. We are still in this process; probably we can present something next year.

[Question about spatial extensions.] Yes, pgSphere and Q3C are very powerful, and HEALPix as well; HEALPix is this way of dividing the sphere into small areas, again and again, hierarchically, and it is also very powerful.

[Question about PG-Strom.] Well, we have not looked deeply into that yet, but it is in our plans, because basically we are talking about large volumes of data, we need to know how to cope with that, and we need to speed up the queries as much as possible. That is the reason we need to look into this extension. First of all we need to allocate the hardware to run the tests; once we have started this testing we will continue enlarging the Postgres-XL architecture to get the final results, and probably the next step will be to address this PG-Strom option.

[Question about why Postgres was chosen.] This is a very good question, in fact, because we thought about that. Basically, as we explained, at the ESAC Science Data Centre we wanted to reuse the experience that we already have with Postgres; that is mainly the reason why we explored this option instead of other ones, but we are not closed to that. I can add that there is another mission using Postgres-XL, the Gaia mission, so we can learn from their experience.

[Question: for a single bulk load, were the foreign data wrappers more efficient than GoldenGate?] No, GoldenGate was the most efficient. [Follow-up.] Probably not, I don't know, we have not tested that; but as we are not using that kind of data, for us it would be straightforward to move from one table to another, so it is not a complex model in that sense.

[Question about how the transfers are planned for each release.]
Well, for the first time we will start from scratch, but basically before each data release we will have to transfer some data; before that, it will only be made available to the consortium members first, in the same way, basically using the same procedure. Well, actually GoldenGate is very clever, but in ways we will not use.

[Follow-up.] So maybe your question is more focused on the triggering mechanism, how to start the process. No, okay.

[Question: so you are not actually replicating changes to the tables; the transfers are discrete, you take the whole table and move it?] Well, not the whole thing; we can only do it, as we said, per release. Imagine there is a release: we put a flag on the Oracle side saying "only this part", and then we use a WHERE clause to select the rows we want to move.

[Question about the views on the Oracle side.] Yes, because we have put the scientific views, let's say, on top of the data processing system, so there is this layer on top. Also because on the Oracle side there is row-level security and so on, so not everybody can see everything, and this makes things more complicated. These views on top of the Oracle database help us a lot to know what we have to transfer.

[Question about table sizes and partitioning.] This is something that we would like to address, because the table sizes we are managing, especially for the catalogues, go up to 10 billion rows, and depending on the number of columns that determines the size; we manage something like 10 terabytes for the catalogue table, so we will have to split that, of course.