OK, now we can really start. This talk is about Freezer. Let me quickly go through the components. There is the freezer API, the component that stores the metadata about backups and jobs, and there is the freezer agent, which executes the backup and restore actions on the nodes. Another important component is the freezer scheduler, a long-running daemon that retrieves job information from the API, writes metadata about the executions back to the API, and runs the freezer agent accordingly. It provides scheduling features and also a kind of orchestration and dependency handling.

Currently, the storage media supported by the project are three: Swift, an SSH remote node, and a local file system, for example an NFS-attached volume. The interesting part is that you can store the data in parallel to more than one storage medium. So, for instance, you can store the data on two Swift servers in parallel, with independent credentials, and you can also do Swift plus SSH in case you want to restore your data even if the API, Swift, or Keystone is not available. You can do two or three, pretty much any combination. That is the general overview of the architecture; now we are going to focus more on the most significant things we did and on what we are going to provide going forward.

The first interesting feature is job sessions. I think it is quite a unique feature, because it allows the user to execute synchronized backups on multiple nodes, whether the nodes are physical or virtual machines. In cases where your data set is spread across multiple nodes and you have to reduce the inconsistency risks, you can leverage job sessions to execute synchronized backups on those nodes. How does this work, quickly: there are n jobs belonging to the same job session ID, and the agents poll the API; when an agent detects a job set tagged with its job session ID, the backup and its actions are executed accordingly on all the nodes. So how do we manage the fact that the data is changing simultaneously on multiple nodes? That is probably the biggest thing. There is no 100% safe solution to that; what can be provided is a reduction of the risk of data corruption. In Freezer, the first task in the job is to create a file system snapshot on all the nodes, in as small a time window as possible. This assumes that all the nodes are synchronized with an NTP server.
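To make this concrete, here is a minimal sketch of what such a job definition might look like. This is not the authoritative schema: the field names (job_actions, freezer_action, session_id, snapshot, and so on) approximate the Freezer JSON job format from memory and should be treated as assumptions.

```python
# Hypothetical Freezer job document for one node of a job session.
# Field names approximate the Freezer job schema; treat them as
# illustrative, not authoritative.
job = {
    "description": "nightly backup of the shared data set",
    # Jobs on different nodes sharing this id are picked up together
    # by the schedulers polling the freezer API.
    "session_id": "sess-shared-dataset-01",
    "job_actions": [
        {
            "freezer_action": {
                "action": "backup",
                "mode": "fs",
                "path_to_backup": "/var/lib/app-data",
                "container": "freezer_backups",   # Swift container
                "backup_name": "app-data-node1",
                # Snapshot first, so every node backs up a point-in-time
                # view; NTP-synchronized clocks keep the snapshot windows
                # aligned across the nodes.
                "snapshot": True,
            },
            "max_retries": 3,
        }
    ],
}
```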
After the snapshot of the file system is taken, within the smallest possible time window as I said, we have a point-in-time data snapshot; then the freezer agents on the multiple nodes can take the backups, all belonging to the same job session.

Another feature that is to be delivered, currently under review (if you fancy a code review), is the block-based incremental. The way we are providing block-based incrementals is by using an rsync approach: we are doing an rsync implementation in Python. It is slightly different, because with rsync you need to have the source and destination files available on the file system, while in our approach the freezer agent basically runs through the file system and, for each file, generates a signature hash for each block, which is stored in the backup metadata. On the next run that metadata is retrieved and the hashes are matched against the current execution: if there is a change in a block, the block is saved, otherwise not. So this approach can be used with an object storage.

We are providing two algorithms. The first is a rolling checksum with a one-byte shift; this is useful to detect repetitions of blocks within the file at any offset. The problem is that it is time and resource intensive; on the other side, it provides bandwidth and storage efficiency. The other option we are providing, to speed things up, because the rolling approach is, let's say, medium to slow, is a modified algorithm: rather than having a one-byte window to check the existence of a block within the file, we shift block by block, so the block size is the shift rather than a one-byte offset. This greatly reduces the hash computation and also the matching computation.

Another interesting thing is that the restore is in-stream, just like the backup. When you restore the data, even incrementally, there is no need to download the data, prepare it, rebuild the whole data set, and then move it to the location where the application reads and writes; there are, let's say, no additional space requirements and costs for the restore. Of course, the computation is executed on blocks of data in memory, so there is some memory usage.
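As a rough illustration of the block-size-shift variant, here is a minimal Python sketch, not the actual Freezer implementation: hash fixed-size blocks, compare them against the manifest saved as metadata on the previous run, and select only the changed blocks for upload. The rolling one-byte-shift variant would additionally slide a weak checksum over every offset, rsync-style, which is what makes it so much more expensive.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # illustrative block size (4 MiB)

def block_signatures(path, block_size=BLOCK_SIZE):
    """Yield (offset, sha256 hex digest) for each fixed-size block.

    The window advances a whole block at a time (the "block-size
    shift"), which is far cheaper than the one-byte rolling shift.
    """
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield offset, hashlib.sha256(block).hexdigest()
            offset += len(block)

def changed_blocks(path, old_manifest):
    """Return offsets whose hash differs from the previous run, plus
    the new manifest to store as this run's backup metadata."""
    changed, new_manifest = [], {}
    for offset, digest in block_signatures(path):
        new_manifest[offset] = digest
        if old_manifest.get(offset) != digest:
            changed.append(offset)  # this block must be uploaded
    return changed, new_manifest
```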
Now let's get to the challenges a bit. These are the questions everybody has, the same question for every service: how do we scale, and what does scaling mean in backup, restore, and also disaster recovery? How can we execute incrementals on more than 500 terabytes? The reality is that lots of users, companies, and environments have storage well beyond 500 terabytes. So how do we approach this? The truth, being totally honest and transparent, is that the rolling checksum, the rsync approach, needs to re-read all the data every time to generate the block hashes, the block checksums to be compared with the hashes from the previous execution. This is a big issue, because re-reading 500 or more terabytes every day to compute the incrementals is hard to do, mostly time intensive, so the current approach doesn't work at that scale.

We are working on two options. The first one is to get the block hashes from the backend storage; of course, the backend storage needs to support that, or we leverage the features of some advanced file systems like ZFS or even Btrfs. But we believe we have arrived at the point where we need to write drivers to better integrate with more advanced commercial proprietary storage solutions that can provide these features. The idea is that when a block is written to disk, a technology like the storage itself keeps a hash table; we would retrieve that hash table and match against it, so we do not need to re-read all the data and recompute everything. The other option is the rsync algorithm change I mentioned before: rather than using a one-byte window shift, we use the block size. This is faster than the rolling one-byte checksum, but we still need to re-read all the data; it works in case the drivers are not provided or for some reason you cannot use them. So we are opening up to the business with the drivers, and we are also providing out-of-the-box solutions that work.

Another cool thing we are working on is deduplication, because, whether the solution is open source or not, any real enterprise-grade, advanced backup solution provides deduplication; this is the reality. So what are the challenges? The first challenge is how to achieve cross-tenant deduplication: if another tenant, completely unrelated to you or to the current tenant, has a block of data that is exactly the same as yours, the deduplication still happens. If we implement this kind of deduplication successfully, and we are going to do it, there is no need to use incrementals anymore, so it is quite a big change in the paradigm. There are other commercial solutions doing it, but we are really pushing hard to provide something equal or better. Another important thing is that we need to maintain a global table of the hashes with the indexes: each block of data belonging to every tenant needs to be indexed by its hash, and we need to keep that table and query it very fast. So it is likely that we are going to introduce a new component called freezer-dedup. The basic workflow is that it gets the data from a stream, computes the hashes, and sends the data to the storage only if the hash does not match one of the existing hashes in the global hash table. This should run, most likely, close to where your data is, so we get bandwidth efficiency.
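Since freezer-dedup is still to come, here is only a sketch of the workflow just described; global_index and store are hypothetical interfaces standing in for the fast global hash table and a content-addressed blob store.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # illustrative block size

def dedup_ingest(stream, global_index, store):
    """Read a data stream in blocks, hash each block, and ship a block
    to storage only when its hash is not already in the global
    (cross-tenant) hash table."""
    recipe = []  # ordered hashes needed to rebuild this stream
    while True:
        block = stream.read(BLOCK_SIZE)
        if not block:
            break
        digest = hashlib.sha256(block).hexdigest()
        if not global_index.contains(digest):  # hypothetical lookup
            store.put(digest, block)           # new block: upload it
            global_index.add(digest)
        recipe.append(digest)  # recorded whether it was a dedup hit or not
    return recipe  # saved as backup metadata; restore replays it in order
```

Running this close to where the data lives means duplicate blocks never cross the wire, which is where the bandwidth efficiency comes from.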
Of course, there are risks and limitations for cross-tenant deduplication. What are they? Encryption: if every tenant encrypts the data with their own key, it is a problem, because when another tenant needs a block of data that has been encrypted with a different tenant's key, it does not have that key, so the block cannot be restored. Encryption with the same key over all the data increases dedup efficiency, so this is one risk and limitation. Also, the same compression algorithm needs to be used for all tenants. This will be transparent for the users: we are going to implement something that automatically detects which compression algorithm was used for each block, but to deduplicate across tenants we need to use the same compression algorithm.
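One plausible way to do that detection, sketched here purely as an illustration (this is not Freezer code), is to look at the well-known magic bytes at the start of a block:

```python
def detect_compression(block):
    """Guess the compression algorithm of a block from its magic bytes."""
    if block[:2] == b"\x1f\x8b":
        return "gzip"
    if block[:3] == b"BZh":
        return "bzip2"
    if block[:6] == b"\xfd7zXZ\x00":
        return "xz"
    return "unknown"
```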
Another issue is quota management. Why? Because with this approach we are going to have data shared across all tenants, so quota management will not be easy, nor will knowing which part of every single data set belongs to which tenant. We are going to handle that in Freezer, but it is something that needs to be managed. Another potential issue is that the cross-tenant data goes to the same bucket, so it is not possible to use advanced security strategies like encrypting and splitting the blocks across different storage media and different locations, so that even by obtaining one block an attacker could do nothing. This is a big problem, because we will most likely have to place all the tenant data in one single bucket.

Now we move on to the disaster recovery side, and we will talk again later. Thanks. Hello. The Freezer approach so far, backup and restore, can fit the disaster recovery part of the customer requirements on some occasions; sometimes the requirements are a little bit more strict on timing and so on, so we started to analyze and think about a somewhat different approach and implement something that is more live. We started a deep analysis of which kinds of problems we are trying to solve when disaster recovery is needed and how to implement a solution that addresses the most common use cases.

Starting from the workloads: one kind is the cloud-native applications, which are usually designed to avoid the disaster problem. They are massively scalable and distributed over multiple data centers, multiple rooms, even multiple continents, and in this case there is little that needs to be done for disaster recovery. There is probably still some need for backups, because some vital data needs to be stored somewhere to protect against human mistakes or losing part of the data. Where disaster recovery is really needed is the case of legacy applications. There is a very long debate about legacy applications in the cloud or not in the cloud; the reality is that the world is willing to move legacy applications to the cloud and then maybe start rewriting them in an appropriate way. In that case, yes, backup is fundamental, and solutions for disaster recovery are needed, because losing even a single virtual machine could completely disrupt the service they are providing. This is what motivated us to search for a solution to implement.

So we started to analyze the possible causes of disasters and identify where those disasters can happen inside the infrastructure. Our clouds need controllers, and controllers are easily made replicable: all the OpenStack services, for example, are made to be distributed quite easily, so losing one controller is not a big issue; no disruption in the service, at least. The storage, if this kind of resilient solution is needed, has to be distributed and scalable anyway, and again losing one single node does not disrupt the service. Compute nodes are a different story, because there is no replicable way of managing them: each virtual machine runs on a single compute node, so if a failure happens on a compute node, we lose virtual machines. The network is another problem: the LAN, the local network, can have issues, and if that happens we can lose some, maybe a lot, of the other components, so human intervention is needed in that case. The internet connection could be another cause of problems: a disrupted service from whoever provides the connection, or even a DoS attack that cuts us off from the internet. These are things that somehow we can try to address.

Going a little bit deeper, we tried to map the causes, the damage, and the possible remediations to address these failures. Controllers: we can lose one or a few of them without disrupting the service; we have issues if we lose an entire site, and since, as we said, legacy applications will usually live in a single data center, losing a site means no application anymore. Compute nodes: losing one node is an issue, but something can be done; if we are using ephemeral storage it is even more complicated, and it is not that advisable to keep that kind of data on the local disks of compute nodes, because if one compute node fails we lose not only the virtual machines but the data too; and if we lose an entire site, there are no compute nodes anymore. The network: as we said, we will try to do our best.

So the first approach is compute node failures. The Freezer architecture, as we said before, allows us, without many changes but with some modularization of the code, to have additional features. So we tried to define the workflow and the functionality needed to implement an automatic failover of the compute nodes. The approach we are taking is to leverage what OpenStack provides as much as possible, starting with a health checker that polls the Nova APIs, fetching the status of the hypervisors and detecting which ones are marked as down. We have the freezer agent, as my colleague said before: by deploying it on the compute nodes, we can gather and provide information on the status of the node. We have Monasca, or other monitoring solutions, Monasca being the OpenStack-specific one, to gather more information and decrease the false positives as much as possible, because for sure we don't want to evacuate a compute node that in reality is still running. We can even go more advanced and detect, through IPMI, the power status of the nodes, or use the health check of a hardware watchdog if available.

Once the failure is detected, the first thing we need to do, and this is mandatory, is to fence the node. Fencing means killing it definitively and being sure that it will not come back, because coming back would cause very bad inconveniences, like a volume being mounted read-write two times, which makes the operating system go crazy. Possibly we also disable the node, or put it in maintenance, so Nova will not even try to use that node anymore. Once the host is fenced, the evacuation can be triggered, or maybe even live migration if we realize it would be useful. Once the evacuation has happened, we notify the impacted users and tenants of that node, because they are the only ones who have access to their own virtual machines and the only ones who can detect whether their applications are up and working again.
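Putting the workflow together, here is a sketch of what the failover logic could look like. Every helper here (nova_reports_down, monasca_agrees, agent_heartbeat, ipmi_power_off, and so on) is a hypothetical placeholder, not a real Freezer or OpenStack client call; the point is the ordering of the steps.

```python
def handle_compute_failure(host):
    # 1. Detection: never trust a single signal. Combine Nova's view,
    #    the monitoring system, and the on-node freezer agent to keep
    #    false positives down.
    if not (nova_reports_down(host)
            and monasca_agrees(host)
            and not agent_heartbeat(host)):
        return  # not enough evidence that the host really failed

    # 2. Fence: make sure the node cannot come back and, for example,
    #    mount a volume read-write a second time.
    ipmi_power_off(host)

    # 3. Disable the service so the scheduler stops using the node.
    nova_disable_service(host, "nova-compute")

    # 4. Evacuate the instances to healthy hosts.
    for server in servers_on_host(host):
        nova_evacuate(server)

    # 5. Notify the impacted tenants: only they can confirm that
    #    their applications are healthy again.
    notify_tenants(host)
```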
So we really have a PoC of this, and it is working quite well. We had to change the internal architecture of Freezer somehow, let's say make the code pluggable, so all of this is going to be implemented as a customizable workflow where the cloud administrator can plug in whichever of these functionalities they want. This is an overview of the architecture implementing this inside Freezer. Behind the freezer API, on top over there, is the normal Freezer backup path: the scheduler and the agent storing backups in Swift or whatever storage is chosen. But the scheduler will also manage a new agent, the DR agent. We identified that this agent could provide the monitoring of the local node; do a kind of graceful fencing, that is, a graceful shutdown of the node if it is not completely failed but we want to evacuate it anyway; execute custom actions, like in normal Freezer, before or after the events happen; and manage the network part that needs to be taken care of in case we are doing this across different data centers, and we are going to see more details on that later. Behind the freezer API there will be a freezer engine, a long-running service that is the orchestrator of all the actions that are going to happen once a disaster has happened.

Once we had a clear solution for the smaller problem of a single node, or a low number of nodes, failing, where we move the workload inside a single site, we started to think about how to do this at a larger scale and have a disaster recovery solution between multiple sites. The first point is to have a segregation, say a division, between the two sites, and availability zones or host aggregates are probably the best way to achieve that. Then we started thinking about what is needed to have a disaster recovery solution, and the biggest problem is the replication of the data: without having the data in the secondary data center, it is almost impossible to restart the workload over there.

There are three main kinds of data that we need to replicate. First, the database used by OpenStack, let's say MySQL: it is possible to have a distributed cluster, depending on the delay of the network and a few other things, and this can be done in a master-master way or in a master-slave way. Second, the block storage: as we said before, the advice is to go in that direction and avoid ephemeral storage if possible, but the standard backends used for storing the volumes need to be replicated over the data centers. Again, there are open source solutions and proprietary, hardware-specific ones to do this kind of replication, and depending on the delay it can be synchronous or asynchronous; it is up to the cloud administrator to design the cloud in the appropriate way to make this happen. The third kind of storage we need to replicate is the object storage, which is usually used at least by Glance as a backend. It is an easier problem to solve than the block storage, and most object storage solutions have the capability of keeping at least one replica of the data in the secondary data center.

So we designed the workflow that needs to happen once a disaster, sadly, has happened. As with compute nodes, we need to be sure that the failed site will not come online again, because there would be collisions on the network and writes going to the block storage or the object storage, since not all solutions are able to run an active-active implementation between data centers.
One of the first actions would probably be to power on the compute nodes: the controllers, or at least some of them, usually need to be always alive to receive the replication, but we don't want to keep the compute nodes powered on all the time doing nothing. The next step is to be sure that our databases are in read-write mode. As I said before, if it is master-slave, we probably need to promote the slave databases to master and remove the old master from the cluster to keep consistency; if it is master-master, we need to maintain the quorum, so we probably need to kick the failed nodes in the failed data center out of the cluster and make our replica database writable. A similar thing has to be done for the block storage: if it is distributed, we need to gather the quorum, because the split-brain problem here happens all the time; site one has failed, and the other one doesn't know whether it is the one in charge. The object storage is the same: gather the quorum, kick the failed nodes out of the cluster, and probably even decrease the number of replicas, because replication will restart after the failure and would otherwise fill up the space, and these kinds of problems. Then there is the real evacuation: if the architecture is done correctly, we can restart our virtual machines in the recovery site. The network part is the last step we need to take care of: our floating IPs, for example, are routed through the failed site, and we need to do something to move them to the recovery site.

This is how we are implementing it. You can see there all the replication that goes through: our database, our block storage, our object storage. We even want to have a solution that can handle active-active sites, where each one receives the replica of the other one's data. Again, the limitation, and the starting point for implementing this, is the network: the delay is the main problem here that can limit the possible solutions. For the network failure, this is our idea: the floating IPs are not normal IPs but anycast addresses, and we can leverage, for example, exaBGP, a very useful tool. Through the freezer DR agent running on all the compute nodes, for every machine started there, the agent fetches the floating IPs of that machine and starts announcing them to our autonomous system. Once a disaster has happened and we are evacuating the data center, the agent running on the compute nodes of the recovery data center will detect the new IPs and start announcing them through the autonomous system, and the traffic will be redirected automatically.
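ExaBGP can run a helper program as a "process" and reads routing commands as plain text lines from that program's stdout, so the announcing side of the DR agent can stay very small. Here is a minimal sketch under that assumption; get_local_floating_ips and the next-hop address are hypothetical.

```python
import sys
import time

NEXT_HOP = "192.0.2.1"  # this node's routed address (example value)

def announce_floating_ips():
    """Announce each local floating IP as a /32 anycast route.

    Run under exaBGP as a helper process: every line written to stdout
    is interpreted by exaBGP as a command.
    """
    announced = set()
    while True:
        for ip in get_local_floating_ips():  # hypothetical: ask Neutron
            if ip not in announced:
                sys.stdout.write(
                    "announce route %s/32 next-hop %s\n" % (ip, NEXT_HOP))
                sys.stdout.flush()
                announced.add(ip)
        time.sleep(10)
```

After an evacuation, the same loop on the recovery site's compute nodes picks up the restarted VMs' floating IPs and announces them from there, which is what redirects the traffic.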
Thank you. If there are any questions, please go to the microphone; we have probably run a little bit over time.

Sorry, just one question. You mentioned that the engine is making the decision about which host needs to be rebooted, fenced, and so on. That is usually the tricky part, detecting properly and reliably which host is down and which host is up. As far as I understood, you basically rely on the Nova and Neutron service status, right?

We start from there, because, you know, both Nova and Neutron have continuous communication between, let's say, the server side and the client side. If they detect a host as down, we have issues, so it is at least the starting point: that node is not fully usable anymore. There could be virtual machines still running and working there, but we are for sure not able to spin up new virtual machines on it.

What I'm trying to say is that maybe this is not the most reliable way of detecting this, because there are projects like Pacemaker which do it, and we all know that the communication between nova-compute and the conductor is sometimes not perfect; sometimes it hangs, you have the status of the services as down, but the VMs are still up.

Definitely, I totally agree. We have been thinking about this for quite a long time, even about implementing Pacemaker or a similar solution, but from our point of view it is going to add more complexity: you have something more to maintain and manage, and when you scale out it is another component that you need to reconfigure, add, and take care of. So for sure Nova and Neutron are not enough for us to be sure that the node has failed; it is only the starting point, after which we go through a list of other checks, and we act only if all the checks say that host is down, for instance the power source is gone; you can go through ipmitool and see that the power is off.

So the plan is basically to complement this kind of Nova or Neutron status check with something more reliable?

It is a list of checks; we don't advise using only one but a list of them, and the idea is that only if all of them confirm the host has failed do we take action. One more thing: we want to use first the APIs from the OpenStack services that provide that information, because it is fundamental that we integrate with the other OpenStack services, and then on top of that we can use other sources, also because those services are more likely to be available. That was another choice, but that is definitely a good point.

Hi, two things. The integrity protection feature is really impressive. In the recovery part, when you recover the virtual machines on the recovery site, it seems you are re-spinning the VMs with brand new UUIDs, for the VM, the Cinder block, the Neutron ports, everything, right? These are brand new UUIDs; they are not retaining the old UUIDs from the crashed ones.

They should retain them, because the idea there is to use the Nova evacuate feature, which does a good job of maintaining the right IPs, floating IPs, and UUIDs. It is simply an evacuation of a failed node, and Nova will maintain all the IDs.

OK, how about the Neutron ports?

It is supposed to do the right thing, maintain them, and recreate the Open vSwitch plumbing in the right way.

That was my concern, because if there is brand new stuff, somebody has to go clean up the old stuff in the DBs.

Definitely. The most challenging part will probably be restoring the failed site when we want to bring it back, because then we even need to re-synchronize everything the other way around, and that is probably more challenging than managing the failure itself, managing the restoration of the standard situation. But you know, we didn't want to reinvent the wheel: there are a bunch of guys working very heavily on host evacuation and live migration in Nova, and we would really like to use their efforts and not reinvent everything.

Thank you. One more thing: we waited for the summit to finalize the DR blueprint, so the blueprint is there with the requirements.
Any improvement, anything that you think is wrong, just go there, put a minus one, a minus two, anything you need, and let's do it. Any other questions? OK, it was honestly a pleasure. Thanks for your patience, and thank you.