Ok. We started to plan the architecture in the middle of last year. We wanted to build a cloud architecture that should scale out by, say, an order of magnitude or more. We compared what was available and settled on Ceph, because it has the richest feature set compared to anything else; in fact, if you try to choose anything else with the same data placement capabilities, you can't, because no other distributed file system gives you that kind of control over placement.

Then we needed to choose a hypervisor. Of course it was KVM, because of the fine-grained tuning via cgroups; Xen had a lot of problems before the 4.2 release, so we ruled it out, unfortunately. We also wanted to build our cluster on relatively cheap hardware, Supermicro or similar, and we built our storage network on InfiniBand, because a storage network can stay flat, it doesn't need routing, at least at the scale of a rack, so it's a safe choice there.

One should also keep in mind that things that are acceptable in a private cloud, say scheduled downtime, are completely unacceptable in a public one. We can't shut the cluster down on a Sunday night for maintenance, even for just one hour; it would not meet the downtime criteria. A public cluster should be very reliable, because regular hardware fails in regular hardware ways and brings down all the VMs on that node. We can't allow that, and Ceph helps us here.

So why did we choose Ceph? Because it allows custom data placement: you can group data into buckets, you can place VMs in those buckets, and you can pin those buckets to a set of physical nodes, say a rack, a subset of a rack, a superset of racks, whatever. And you have completely free choice about where the replica copies of your data go: you may place them in the same rack or in another rack, and if a rack suddenly dies you can relaunch all its VMs somewhere else within minutes without expecting any further problems. Ceph also has a very advanced QEMU driver which is, say, capable of caching and capable of customization, and we did some customization of the cache. Unfortunately the Ceph mailing list rejected it, because it merely improved latency and was not an architectural improvement.

For the first version of our cluster we chose flat networking with VLAN isolation, because it's quite simple; the same mechanism is used in OpenStack Nova. By now it is of course obsolete and we have dropped it too. One good point was that we could hold the entire configuration inside libvirt: the virtual machine config contains all the necessary elements, so we don't need any additional actions when we launch a VM. We simply start it and it works, isolation included, and no other client is able to see its traffic. The method has its disadvantages: you cannot scale a segment up to, say, 2,000 or 3,000 VMs, because they will drown in broadcast messages, and your switch probably won't support serving that many VLANs at a time.
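To make the earlier point concrete, that the libvirt config carries everything, here is a minimal sketch using the libvirt Python binding: the domain XML holds the RBD-backed disk and the per-client VLAN bridge, and launching is just define plus create. The monitor host, bridge name, and image name are invented for illustration.

```python
# Minimal sketch: a self-contained libvirt domain whose XML already carries
# storage (RBD) and isolation (per-client VLAN bridge). Names are invented.
import libvirt

DOMAIN_XML = """
<domain type='kvm'>
  <name>client-vm-042</name>
  <memory unit='GiB'>1</memory>
  <vcpu>1</vcpu>
  <os><type arch='x86_64'>hvm</type></os>
  <devices>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <source protocol='rbd' name='rbd/client-vm-042-disk'>
        <host name='mon1' port='6789'/>  <!-- invented monitor host -->
      </source>
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='bridge'>
      <source bridge='br-vlan101'/>  <!-- per-client VLAN bridge -->
      <model type='virtio'/>
    </interface>
  </devices>
</domain>
"""

conn = libvirt.open("qemu:///system")
dom = conn.defineXML(DOMAIN_XML)  # persist the config; nothing else to set up
dom.create()                      # start the VM; isolation comes along with it
```

Nothing else has to be orchestrated at launch time; once the XML is defined, the isolation travels with the domain.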
For the first version we also selected collectd, because it has monitoring plugins for Ceph, for libvirt, and for a lot of system parameters, so collectd seemed the preferable choice at the time. And of course, as I mentioned before, the KVM/QEMU hypervisor was chosen because of fine tuning and because of the community, and we had a lot of experience with Xen and wanted to try something new.

So it seemed we just needed to write some orchestrator logic and some UI and start working. That's the OpenStack way, and here is why it may not work. Right now the situation is quite different, but a year ago we had a lot of problems. Ceph's first stable release at the time did not behave correctly in intermediate states: we were not able to rebuild the cluster, to change data placement, without hard hiccups in client input/output, so we postponed production for that reason. And the RBD driver for QEMU had, as I remember, at least three different memory leaks; they were reported to the Ceph list and fixed, but that takes time, and we lost another month or two before all those problems were fixed and pinned down. At that time we were also using libvirt 0.9.8 or 0.9.11, which did not yet have the support we needed (it came with 0.10), and there was no way to shape incoming or outgoing traffic properly on a virtual machine's port. So we postponed over these two things as well.

Eventually we launched the first version of our orchestrator. It works quite differently from OpenStack, CloudStack, or any other existing cloud orchestrator: we put everything related to local node management into one local agent, to prevent the synchronization issues that are taking place in OpenStack, for example, and all operations, guest agent messages, changing cgroup parameters, migration, start or shutdown, pass through a single instance. So there is a point in doing things exactly this way; this is not a modern design, but it has a lot of advantages. Of course libvirt already has the infrastructure to support any kind of limits, so we can use the CPU quota for limiting, cpusets for pinning, and cpuacct for accounting ticks for billing users. The statistics simply come out of libvirtd and we just need to use them; it's quite simple.

We also had a problem with the same RBD driver: VMs eventually leak memory, and we were not able to fix that properly. So we introduced a delay mechanism, as Google did in 2007, as I remember: when a VM triggers an out-of-memory event inside its cgroup, we immediately move the VM to the freezer cgroup, then add some memory above the limit, then unfreeze it, and then immediately migrate it. The leak is effectively reset and we can keep using the VM; it will not die. It's quite ugly, but it works for now, and it took a lot of time to implement exactly this feature. And of course we do some statistics preprocessing on the local agent, because, say, 1,000 VMs generate about 5,000 statistics events per second, and no relational database can hold that properly. We aggregate over, say, a 10-second interval, but we still need precise statistics, so we do this preprocessing ourselves.

We heavily use the QEMU guest agent with some patches of ours that were rejected by the community: guest file-open, file-exec, and file-close additions. With them we are able to launch any command inside the VM and to read or write any files inside the VM. The first use, which is very handy, is snapshot creation: you need to freeze the guest filesystem before creating a snapshot, even if the snapshot is atomic, and then you make the snapshot, in our case in Ceph.
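A minimal sketch of that freeze, snapshot, thaw sequence, shelling out to virsh and the rbd CLI (the domain and image names are invented, and error handling is kept minimal):

```python
# Freeze the guest FS via the qemu guest agent, snapshot the RBD image,
# then always thaw, even if the snapshot fails.
import json
import subprocess

def agent(domain: str, command: str) -> None:
    """Send a QMP-style command to the qemu guest agent through libvirt."""
    subprocess.run(
        ["virsh", "qemu-agent-command", domain, json.dumps({"execute": command})],
        check=True,
    )

def snapshot(domain: str, image: str, snap: str) -> None:
    agent(domain, "guest-fsfreeze-freeze")      # quiesce guest filesystems
    try:
        # The RBD snapshot itself is atomic, but without the freeze the
        # guest filesystem would see a crash-consistent image at best.
        subprocess.run(["rbd", "snap", "create", f"{image}@{snap}"], check=True)
    finally:
        agent(domain, "guest-fsfreeze-thaw")    # always unfreeze the guest

snapshot("client-vm-042", "rbd/client-vm-042-disk", "before-upgrade")
```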
The Ceph snapshot is atomic, but it still breaks the filesystem if you don't issue the freeze call first; then you thaw the guest back, and that's exactly what the guest agent is made for. We can also change the network configuration when we clone a VM: we simply clone the VM inside the Ceph abstractions, that is, we take a snapshot, then flatten, and a new VM appears. Ceph has such a mechanism and it's a very important thing: done any other way, cloning would take, say, tens of minutes. Here we just write the settings and fire up the new VM, and it takes about 10 seconds. No other cloud orchestrator has this feature, which is strange; hardly anyone leans on Ceph like this, not even CloudStack.

Of course any upgrade, say an upgrade of a physical node, a kernel upgrade, an emulator upgrade, can be done via migration. We can do seamless migration with Ceph, because first of all we have live migration, and Ceph is distributed storage that is reachable from any point. We just need to push a button, wait a little, and the node will be free; then we can take it down and do whatever else we need. We also do automatic migration based on node metrics, to prevent resource exhaustion. It's an automatic feature: when the orchestrator sees that one of the parameters, CPU or memory, is close to its limit, the greediest VMs can be migrated away to free nodes, so the node will not die when a user adds some memory just above the current limit. And we chose rack-based management granularity, because there is no point in managing at a sub-rack scope, and a larger scope has the problem that the amount of statistics flowing into the relational database grows too large; hence our statistics preprocessing.

What we tried and then threw out: pinning according to the NUMA topology. It has a very small effect on regular user applications; a regular web server or a small database simply does not rely on NUMA, so we could drop it without any problem. A custom TCP congestion protocol only helps if you have problems with a long-distance link; for short links any protocol will work. We tried CUBIC and Hybla, and DCTCP at most, and for a storage network, if you have fat short links, it doesn't matter. We also built and then threw out automatic memory shrinking and extending based on guest feedback. We plan to introduce something different later, but this version was a kind of failure: we cannot extend memory after we have already seen a failed memory allocation event. That's quite bad, the application has already failed by then, and that seems to be the core problem. Nobody wants CPU hot-plug or hot-remove either, because the quota is enough; there is a very small set of applications that need one powerful core, except, say, per-core licensed databases. And as I said about NUMA passthrough, we simply don't want it.

Those are the main performance issues we faced just before launch. As I said, we simply turned NUMA off; it doesn't matter for most regular applications. Transparent huge page usage would be fine, it reduces the number of context switches and reduces overhead, but there is one problem: it was not working with the virtio balloon at the time, so we simply cannot use it with resizable memory. The IDE driver was also nice, it was completely finished and it even supports discard, but it tops out at only 70 megabytes per second, so we can't use it if we plan to grow in the future. So we rolled back to the virtio driver, which is quite simple (anyone who knows the architecture knows virtio-blk is the simpler design), and it still doesn't support discard; only virtio-scsi does.
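Going back to the clone mechanism for a moment, a minimal sketch of it with the rbd CLI might look like this (image and snapshot names are invented; in practice we also rewrite hardware identifiers and network settings before the first boot):

```python
# Clone a VM image in seconds via Ceph RBD copy-on-write cloning.
import subprocess

def run(*args):
    subprocess.run(args, check=True)

parent = "rbd/golden-image"
snap = f"{parent}@base"
child = "rbd/client-vm-043-disk"

run("rbd", "snap", "create", snap)      # atomic snapshot of the template image
run("rbd", "snap", "protect", snap)     # clones require a protected snapshot
run("rbd", "clone", snap, child)        # copy-on-write child, appears instantly
# The new VM can boot from `child` right away; flatten detaches it from the
# parent in the background so it no longer depends on the snapshot.
run("rbd", "flatten", child)
```

Because the flatten runs in the background while the new VM is already serving, the clone is ready in about 10 seconds from the user's point of view.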
As for Ceph itself, we were forced to move away from btrfs because of its state. It was quite unstable, and unfortunately it remains unstable, even though it has an almost zero-cost snapshot mechanism, so you could roll snapshots back and forward under Ceph without any additional overhead in the file system; it's an atomic operation there. So we threw it out, unfortunately. At the time of release we had kernel 3.6, and all the 3.x kernels after 3.6, as far as I know, behave very erratically under Ceph and heavy load, so it was a lot of work to pin down exactly that bug; it's a kind of soft lockup, and we threw those kernels out. Ceph, even in Bobtail times, had many leaks, and we are forced to restart daemons occasionally; that behavior causes no problems as long as you keep an eye on their memory consumption. And Open vSwitch of the kernel 3.6 era could be knocked over with literally 400 requests per second, especially by a SYN attack from different IP addresses, so that was another problem.

After we went into production we got a new kind of problem. There were users who wanted to scale their VMs by an order of magnitude using the balloon mechanism; they don't care about the mechanism, they simply want to scale their VMs. The kernel reserves structures for all of the memory up front, and then the balloon driver loads and says how much memory the kernel should actually expose to userspace. So for a 1 GB VM which is able to scale to 60 GB, the overhead is about 300 MB, which is quite unacceptable. This overhead can be reduced using the kernel samepage merging mechanism, so we tried a lot of variants and stayed only with regular KSM, which at least is not crashing immediately; UKSM brought us memory allocation failures, so we decided not to use it at all. Despite this problem we still want online memory addition and removal, and we implemented it: we took ACPI memory hotplug and changed it a little to make it work. It requires some changes on the guest side and in the emulator itself; unfortunately those changes have not been accepted into the mainline, so we maintain them on our own. There are a number of memory slots on the QEMU command line, each of which is pluggable, so we can plug in a slot of memory, bring it online, and take it offline again. And recently the Linux kernel gained a mechanism to offline memory properly during operation, so we can take memory offline without rebooting the virtual machine.

As you can see, Ceph is very awesome in terms of features, and the third release, called Cuttlefish, brought a lot of features and extended the handling of the intermediate state called peering, so we don't get stuck in peering for minutes anymore while the cluster decides which data should be placed where. The only remaining problem is fine tuning: a lot of changes have to be made to the default config just to make it work, the number of different queues, the filestore queues (anyone familiar with Ceph will understand me), peering priority, recovery priority and so on. At the end of such tuning you will have a really distributed, resilient system with a second or two of downtime between any state changes, as long as you are not discarding more than n minus one copies of the data. There are also some assumptions a Ceph user should know about: it requires very precise time synchronization, and it depends on link bandwidth, so nobody should try to put Ceph on gigabit links except for testing. We are using quad data rate InfiniBand, 40 gigabits, and it was enough for about one gigabyte per second of linear reads and writes at the same time. And of course we had to combine the storage and compute roles to reduce overhead on the physical nodes; we cannot afford to isolate compute nodes, that is, to keep diskless compute-only machines in one place and storage nodes, machines with a lot of disks but limited memory resources, somewhere else. So combining seems to be the preferable strategy.
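Since the storage daemons and the guests share a node, the daemons get capped with the same cgroup mechanisms as the VMs. A minimal sketch against the cgroup v1 interface (the group name, PID, and limit values are invented assumptions; it needs root and a mounted v1 hierarchy):

```python
# Cap a co-located ceph-osd with cgroup v1 so it cannot starve the guests.
import os

CG = "ceph-osd"                                   # invented group name
MEM = f"/sys/fs/cgroup/memory/{CG}"
CPU = f"/sys/fs/cgroup/cpu/{CG}"

for path in (MEM, CPU):
    os.makedirs(path, exist_ok=True)              # mkdir creates the cgroup

def write(path, value):
    with open(path, "w") as f:
        f.write(str(value))

write(f"{MEM}/memory.limit_in_bytes", 4 * 1024**3)  # hard cap, e.g. 4 GiB
write(f"{CPU}/cpu.shares", 256)                     # low weight vs VM groups

osd_pid = 12345                                     # PID of the ceph-osd daemon
write(f"{MEM}/tasks", osd_pid)                      # move the daemon into the group
write(f"{CPU}/tasks", osd_pid)
```

The hard memory cap protects guests from an OSD leak, while the low CPU weight only matters when the node is actually contended.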
The only thing we have to watch is potential resource exhaustion, which can be avoided using the same cgroup mechanisms we already exploit for the VMs themselves, as sketched above. And the most problematic part is client input/output latency. Rotating media gives a very low ceiling on operations per second, about 70, which cannot be raised without some kind of cache. Controller cache raises this to about 100 operations per second, which is probably still not enough: we would see raised read latency once we have a lot of clients, say 10 clients per disk doing 5 reads each, and that would bring our cluster almost to its knees. We needed to cache somewhere else, so we use dm-cache; as far as I know, ours is the first deployment of this kind in the world. dm-cache is pluggable, it's in-tree, it's open source, and we may freely modify its policies, which of course we did. It helped us a lot: we reduced latency for 99% of our requests by an order of magnitude, from 100 milliseconds to 10 milliseconds. That's almost SSD latency, which is quite good. And today we have a ceiling of about 100 milliseconds, so no read will take more than 100 milliseconds just to start being served; if the read is big it will of course take longer to answer the client, but this is quite good in terms of meeting the requirements for distributed storage.

A few words about Ceph tuning. You should tune operation priorities first: recovery priority should be pushed down and client priority should be put at the highest level, so they don't interfere and user operations stay at the highest priority. And as you grow the cluster you will face different kinds of problems: on a small cluster of, say, 5 nodes you will see the bandwidth problem first, and on a bigger one you will hit the IOPS limit first. And the bigger the cluster is, the less time its internal peering operations take to complete, and the shorter the freezes of client operations will be.
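To make that priority tuning concrete, here is roughly the shape of the ceph.conf fragment involved, generated with Python for consistency. The option names are from the filestore-era releases we ran, and the values are illustrative assumptions, not recommendations; sane values vary a lot between releases and workloads.

```python
# Sketch of filestore-era ceph.conf tuning. Values are invented examples.
from configparser import ConfigParser

tuning = {
    "osd client op priority": "63",    # keep client I/O at the top
    "osd recovery op priority": "1",   # push recovery work to the bottom
    "osd recovery max active": "1",    # fewer parallel recovery ops per OSD
    "osd max backfills": "1",          # throttle backfill during rebuilds
    "filestore queue max ops": "500",  # deeper filestore queue than default
    "filestore op threads": "4",       # more filestore worker threads
}

conf = ConfigParser()
conf["osd"] = tuning
with open("ceph.conf.fragment", "w") as f:
    conf.write(f)                      # emits an [osd] section in INI form
```

The two priority options are the important ones: client operations stay at the highest level while recovery is pushed down, which is exactly the behavior described above.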
So, what about choosing the network platform? We want to use rsockets support for Ceph: rsockets sits on RDMA, so it has lower overhead than regular IP over InfiniBand. It is not ready yet; we are trying to use it and we are actively working on it. Until we need to migrate VMs across racks we can stay on a flat storage network, as I mentioned before, so we can stay on InfiniBand, which does not scale the way a regular Ethernet network scales. For the Ethernet segment we chose hardware that builds the current network and will remain when we have completely switched over to OpenFlow. None of the famous vendors has a switch with completely open firmware; they only offer an OpenFlow layer with some version compatibility. So we chose Pica8, because it has completely open firmware, excluding the Broadcom blob, and it runs Open vSwitch, which can be put into OpenFlow mode or work standalone; and of course that helped us a little in building our current network.

So why did we choose OpenFlow, why are we switching to it? Because in a standalone switch OS you cannot filter a per-port set of MACs properly: you can set a filter per port, but when you try to put the same rules on a switch that sits above a couple of VM hosts, those rules become difficult; you may not know which MAC address is asking, what you should answer to it, or whether you should drop it. This means the public segment can be filtered only using iptables or the like on the hosts, which is a problem we need to eliminate. And without that filtering there will be a lot of broadcasts, which may also stop our scaling at, say, 1,000 VMs. We also need traffic pattern analysis to prevent potential spam, because we don't want to be put on blacklists: when a user sends, say, 10,000 emails, we can check whether it's a spammer and stop it before the abuse happens, and OpenFlow may help us do this.

So we chose to switch to OpenFlow at once. That was quite simple, because we already have the VM-to-VM relations: if two VMs belong to the same client, then a packet from the local network of one VM can be forwarded to the local port of the second VM. We also do MAC filtering and IP filtering with OpenFlow, so a packet that by our logic does not seem to belong to its VM will not even pass the input port filter. We chose Floodlight because it had largely been absorbed into OpenDaylight, and it was simpler to provide an extension to it for our constraints than to adopt OpenDaylight itself; POX or NOX are hard to use because of their current architecture. We can also build something like cross-IDC overlay networks in the future with very small effort; we just need to add some logic on top of the current one, and that should be no problem. We still have not finished multiple controller synchronization: it means that Open vSwitch can be pointed at a set of controllers, and in case of a failure it switches to another one and the same rules will be pushed.

And now a minute of thinking about the current state of software-defined networking and software-defined storage. Since we can move a VM anywhere inside our data center, do we really need to build overlay networks for VMs inside the same data center? We can instead reduce the distance and make regular layer-2 traffic work, not the overlay kind. That's the question. Ceph itself allows you to take an atomic snapshot, as I said before, and we can freeze a virtual machine, replace its file descriptors on the fly, switch to the snapshot and unfreeze the VM, and it will work as with the original image; we did migration in this somewhat strange way, and we can then migrate the VM itself later, once we have migrated all the data. Right now we have only filtering and forwarding tasks in OpenFlow: filtering is done in the scope of one host, and later we plan to extend it to the hardware switch, so a large amount of broadcast traffic will be removed, because a broadcast from one port will be forwarded only to the ports known to contain other VMs of the same client. That will be the point of it.

So what do we have right now? No automatic ballooning yet, in Linux I mean. Automatic ballooning should never return ENOMEM, because it should add memory first: when an additional memory request comes in, it arrives on the slow path, we just dedicate some memory, the balloon extends, the slow path completes with success, and the application never fails. We plan to implement this mechanism ourselves in, say, half a year. And although in a recent kernel release the Linux community (not we) merged the ACPI memory shrinking mechanism, so we can mark a region of real memory as removable, offline it, and then remove it, it seems not to be working today. So no, it is not yet possible to remove memory from the guest without using the ballooning mechanism, and that's a key point too.
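For reference, the kernel interface the offlining path uses is the sysfs memory-block state file; a minimal sketch (the block number is invented, it needs root, and, as said above, the offline request can fail if the block contains unmovable pages):

```python
# Offline one memory block through sysfs (requires memory hotplug support).
BLOCK = "/sys/devices/system/memory/memory42"   # invented block number

with open(f"{BLOCK}/state") as f:
    print("before:", f.read().strip())          # typically "online"

with open(f"{BLOCK}/state", "w") as f:
    f.write("offline")                          # ask the kernel to migrate pages away

with open(f"{BLOCK}/state") as f:
    print("after:", f.read().strip())           # "offline" only if it succeeded
```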
Transparent huge pages, despite giving some performance boost, do not work with the balloon, and, what is more important, they do not work with memory overcommit: we can only allocate a predefined amount of huge pages and use that. You cannot allocate a lot of virtual memory, allow overcommit, and have transparent huge pages on top of it; that's an architectural problem, and hopefully it will be solved in the near future. As for distributions, no distribution with a stable package set (of course I don't mean Fedora, Arch Linux, or something similarly bleeding-edge), none of the regular distros, Precise, Red Hat and so on, contains a kernel new enough for online partition resize. So we cannot resize a partition without a reboot, and that is a kind of stopper for our work. We implemented online partition resize using the guest agent, with notification through QEMU, but a couple of distros still can't use it.

So what may we do in the near future, besides the things I discussed before? We want to build a federated network segment on top of the current one, put in two levels of silos, and move this entire construction to OpenFlow to make advanced traffic engineering possible, so we can reduce the amount of any-to-any cross-sectional traffic using only our software logic, nothing more. We also plan to complete the work on Ceph over rsockets, to put Ceph and Gluster on the same line here, because Ceph does not support RDMA in any way today. And we want to write, and try to push into the mainline, the ballooning mechanism I mentioned before, the one that never fails allocations when memory runs low.

Since I have some time, if there are any questions, here are some of my backup slides. This one is a kind of talk about VM scalability: we thought about distributed virtual machines at the beginning of our development. Virtual machines in the current logic can be converted fairly easily to work with distributed memory and distributed CPUs, but that is needed by a very specific kind of application, say NUMA-aware databases; even MySQL is NUMA-aware, but nobody would put it in such a large environment. So we decided not to continue the work on it; it's just one of the possibly needed features. The second point is about our clone mechanism: as I mentioned before, we can clone a virtual machine and launch it in a matter of seconds using the Ceph flatten mechanism, rewriting everything related to hardware identifiers and to the network. The point is to extend this to support at least CloudStack, because OpenStack will not rely on Ceph the way CloudStack does. And we are waiting for RDMA migration in QEMU, which came in QEMU 1.7 and is still very experimental; it will reduce migration time from the current seconds for offline migration to a second or two, and it will reduce the performance overhead of live migration. Such things are basically in place, but they will help only with specific kinds of workloads, not the regular ones: for spinning up very large VMs, yes; for putting just a bigger web server on it, it basically will not help. That's all, and I still have three minutes.

[Audience question, inaudible.] So, replication to a geographically remote site: yes, it's called georeplication; we simply need to set it up and it will work flawlessly in Cuttlefish and newer releases. [Audience question, inaudible.] In terms of memory bandwidth problems over your InfiniBand: obviously we can put in one long-range InfiniBand link; it exists, Mellanox is selling it right now, so it's not a problem at all. Okay, so, more questions? You're using dm-? No, we are not using the kernel drivers there; dm-cache is plugged in only for the filestore. No more questions? Then that's all, thank you.