Hello, everyone. I'm going to talk about our experience with storage management on Kubernetes, specifically about optimizing storage assignment by using the Kubernetes scheduler.

First, let me introduce myself. My name is Kenji Morimoto, and I work for Cybozu. Cybozu is a Japanese IT company; we build groupware and communication tools provided as cloud services. I work as an engineer on our cloud services. Around 2,000 people are involved in these services. I work with Kubernetes and cloud-native technologies.

Now let me introduce today's agenda. I'll recap the basics of volume management in Kubernetes and define the problem. I'll describe a basic idea for the problem and then inspect the troubles we ran into during implementation. We got an unexpected result under a real-world disturbance, so we tuned the kube-scheduler configuration. Finally, I'll give a demonstration of how our tuning improves the placement.

First of all, let's take a look at what a distributed storage system is. A distributed storage system organizes node-local storage devices in a computer cluster and provides a unified storage resource for users. In the case of Ceph, for example, the administrator defines object storage devices, or OSDs for short, using local disks, and Ceph gathers the OSDs into a single storage resource. Managing the individual disks in a distributed storage system is very tedious work.

Before looking at Rook and Ceph on Kubernetes, let's recap the storage architecture of Kubernetes. On Kubernetes, storage is abstracted as persistent volumes (PVs). The storage can be local disks, network storage in the cluster, or cloud storage. Pods consume these volumes by referencing persistent volume claims (PVCs). Persistent volume claims specify their requirements for the underlying storage through storage class names, resource requests, selectors, and so on. Kubernetes finds or creates a matching PV for a PVC and binds them together.

We use Rook to deploy a Ceph storage cluster. Rook has a mode to configure OSDs through persistent volume claims: Rook configures the acquired PVs as OSDs for Ceph, and Ceph then constructs a unified storage resource out of them.
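To make this PVC-based flow concrete, here is a minimal sketch of a persistent volume claim of the kind Rook could create for one OSD. The claim name, storage class name, and size are illustrative assumptions, not our actual manifests.

  # A hypothetical PVC requesting one local disk for a single OSD.
  # Name, storage class, and size are assumptions for illustration.
  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: osd-data-0
  spec:
    accessModes:
      - ReadWriteOnce
    volumeMode: Block              # OSDs typically consume raw block devices
    storageClassName: local-storage
    resources:
      requests:
        storage: 1Ti

Kubernetes then has to find a PV backed by one specific local disk for each such claim, and which disk is chosen is exactly what we want to control.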
What we wanted to do was to deploy a distributed storage system on our on-premise servers. We had learned from experience that managing thousands of servers and disks is an awful task if done manually, so we needed Kubernetes' help for automatic management. However, there was no standard way to deploy a distributed storage system on Kubernetes.

We can expect Ceph to be responsible for replicating data across failure domains for robustness. On the other hand, distributing the local disks evenly is a task for the administrators. So the challenge here is that we need to distribute the persistent volumes for local disks evenly through persistent volume claims.

It's easy to achieve even distribution if all disks, all servers, and all racks are evenly available. Let's take six disks evenly from these: it's easy. But in reality this is often not true because, for example, a server may have broken disks, some disks of a server may already be assigned for other uses, or a rack may have fewer healthy servers than other racks. So in the real world we need to consider the uneven availability of local disks, and this makes our challenge a hard one.

In order to achieve even distribution automatically, we need the help of Kubernetes. However, Kubernetes does not directly schedule storage devices. In contrast to storage handling, Kubernetes provides a very rich set of features to schedule pods. kube-scheduler is the Kubernetes component that schedules pods onto nodes. It can schedule pods based on several criteria, including resource requirements, node selectors, pod affinity and anti-affinity, and taints and tolerations.

Now let's jump into the basic idea. Our basic idea is to specify WaitForFirstConsumer as the volume binding mode in a storage class. But what is the volume binding mode exactly? It can take a value of Immediate or WaitForFirstConsumer. Let's see the behavior of Kubernetes for each mode in the following slides.

The volume binding mode affects the timing at which the binding of a PVC and a PV is determined. The default volume binding mode is Immediate. In this mode, when a PVC is created, Kubernetes immediately finds or creates a matching PV and binds them together. This binding then works as a topology constraint for kube-scheduler when it schedules the pod that uses the PVC. In our case the PVC is bound to a local disk, so the consumer pod is effectively bound to that disk's node. The problem here is that we cannot control the matching of a PVC and a PV in terms of even distribution.

The other volume binding mode is WaitForFirstConsumer. In this mode, a PVC is not bound to a PV until a pod that uses the PVC is scheduled, and we can control pod scheduling by several means. When binding a PVC to a PV, there is a constraint that the PV must be accessible from the pod's node. As the PVs are node-local in our case, scheduling a pod is equivalent to scheduling a PV. So we can control the location of local storage through pod scheduling.

Let's go back to our basic idea. By specifying WaitForFirstConsumer for the volume binding mode, we translate the problem of storage allocation into the problem of pod scheduling. And for pod scheduling, Kubernetes provides a rich set of features that we can utilize. Now our challenge becomes distributing pods with PVCs evenly.

Well, which type of scheduling criterion should we use to distribute pods evenly? One candidate is anti-affinity. We can distribute one pod per node by using anti-affinity, but we want to use multiple PVs on one node, and anti-affinity does not distinguish whether a node has two pods, three pods, four, five, and so on. A more appropriate criterion is pod topology spread constraints. This feature was introduced in Kubernetes 1.16, became beta in 1.18, and is now stable in 1.19. Pod topology spread constraints compute the scheduling score based on the skew, so we can put a cap on the difference in the number of pods.

This figure shows how pod topology spread constraints work; it is cited from the Kubernetes blog. You specify maxSkew to describe the degree to which pods may be unevenly distributed, topologyKey to group nodes into domains by node labels, whenUnsatisfiable to indicate what to do if the constraint cannot be satisfied, and labelSelector to find the target pods. In this example, maxSkew is 1, zone 1 has two pods, and zone 2 has one pod, so the skew between the zones is one. Now here comes a new pod. If this new pod is scheduled to zone 1, the skew becomes two, which breaks the constraint. If the pod is scheduled to zone 2, the skew becomes zero, which satisfies the constraint. So the new pod should be scheduled to zone 2. As a result, we can achieve even pod distribution.
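To make the idea concrete, here is a minimal sketch of the two pieces just described: a storage class that waits for the first consumer, and a pod carrying a topology spread constraint like the zone example above. The names, labels, provisioner, and image are illustrative assumptions, not our actual manifests.

  # Sketch 1: a storage class that delays PVC/PV binding until a consumer pod is scheduled.
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: local-storage
  provisioner: kubernetes.io/no-provisioner   # typical for statically provisioned local disks
  volumeBindingMode: WaitForFirstConsumer
  ---
  # Sketch 2: a pod with a topology spread constraint, as in the zone example.
  apiVersion: v1
  kind: Pod
  metadata:
    name: example-pod
    labels:
      app: storage-management
  spec:
    topologySpreadConstraints:
      - maxSkew: 1                                 # allowed difference in pod counts between domains
        topologyKey: topology.kubernetes.io/zone   # nodes sharing this label value form one domain
        whenUnsatisfiable: DoNotSchedule           # reject placements that would break the constraint
        labelSelector:
          matchLabels:
            app: storage-management               # only pods with this label are counted
    containers:
      - name: main
        image: k8s.gcr.io/pause:3.2

With such a storage class, the PVC stays pending until its pod is scheduled, so the spread constraint on the pod effectively decides which node's local disk is used.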
We can now distribute pods evenly. But when implementing the distribution, we need to consider the case where the constraints are not satisfiable. Let's see the details.

A strict constraint may not be desirable in the real world. As described before, there may be uneven availability of local disks: a server may have broken disks, some disks of a server may already be assigned for other uses, or a rack may have fewer healthy servers than others.

In this example, there are six servers, and each server has four disks. Server 3 has broken disks: it has one healthy disk and three broken disks. Let's start the assignment. Lap 1: I can assign six pods with six disks, one per node. Lap 2: I can assign five pods with five disks; as you can see, server 3 now has no available disk. I cannot start lap 3, because assigning a pod and a disk will break the constraint no matter which server I choose. Do I need to stop the assignment here? That's not desirable. I want to use as many disks as I can.

We can relax pod topology spread constraints by using the whenUnsatisfiable parameter, which indicates what to do if a pod does not satisfy the constraints. According to the official documentation, the behavior is described like this: if the constraints are not satisfiable and DoNotSchedule is specified, kube-scheduler does not schedule the pod; this is the default behavior. If the constraints are not satisfiable and ScheduleAnyway is specified, kube-scheduler still schedules the pod while prioritizing nodes that minimize the skew. ScheduleAnyway seemed optimal for us if it really worked as advertised. So we tried ScheduleAnyway in order not to limit resource usage due to uneven local disks.

We expected behavior like this: if the constraints are satisfiable, kube-scheduler always schedules the pod within the constraints. Sounds obvious? Unfortunately, it did not work as expected.

Let's start with several pods already running in the cluster. The red-labeled pods are running, but not on all nodes: four nodes are used and the other two nodes are completely free. The existing pods consume CPU resources to some extent, say 60%, and there is enough room for the new pods we are now distributing; each new pod will consume 5% of the CPU. The constraint we use here is a simple one: keep the skew less than or equal to 1 among all nodes. Please note that this constraint is not applied to the existing red pods. We applied the constraint in order to distribute the storage management pods evenly; the existing pods are just computing and are not working for storage management.

Then we deploy six pods for storage management. The expected placement is like this: because the constraint is satisfiable, the six pods would be distributed evenly and the skew would be zero. But the actual placement is like this: all the new pods are assigned to the two unused nodes, and the max skew is three. This breaks the constraint even though it seems satisfiable.

Why are the new storage management pods scheduled in such a way? We inspected the source code of kube-scheduler and found the actual behavior. What we expected had two if-clauses: if satisfiable, do this; if not satisfiable, do that. The actual behavior is the same whether the constraints are satisfiable or not: kube-scheduler does not treat the spread conditions as hard constraints. Instead, the conditions are treated as part of the scoring factors. As a result, flattening CPU resource usage took a higher priority, and even though the spread constraint was satisfiable, the existing computing pods prevented even distribution.

This behavior comes from the prioritizing algorithm of kube-scheduler, so we needed to tune it. The most important criterion in pod scheduling for our case is the topology spread constraint, so we tuned kube-scheduler to weigh the constraints more heavily. Because kube-scheduler is evolving rapidly, the tuning configuration differs between versions.

For Kubernetes 1.17, we adjusted the scheduling policy. There is a weight parameter named EvenPodsSpreadPriority, and its default value is 1; we increased the weight to 500. The scheduling policy is a global configuration, so this modification requires extreme caution.

In Kubernetes 1.18, the feature of scheduling profiles was introduced, and kube-scheduler can now handle multiple profiles. We can create a new profile and apply it only to the storage management pods. Among the parameters in the profile, we set the PodTopologySpread weight to 500 for Kubernetes 1.18. In 1.19, the default parameters are tuned slightly differently, so disabling the NodeResourcesBalancedAllocation plugin is suitable for us. I attached our tuning configurations to this slide; please check them via the link.
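As a rough sketch of what such profile-based tuning can look like (not our exact configuration; the profile name is an assumption and the API version shown is the Kubernetes 1.19 one), a dedicated scheduler profile can weigh the PodTopologySpread score plugin much more heavily:

  apiVersion: kubescheduler.config.k8s.io/v1beta1
  kind: KubeSchedulerConfiguration
  profiles:
    - schedulerName: default-scheduler            # keep the default profile untouched
    - schedulerName: storage-spread-scheduler     # hypothetical profile for storage management pods
      plugins:
        score:
          disabled:
            - name: PodTopologySpread             # drop the default-weight entry ...
          enabled:
            - name: PodTopologySpread
              weight: 500                         # ... and re-enable it with a heavy weight
          # On 1.19, disabling NodeResourcesBalancedAllocation here is another option,
          # since the default weights changed in that release.

Storage management pods would then opt in to this profile by setting spec.schedulerName to storage-spread-scheduler, leaving every other pod on the default profile.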
Now I'll give a demonstration. I prepared two environments: one with the default kube-scheduler and the other with the tuned kube-scheduler. Let's see the default one first. This is the non-tuned environment with the default kube-scheduler; this is the demonstration environment.

Well, first let's look at the nodes. There are four nodes here, and two computing pods are running. They are running on two nodes and consume CPU resources. On this cluster I deploy a distributed storage system, Rook and Ceph. I have already deployed them, because it takes too long otherwise. There are many pods running for Rook and Ceph, including the Rook operator, the manager, the monitors, and so on. Persistent volumes are assigned to the OSD pods; I requested five persistent volumes through five pods and five claims, so there are five OSD pods and five PVCs.

Well, let's see where the OSD pods are running along with where the computing pods are: five OSD pods and two computing pods. As you can see, the five OSD pods are distributed to only three nodes, although there are four available nodes. It seems that the OSD pods avoided coexisting with the computing pods. This is because kube-scheduler puts a higher priority on flattening CPU resource usage, and this is not desirable for a distributed storage system.

Now let's move to the other environment, where the tuned kube-scheduler is running. Let's see where the OSD pods are running along with where the computing pods are: five OSD pods and two computing pods. The five OSD pods are distributed well across the four nodes, and some of them coexist with the computing pods. This is what I wanted to achieve.

Let's wrap up with the key takeaways. In order to deploy a distributed storage system using Rook and Ceph on our on-premise cluster, we translate the problem of local storage distribution into a pod scheduling problem using the WaitForFirstConsumer volume binding mode. For pod scheduling, we use pod topology spread constraints for better distribution. In order to cope with kube-scheduler's scoring, we tune it to prioritize the constraints. Our configuration for Rook and Ceph is open source on GitHub; please take a look. That's all. Thank you.