Let’s explore how to limit the number of concurrent disruptions that your application experiences, allowing for higher availability while permitting the cluster administrator to manage the cluster’s nodes.
The most common use case is protecting an application specified by one of the built-in Kubernetes controllers:
- Deployment
- ReplicationController
- ReplicaSet
- StatefulSet
In this case, make a note of the controller’s .spec.selector; the same selector goes into the PDB’s .spec.selector.
You can also use PDBs with pods that are not controlled by one of the above controllers, or with arbitrary groups of pods, but there are some restrictions, described in Arbitrary workloads and arbitrary selectors.
Think about how your application reacts to disruptions
Decide how many instances can be down at the same time for a short period due to a voluntary disruption.
- Stateless frontends:
  - Concern: don’t reduce serving capacity by more than 10%.
  - Solution: use a PDB with minAvailable 90%, for example.
- Single-instance stateful application:
  - Concern: do not terminate this application without talking to me.
  - Possible Solution 1: Do not use a PDB and tolerate occasional downtime.
  - Possible Solution 2: Set a PDB with maxUnavailable=0. Have an understanding (outside of Kubernetes) that the cluster operator needs to consult you before termination. When the cluster operator contacts you, prepare for downtime, and then delete the PDB to indicate readiness for disruption. Recreate it afterwards.
- Multiple-instance stateful application such as Consul, ZooKeeper, or etcd:
  - Concern: do not reduce the number of instances below quorum, otherwise writes fail.
  - Possible Solution 1: set maxUnavailable to 1 (works with varying scale of application).
  - Possible Solution 2: set minAvailable to quorum-size (e.g. 3 when scale is 5). This allows more disruptions at once.
- Restartable batch Job:
  - Concern: the Job needs to complete in case of voluntary disruption.
  - Possible solution: Do not create a PDB. The Job controller will create a replacement pod.
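As a sketch of the stateless-frontend case above, a PDB that keeps 90% of the pods available could look like the following. The name `frontend-pdb` and the label `app: frontend` are hypothetical and stand in for your own controller's selector.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb          # hypothetical name
spec:
  minAvailable: "90%"         # at least 90% of the matching pods must stay available
  selector:
    matchLabels:
      app: frontend           # hypothetical label; copy it from the controller's .spec.selector
```

For the quorum-based case, the same shape would presumably apply with `maxUnavailable: 1` in place of `minAvailable`.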
Rounding logic when specifying percentages
Values for minAvailable or maxUnavailable can be expressed as integers or as a percentage.
- When you specify an integer, it represents a number of Pods. For instance, if you set minAvailable to 10, then 10 Pods must always be available, even during a disruption.
- When you specify a percentage by setting the value to a string representation of a percentage (e.g. "50%"), it represents a percentage of total Pods. For instance, if you set minAvailable to "50%", then at least 50% of the Pods remain available during a disruption.
When you specify the value as a percentage, it may not map to an exact number of Pods. For example, if you have 7 Pods and you set minAvailable to "50%", it’s not immediately obvious whether that means 3 Pods or 4 Pods must be available. Kubernetes rounds up to the nearest integer, so in this case, 4 Pods must be available. When you specify maxUnavailable as a percentage, Kubernetes rounds up the number of Pods that may be disrupted, so a disruption can exceed your defined maxUnavailable percentage. You can examine the code that controls this behavior.
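The rounding can be written out for the 7-Pod case. The symbol N (total number of Pods) is introduced here only for illustration; the second line applies the same round-up rule to a maxUnavailable of "50%", which is an assumed value chosen to mirror the minAvailable example.

```latex
\begin{align*}
\text{minAvailable} = 50\%,\ N = 7 &: \quad \lceil 0.5 \times 7 \rceil = \lceil 3.5 \rceil = 4 \ \text{Pods must stay available} \\
\text{maxUnavailable} = 50\%,\ N = 7 &: \quad \lceil 0.5 \times 7 \rceil = \lceil 3.5 \rceil = 4 \ \text{Pods may be disrupted}, \quad 4/7 \approx 57\% > 50\%
\end{align*}
```

In the second case only 3 of the 7 Pods are guaranteed to remain, which is how the effective unavailability ends up above the configured percentage.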
Specifying a PodDisruptionBudget
A PodDisruptionBudget has three fields:
- A label selector .spec.selector to specify the set of pods to which it applies. This field is required.
- .spec.minAvailable, which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
- .spec.maxUnavailable (available in Kubernetes 1.7 and higher), which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
Note: The behavior for an empty selector differs between the policy/v1beta1 and policy/v1 APIs for PodDisruptionBudgets. For policy/v1beta1 an empty selector matches zero pods, while for policy/v1 an empty selector matches every pod in the namespace.
You can specify only one of maxUnavailable and minAvailable in a single PodDisruptionBudget. maxUnavailable can only be used to control the eviction of pods that have an associated controller managing them. In the examples below, “desired replicas” is the scale of the controller managing the pods being selected by the PodDisruptionBudget.
- Example 1: With a minAvailable of 5, evictions are allowed as long as they leave behind 5 or more healthy pods among those selected by the PodDisruptionBudget’s selector.
- Example 2: With a minAvailable of 30%, evictions are allowed as long as at least 30% of the number of desired replicas are healthy.
- Example 3: With a maxUnavailable of 5, evictions are allowed as long as there are at most 5 unhealthy replicas among the total number of desired replicas.
- Example 4: With a maxUnavailable of 30%, evictions are allowed as long as the number of unhealthy replicas does not exceed 30% of the total number of desired replicas, rounded up to the nearest integer. If the total number of desired replicas is just one, that single replica is still allowed for disruption, leading to an effective unavailability of 100%.
In typical usage, a single budget would be used for a collection of pods managed by a controller—for example, the pods in a single ReplicaSet or StatefulSet.
Note: A disruption budget does not truly guarantee that the specified number/percentage of pods will always be up. For example, a node that hosts a pod from the collection may fail when the collection is at the minimum size specified in the budget, thus bringing the number of available pods from the collection below the specified size. The budget can only protect against voluntary evictions, not all causes of unavailability.
If you set maxUnavailable to 0% or 0, or you set minAvailable to 100% or the number of replicas, you are requiring zero voluntary evictions. When you set zero voluntary evictions for a workload object such as ReplicaSet, then you cannot successfully drain a Node running one of those Pods. If you try to drain a Node where an unevictable Pod is running, the drain never completes. This is permitted as per the semantics of PodDisruptionBudget.
You can find examples of pod disruption budgets defined below. They match pods with the label app: zookeeper.
Example PDB Using minAvailable:
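A minimal manifest consistent with the surrounding text might look like this. The object name `zk-pdb` and the label `app: zookeeper` come from the text; the value `2` is an assumption, chosen so that the budget tolerates one disruption in a StatefulSet of size 3.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2             # assumed value: 2 of 3 ZooKeeper pods must stay available
  selector:
    matchLabels:
      app: zookeeper          # matches pods labelled app: zookeeper
```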
Example PDB Using maxUnavailable:
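A sketch of the maxUnavailable form, under the same naming as the text (`zk-pdb`, `app: zookeeper`). The value `1` is an assumption: it is the setting that allows one pod of a StatefulSet of size 3 to be disrupted at a time.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  maxUnavailable: 1           # assumed value: at most one ZooKeeper pod may be unavailable
  selector:
    matchLabels:
      app: zookeeper          # matches pods labelled app: zookeeper
```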
For example, if the above zk-pdb object selects the pods of a StatefulSet of size 3, both specifications have the exact same meaning. The use of maxUnavailable is recommended as it automatically responds to changes in the number of replicas of the corresponding controller.
Unhealthy Pod Eviction Policy
A PodDisruptionBudget guarding an application ensures that the .status.currentHealthy number of pods does not fall below the number specified in .status.desiredHealthy by disallowing eviction of healthy pods. By using .spec.unhealthyPodEvictionPolicy, you can also define the criteria for when unhealthy pods should be considered for eviction. The default behavior when no policy is specified corresponds to the IfHealthyBudget policy.
Policies:
- IfHealthyBudget
  - Running pods (.status.phase="Running") that are not yet healthy can be evicted only if the guarded application is not disrupted (.status.currentHealthy is at least equal to .status.desiredHealthy).
  - This policy ensures that running pods of an already disrupted application have the best chance to become healthy. This has negative implications for draining nodes, which can be blocked by misbehaving applications that are guarded by a PDB, more specifically applications with pods in CrashLoopBackOff state (due to a bug or misconfiguration) or pods that are just failing to report the Ready condition.
- AlwaysAllow
  - Running pods (.status.phase="Running") that are not yet healthy are considered disrupted and can be evicted regardless of whether the criteria in a PDB are met.
  - This means prospective running pods of a disrupted application might not get a chance to become healthy. By using this policy, cluster managers can easily evict misbehaving applications that are guarded by a PDB, more specifically applications with pods in CrashLoopBackOff state (due to a bug or misconfiguration) or pods that are just failing to report the Ready condition.
Note: Pods in Pending, Succeeded or Failed phase are always considered for eviction.
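To illustrate where the policy is set, here is a sketch of a PDB that opts into AlwaysAllow. The name `app-pdb`, the label `app: my-app`, and the `maxUnavailable` value are hypothetical, and the example assumes a cluster version whose policy/v1 API serves the `unhealthyPodEvictionPolicy` field.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb                             # hypothetical name
spec:
  maxUnavailable: 1                         # hypothetical budget
  unhealthyPodEvictionPolicy: AlwaysAllow   # running-but-unhealthy pods can always be evicted
  selector:
    matchLabels:
      app: my-app                           # hypothetical label
```

Omitting the field, or setting it to `IfHealthyBudget`, keeps the default behavior described above.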
Arbitrary workloads and arbitrary selectors
You can skip this section if you only use PDBs with the built-in workload resources (Deployment, ReplicaSet, StatefulSet and ReplicationController) or with custom resources that implement a scale subresource, and where the PDB selector exactly matches the selector of the Pod’s owning resource.
You can use a PDB with pods controlled by another resource, by an “operator”, or bare pods, but with these restrictions:
- only .spec.minAvailable can be used, not .spec.maxUnavailable.
- only an integer value can be used with .spec.minAvailable, not a percentage.
It is not possible to use other availability configurations, because Kubernetes cannot derive a total number of pods without a supported owning resource.
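Under those restrictions, a PDB for pods without a supported owning resource might be sketched as follows. The name `custom-workload-pdb`, the label `app: custom-workload`, and the value `2` are hypothetical.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: custom-workload-pdb   # hypothetical name
spec:
  minAvailable: 2             # must be an integer here; percentages and maxUnavailable are not usable
  selector:
    matchLabels:
      app: custom-workload    # hypothetical label carried by the bare or operator-managed pods
```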
You can use a selector which selects a subset or superset of the pods belonging to a workload resource. The eviction API will disallow eviction of any pod covered by multiple PDBs, so most users will want to avoid overlapping selectors. One reasonable use of overlapping PDBs is when pods are being transitioned from one PDB to another.