NodeHealthCheck
remediation.medik8s.io / v1alpha1
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: example
apiVersion
string
APIVersion defines the versioned schema of this representation of an object.
Servers should convert recognized schemas to the latest internal value, and
may reject unrecognized values.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
kind
string
Kind is a string value representing the REST resource this object represents.
Servers may infer this from the endpoint the client submits requests to.
Cannot be updated.
In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
metadata
object
spec object
NodeHealthCheckSpec defines the desired state of NodeHealthCheck
escalatingRemediations []object
EscalatingRemediations contains an ordered list of remediation templates, each with a timeout.
The remediation templates are used one after another until the unhealthy node
becomes healthy within the timeout of the currently processed remediation. The order of
remediation is defined by the "order" field of each "escalatingRemediation".
Mutually exclusive with RemediationTemplate.
order
integer required
Order defines the order for this remediation.
Remediations with lower order will be used before remediations with higher order.
Remediations must not have the same order.
remediationTemplate object required
RemediationTemplate is a reference to a remediation template
provided by a remediation provider.
If a node needs remediation the controller will create an object from this template
and then it should be picked up by a remediation provider.
apiVersion
string
API version of the referent.
fieldPath
string
If referring to a piece of an object instead of an entire object, this string
should contain a valid JSON/Go field access statement, such as desiredState.manifest.containers[2].
For example, if the object reference is to a container within a pod, this would take on a value like:
"spec.containers{name}" (where "name" refers to the name of the container that triggered
the event) or if no container name is specified "spec.containers[2]" (container with
index 2 in this pod). This syntax is chosen only to have some well-defined way of
referencing a part of an object.
kind
string
Kind of the referent.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
name
string
Name of the referent.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
namespace
string
Namespace of the referent.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
resourceVersion
string
Specific resourceVersion to which this reference is made, if any.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
uid
string
UID of the referent.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#uids
timeout
string required
Timeout defines how long NHC will wait for the node to become healthy
before the next remediation (if any) is used. When the last remediation times out,
the overall remediation is considered failed.
As a safeguard against parallel remediations, a minimum of 60s is enforced.
Expects a string of decimal numbers, each with an optional
fraction and a unit suffix, e.g. "300ms", "1.5h" or "2h45m".
Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
pattern:
^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$
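The escalation order described above could be written roughly as follows. This is only a sketch: the template kinds, API group, names and namespace are hypothetical placeholders, not values taken from this reference, and the fragment omits the other spec fields.

```yaml
spec:
  escalatingRemediations:
    # Tried first because it has the lowest order.
    - order: 0
      timeout: 5m            # must be at least 60s
      remediationTemplate:
        apiVersion: example.com/v1alpha1      # hypothetical provider API group
        kind: ExampleSoftRemediationTemplate  # hypothetical template kind
        name: soft-template                   # hypothetical name
        namespace: example-namespace          # hypothetical namespace
    # Used only if the node is still unhealthy after the first timeout.
    - order: 1
      timeout: 10m
      remediationTemplate:
        apiVersion: example.com/v1alpha1
        kind: ExampleHardRemediationTemplate
        name: hard-template
        namespace: example-namespace
```

Because escalatingRemediations and remediationTemplate are mutually exclusive, a spec written like this would leave spec.remediationTemplate unset.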
healthyDelay
string
HealthyDelay is the time NHC waits before it allows a node to be considered healthy again.
A negative value means that NHC will never consider the node healthy, and manual intervention is expected.
pattern:
^-?([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$
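For illustration, a healthyDelay value might look like the fragment below. The durations are arbitrary examples chosen to match the pattern, not recommended settings.

```yaml
spec:
  # Delay of ten minutes before NHC allows the node to be healthy again.
  healthyDelay: 10m
  # A negative duration also matches the pattern; per the description above,
  # NHC then never considers the node healthy on its own.
  # healthyDelay: "-1s"
```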
maxUnhealthy
string | integer
Remediation is allowed if no more than "MaxUnhealthy" nodes selected by "selector" are not healthy.
Expects either a non-negative integer value or a percentage value.
Percentage values must be whole numbers from 0% to 100%.
0% is valid and will block all remediation.
MaxUnhealthy should not be used with remediators that delete nodes (e.g. MachineDeletionRemediation),
as this breaks the logic for counting healthy and unhealthy nodes.
MinHealthy and MaxUnhealthy are configuring the same aspect,
and they cannot be used at the same time.
string pattern:
^((100|[0-9]{1,2})%|[0-9]+)$
minHealthy
string | integer
Remediation is allowed if at least "MinHealthy" nodes selected by "selector" are healthy.
Expects either a non-negative integer value or a percentage value.
Percentage values must be whole numbers from 0% to 100%.
100% is valid and will block all remediation.
MinHealthy and MaxUnhealthy are configuring the same aspect,
and they cannot be used at the same time.
string pattern:
^((100|[0-9]{1,2})%|[0-9]+)$
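As a sketch of the two accepted forms (integer or percentage string), a spec might set one of the thresholds like this. The numbers are illustrative only.

```yaml
spec:
  # Percentage form: remediate only while at least 51% of the
  # selected nodes are healthy. Quoted so it is read as a string.
  minHealthy: "51%"
  # Integer form of the other field. It configures the same aspect,
  # so it must not be set together with minHealthy.
  # maxUnhealthy: 2
```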
pauseRequests
[]string
PauseRequests prevents any new remediation from starting, while in-flight remediations
keep running. Each entry is free-form and ideally states the requesting party's reason
for the pause, e.g.:
"imaginary-cluster-upgrade-manager-operator"
remediationTemplate object
RemediationTemplate is a reference to a remediation template
provided by an infrastructure provider.
If a node needs remediation the controller will create an object from this template
and then it should be picked up by a remediation provider.
Mutually exclusive with EscalatingRemediations
apiVersion
string
API version of the referent.
fieldPath
string
If referring to a piece of an object instead of an entire object, this string
should contain a valid JSON/Go field access statement, such as desiredState.manifest.containers[2].
For example, if the object reference is to a container within a pod, this would take on a value like:
"spec.containers{name}" (where "name" refers to the name of the container that triggered
the event) or if no container name is specified "spec.containers[2]" (container with
index 2 in this pod). This syntax is chosen only to have some well-defined way of
referencing a part of an object.
kind
string
Kind of the referent.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
name
string
Name of the referent.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
namespace
string
Namespace of the referent.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
resourceVersion
string
Specific resourceVersion to which this reference is made, if any.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
uid
string
UID of the referent.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#uids
selector object
Label selector to match nodes whose health will be checked.
Selecting both control-plane and worker nodes in one NHC CR is
highly discouraged and can result in undesired behaviour.
Note: mandatory now for above reason, but for backwards compatibility existing
CRs will continue to work with an empty selector, which matches all nodes.
matchExpressions []object
matchExpressions is a list of label selector requirements. The requirements are ANDed.
key
string required
key is the label key that the selector applies to.
operator
string required
operator represents a key's relationship to a set of values.
Valid operators are In, NotIn, Exists and DoesNotExist.
values
[]string
values is an array of string values. If the operator is In or NotIn,
the values array must be non-empty. If the operator is Exists or DoesNotExist,
the values array must be empty. This array is replaced during a strategic
merge patch.
matchLabels
object
matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels
map is equivalent to an element of matchExpressions, whose key field is "key", the
operator is "In", and the values array contains only "value". The requirements are ANDed.
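A selector limited to worker nodes could be sketched as below. The node-role label keys are an assumption about how the cluster labels its nodes; they are not defined by this reference.

```yaml
spec:
  selector:
    matchExpressions:
      # Both requirements must hold (they are ANDed).
      - key: node-role.kubernetes.io/worker         # assumed label key
        operator: Exists
      - key: node-role.kubernetes.io/control-plane  # assumed label key
        operator: DoesNotExist
```

With Exists and DoesNotExist the values array is left out, since it must be empty for those operators. A matchLabels entry such as `example-key: example-value` would be equivalent to a matchExpressions element using the In operator with that single value.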
stormCooldownDuration
string
StormCooldownDuration defines the duration of an optional cooldown phase after a storm.
A "storm" happens when the number of (un)healthy nodes exceeds the threshold defined by minHealthy or maxUnhealthy.
Sometimes this is triggered by a single root cause.
When that cause is fixed, there is a risk of remediating healthy nodes:
because node status updates arrive asynchronously, NHC may detect only some nodes as healthy in a first round of updates,
which satisfies the minHealthy or maxUnhealthy threshold (the storm ends) and triggers unneeded new remediations.
The storm cooldown phase prevents the creation of new remediations for the given duration, giving NHC time to get the latest node statuses.
Expects a string of decimal numbers each with optional fraction and a unit
suffix, e.g. "300ms", "1.5h" or "2h45m". Valid time units are "ns", "us"
(or "µs"), "ms", "s", "m", "h".
pattern:
^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$
unhealthyConditions []object
UnhealthyConditions contains a list of the conditions that determine
whether a node is considered unhealthy. The conditions are combined in a
logical OR, i.e. if any of the conditions is met, the node is unhealthy.
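The OR semantics can be illustrated with the sketch below, which uses the duration, status and type fields documented next. The Ready condition and the 300s duration are example choices, not defaults stated in this reference.

```yaml
spec:
  unhealthyConditions:
    # The node is unhealthy if EITHER entry matches.
    - type: Ready
      status: "False"   # quoted so YAML keeps it a string, not a boolean
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
```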
duration
string required
Duration for which the specified condition must persist before a node is considered unhealthy.
Expects a string of decimal numbers, each with an optional
fraction and a unit suffix, e.g. "300ms", "1.5h" or "2h45m".
Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
pattern:
^([0-9]+(\.[0-9]+)?(ns|us|µs|ms|s|m|h))+$
status
string required
The condition status in the node's status to watch for.
Typically False, True or Unknown.
minLength:
1
type
string required
The condition type in the node's status to watch for.
minLength:
1
status object
NodeHealthCheckStatus defines the observed state of NodeHealthCheck
conditions []object
Represents the observations of a NodeHealthCheck's current state.
Known .status.conditions.type are: "Disabled"
lastTransitionTime
string required
lastTransitionTime is the last time the condition transitioned from one status to another.
This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
format:
date-time
message
string required
message is a human readable message indicating details about the transition.
This may be an empty string.
maxLength:
32768
observedGeneration
integer
observedGeneration represents the .metadata.generation that the condition was set based upon.
For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
with respect to the current state of the instance.
format:
int64
minimum:
0
reason
string required
reason contains a programmatic identifier indicating the reason for the condition's last transition.
Producers of specific condition types may define expected values and meanings for this field,
and whether the values are considered a guaranteed API.
The value should be a CamelCase string.
This field may not be empty.
pattern:
^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
minLength:
1
maxLength:
1024
status
string required
status of the condition, one of True, False, Unknown.
enum:
True, False, Unknown
type
string required
type of condition in CamelCase or in foo.example.com/CamelCase.
pattern:
^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
maxLength:
316
healthyNodes
integer
HealthyNodes specifies the number of healthy nodes observed.
inFlightRemediations
object
InFlightRemediations records the timestamp when remediation triggered per node.
Deprecated in favour of UnhealthyNodes.
lastUpdateTime
string
LastUpdateTime is the last time the status was updated.
format:
date-time
observedNodes
integer
ObservedNodes specifies the number of nodes observed by using the NHC spec.selector.
phase
string
Phase represents the current phase of this Config.
Known phases are Disabled, Paused, Remediating and Enabled, based on:
- the status of the Disabled condition
- the value of PauseRequests
- the value of InFlightRemediations
reason
string
Reason explains the current phase in more detail.
unhealthyNodes []object
UnhealthyNodes tracks currently unhealthy nodes and their remediations.
conditionsHealthyTimestamp
string
ConditionsHealthyTimestamp is the RFC 3339 date and time at which the unhealthy conditions no longer matched.
The remediation CR will be deleted at that time, but the node will still be tracked as unhealthy until all
remediation CRs are actually deleted, i.e. once remediators have finished cleanup and removed their finalizers.
format:
date-time
healthyDelayed
boolean
HealthyDelayed notes whether a node should be considered healthy, but isn't due to NodeHealthCheckSpec.HealthyDelay configuration.
name
string required
Name is the name of the unhealthy node
remediations []object
Remediations tracks the remediations created for this node
resource object required
Resource is the reference to the remediation CR which was created
apiVersion
string
API version of the referent.
fieldPath
string
If referring to a piece of an object instead of an entire object, this string
should contain a valid JSON/Go field access statement, such as desiredState.manifest.containers[2].
For example, if the object reference is to a container within a pod, this would take on a value like:
"spec.containers{name}" (where "name" refers to the name of the container that triggered
the event) or if no container name is specified "spec.containers[2]" (container with
index 2 in this pod). This syntax is chosen only to have some well-defined way of
referencing a part of an object.
kind
string
Kind of the referent.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
name
string
Name of the referent.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
namespace
string
Namespace of the referent.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
resourceVersion
string
Specific resourceVersion to which this reference is made, if any.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
uid
string
UID of the referent.
More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#uids
started
string required
Started is the creation time of the remediation CR
format:
date-time
templateName
string
TemplateName is required when using several templates of the same kind
timedOut
string
TimedOut is the time when the remediation timed out.
Applicable for escalating remediations only.
format:
date-time