5.5.1 Alerts List
An alert is raised when the occurrence condition listed below persists for the specified duration.
AlertManager
Alarm ID | ALM-001 |
---|---|
Severity | warning |
Alarm Name | AlertmanagerDown |
duration | 5 minutes |
occurrence condition | Occurs when Alertmanager metrics cannot be collected |
Action | Check the Prometheus logs and the Alertmanager logs and events. If necessary, restart the Pod. |
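For reference, the checks above can be done with kubectl. The namespace and label selectors below are assumptions (a typical prometheus-operator style installation) and may differ in your environment.

```
# Assumed namespace and labels; adjust to your installation.
$ kubectl -n monitoring logs -l app=prometheus --tail=100
$ kubectl -n monitoring logs -l app=alertmanager --tail=100
$ kubectl -n monitoring get events --sort-by=.metadata.creationTimestamp

# Restart by deleting the Pod; its controller recreates it.
$ kubectl -n monitoring delete pod <alertmanager-pod-name>
```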
Alarm ID | ALM-002 |
---|---|
Severity | warning |
Alarm Name | AlertmanagerFailedReload |
duration | 10 minutes |
occurrence condition | Occurs when Alertmanager fails to re-read its configuration after a configuration change |
Action | Check the Pod's log and fix the configuration error in the ConfigMap. |
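The reload failure reason usually appears in the Alertmanager Pod's log, after which the configuration can be corrected in the ConfigMap. Namespace and resource names below are placeholders.

```
$ kubectl -n monitoring logs <alertmanager-pod-name>
$ kubectl -n monitoring get configmap
$ kubectl -n monitoring edit configmap <alertmanager-configmap-name>
```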
ETCD3
Alarm ID | ETC-001 |
---|---|
Severity | critical |
Alarm Name | InsufficientMembers |
duration | 3 minutes |
occurrence condition | Occurs when ETCD metrics cannot be collected |
Action | Check the status of the ETCD cluster. Check the Prometheus log and the etcd status on the corresponding node. |
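A minimal sketch for checking cluster health with etcdctl, assuming the v3 API and TLS; the endpoints and certificate paths below are assumptions and depend on how ETCD was installed.

```
# Run on an ETCD node; endpoints and certificate paths are assumptions.
$ export ETCDCTL_API=3
$ FLAGS="--endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/server.crt --key=/etc/etcd/server.key"
$ etcdctl $FLAGS member list
$ etcdctl $FLAGS endpoint health
```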
Alarm ID | ETC-002 |
---|---|
Severity | critical |
Alarm Name | NoLeader |
duration | 1 minute |
occurrence condition | Occurs when the ETCD cluster has no leader |
Action | Check the status of the ETCD cluster. Disk latency may be the problem, so run the following command on all nodes of the ETCD cluster (ETCD tuning): $ sudo ionice -c2 -n0 -p `pgrep etcd` |
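To confirm whether a leader currently exists, recent etcdctl versions can report per-endpoint status (the IS LEADER column); the flags are the same assumptions as in the sketch above.

```
# "--cluster" queries every member; requires a recent etcdctl (v3.3+).
$ etcdctl $FLAGS endpoint status --cluster -w table
```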
Alarm ID | ETC-003 |
---|---|
Severity | warning |
Alarm Name | HighNumberOfLeaderChanges |
duration | Immediately |
occurrence condition | Occurs when there are more than 3 leader changes in the last hour |
Action | Check the status of the ETCD cluster. Disk latency may be the problem, so run the following command on all nodes of the ETCD cluster (ETCD tuning): $ sudo ionice -c2 -n0 -p `pgrep etcd` |
Alarm ID | ETC-004 |
---|---|
Severity | warning |
Alarm Name | HighNumberOfFailedGRPCRequests |
duration | 10 minutes |
occurrence condition | Occurs when more than 1% of gRPC method calls fail within the last 5 minutes |
Action | Increase the bandwidth of the ETCD and Kubernetes clusters, or scale up the clusters. |
Alarm ID | ETC-005 |
---|---|
Severity | critical |
Alarm Name | HighNumberOfFailedGRPCRequests |
duration | 5 minutes |
occurrence condition | Occurs when more than 5% of gRPC method calls fail within the last 5 minutes |
Action | Increase the bandwidth of the ETCD and Kubernetes clusters, or scale up the clusters. |
Alarm ID | ETC-006 |
---|---|
Severity | critical |
Alarm Name | GRPCRequestsSlow |
duration | 10 minutes |
occurrence condition | Occurs when the 99th percentile of gRPC method request latency over the last 5 minutes is greater than 150ms |
Action | Increase the bandwidth of the ETCD and Kubernetes clusters, or scale up the clusters. |
Alarm ID | ETC-007 |
---|---|
Severity | warning |
Alarm Name | HighNumberOfFailedHTTPRequests |
duration | 10 minutes |
occurrence condition | Occurs when more than 1% of requests to HTTP endpoints fail within the last 5 minutes |
Action | Increase the bandwidth of the ETCD and Kubernetes clusters, or scale up the clusters. |
Alarm ID | ETC-008 |
---|---|
Severity | critical |
Alarm Name | HighNumberOfFailedHTTPRequests |
duration | 5 minutes |
occurrence condition | Occurs when more than 5% of requests to HTTP endpoints fail within the last 5 minutes |
Action | Increase the bandwidth of the ETCD and Kubernetes clusters, or scale up the clusters. |
Alarm ID | ETC-009 |
---|---|
Severity | warning |
Alarm Name | HTTPRequestsSlow |
duration | 10 minutes |
occurrence condition | Occurs when the 99th percentile of HTTP request latency over the last 5 minutes is greater than 150ms |
Action | Increase the bandwidth of the ETCD and Kubernetes clusters, or scale up the clusters. |
Alarm ID | ETC-010 |
---|---|
Severity | warning |
Alarm Name | EtcdMemberCommunicationSlow |
duration | 10 minutes |
occurrence condition | Occurs when the 99th percentile of member-to-member communication latency over the last 5 minutes is greater than 150ms |
Action | Increase the bandwidth of the ETCD cluster or scale up the cluster. |
Alarm ID | ETC-011 |
---|---|
Severity | warning |
Alarm Name | HighNumberOfFailedProposals |
duration | Immediately |
occurrence condition | Occurs when there are more than 5 failed Raft proposals in the last hour (the Raft protocol is ETCD's synchronization protocol). |
Action | According to the ETCD metrics documentation, this indicates longer cluster downtime caused by temporary leader-election failures or a lack of members. Check whether a leader exists and whether any ETCD member has been left behind. |
Alarm ID | ETC-012 |
---|---|
Severity | warning |
Alarm Name | HighFsyncDurations |
duration | 10 minutes |
occurrence condition | Occurs when the 99th percentile of the WAL fsync duration over the last 5 minutes is greater than 500ms (WAL fsync is called when log entries are persisted to disk before being applied). |
Action | According to the ETCD metrics documentation, this indicates a problem with the disk. |
Alarm ID | ETC-013 |
---|---|
Severity | warning |
Alarm Name | HighCommitDurations |
duration | 10 minutes |
occurrence condition | Occurs when the 99th percentile of the backend commit duration over the last 5 minutes is greater than 250ms (a backend commit incrementally persists the most recent changes to disk). |
Action | According to the ETCD metrics documentation, this indicates a problem with the disk. |
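Since ETC-012 and ETC-013 both point at slow disks, it can help to look at device-level I/O latency on the ETCD nodes; iostat (from the sysstat package) is one option.

```
# Extended device statistics every second, 5 samples;
# the await / w_await columns show average I/O wait in milliseconds.
$ iostat -x 1 5
```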
General
Alarm ID | GEN-001 |
---|---|
Severity | warning |
Alarm Name | TargetDown |
duration | 10 minutes |
occurrence condition | Occurs when metrics cannot be collected from a target. The alert indicates the job whose target is down. |
Action | Check the Prometheus log and the logs and events of the pod corresponding to the job. |
Alarm ID | GEN-002 |
---|---|
Severity | |
Alarm Name | DeadMansSwitch |
duration | Immediately |
occurrence condition | DeadMansSwitch notification: an alert that fires continuously to confirm that the entire alerting pipeline is working. |
Action | This alarm is not delivered to the user. |
Alarm ID | GEN-003 |
---|---|
Severity | critical |
Alarm Name | TooManyOpenFileDescriptors |
duration | 10 minutes |
occurrence condition | Occurs when file descriptor usage exceeds 95% |
Action | Increase the file descriptor limit of the node (requires a node restart). |
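A sketch of checking and raising the kernel-wide file descriptor limit on the node. The limit value below is only an example, and per-process limits (ulimit / LimitNOFILE) may also need adjusting depending on which process is exhausting descriptors.

```
# Current allocation: allocated / unused / maximum
$ cat /proc/sys/fs/file-nr
$ sysctl fs.file-max

# Raise the limit persistently (example value), then reload sysctl settings.
$ echo 'fs.file-max = 1000000' | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p
```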
Alarm ID | GEN-004 |
---|---|
Severity | warning |
Alarm Name | FdExhaustionClose |
duration | 10 minutes |
occurrence condition | Occurs when file descriptor exhaustion is predicted within 4 hours using simple linear regression |
Action | Check the logs and events of the corresponding Pod. If necessary, increase the file descriptor limit of the node (requires a node restart). |
Alarm ID | GEN-005 |
---|---|
Severity | critical |
Alarm Name | FdExhaustionClose |
duration | 10 minutes |
occurrence condition | Occurs when file descriptor exhaustion is predicted within 1 hour using simple linear regression |
Action | Check the logs and events of the corresponding Pod. If necessary, increase the file descriptor limit of the node (requires a node restart). |
Kube-ApiServer
Alarm ID | KAS-001 |
---|---|
Severity | critical |
Alarm Name | K8SApiserverDown |
duration | 5 minutes |
occurrence condition | Occurs when kube-apiserver metrics cannot be collected |
Action | Check Prometheus logs and kube-apiserver logs and events. If necessary, restart the Pod. |
Alarm ID | KAS-002 |
---|---|
Severity | warning |
Alarm Name | K8SApiServerLatency |
duration | 10 minutes |
occurrence condition | Occurs when the 99th percentile of API server request latency over the last 10 minutes is greater than 1s |
Action | If this continues to occur, add a master node. |
Kube-ControllerManager
Alarm ID | KCM-001 |
---|---|
Severity | critical |
Alarm Name | K8SControllerManagerDown |
duration | 5 minutes |
occurrence condition | Occurs when kube-controller-manager metrics cannot be collected |
Action | Check Prometheus logs and kube-controller-manager logs and events. If necessary, restart the Pod. |
Kube-Scheduler
Alarm ID | KSC-001 |
---|---|
Severity | critical |
Alarm Name | K8SSchedulerDown |
duration | 5 minutes |
occurrence condition | Occurs when kube-scheduler metrics cannot be collected |
Action | Check Prometheus logs and kube-scheduler logs and events. If necessary, restart the Pod. |
Kube-State-Metrics
Alarm ID | KSM-001 |
---|---|
Severity | warning |
Alarm Name | DeploymentGenerationMismatch |
duration | 15 minutes |
occurrence condition | Occurs when the generation set on the Deployment differs from the observed generation |
Action | Check the Deployment's logs and events. If necessary, redeploy the Deployment. |
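The Deployment's state and recent events can be inspected with kubectl, and a redeploy can be triggered with a rollout restart (kubectl 1.15 or later); the namespace and name below are placeholders.

```
$ kubectl -n <namespace> describe deployment <deployment-name>
$ kubectl -n <namespace> rollout status deployment/<deployment-name>
$ kubectl -n <namespace> rollout restart deployment/<deployment-name>
```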
Alarm ID | KSM-002 |
---|---|
Severity | warning |
Alarm Name | DeploymentReplicasNotUpdated |
duration | 15 minutes |
occurrence condition | Occurs when the number of replicas set on the Deployment differs from the number of updated or available replicas |
Action | The change to the Deployment has not been rolled out, so check the logs and events of the Deployment and its Pods. |
Alarm ID | KSM-003 |
---|---|
Severity | warning |
Alarm Name | DaemonSetRolloutStuck |
duration | 15 minutes |
occurrence condition | Occurs when the DaemonSet has Pods that are not in the Ready state |
Action | Check the DaemonSet and Pod logs and events. |
Alarm ID | KSM-004 |
---|---|
Severity | warning |
Alarm Name | K8SDaemonSetsNotScheduled |
duration | 10 minutes |
occurrence condition | Occurs when the number of running Pods is smaller than the number of Pods the DaemonSet is supposed to run |
Action | Check the DaemonSet and Pod logs and events. Verify that the node missing the Pod is healthy. If the master node is tainted, make sure the DaemonSet has the appropriate toleration settings. |
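To compare desired versus running Pods and to check taints and tolerations, something like the following can be used; names and namespace are placeholders.

```
# DESIRED vs CURRENT/READY counts for the DaemonSet
$ kubectl -n <namespace> get daemonset <daemonset-name>

# Taints on the node that is missing a Pod, and the DaemonSet's tolerations
$ kubectl describe node <node-name> | grep -i taint
$ kubectl -n <namespace> get daemonset <daemonset-name> -o jsonpath='{.spec.template.spec.tolerations}'
```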
Alarm ID | KSM-005 |
---|---|
Severity | warning |
Alarm Name | DaemonSetsMissScheduled |
duration | 10 minutes |
occurrence condition | Occurs when an improperly scheduled Pod is encountered in the DaemonSet |
Action | Check the DaemonSet and Pod logs and events. |
Alarm ID | KSM-006 |
---|---|
Severity | warning |
Alarm Name | PodFrequentlyRestarting |
duration | 10 minutes |
occurrence condition | Occurs when a Pod has restarted more than 5 times in the last hour |
Action | Check the logs and events of the corresponding Pod. Restart the Pod if necessary. |
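The log of the previously crashed container and the Pod's events usually show the restart reason; names and namespace are placeholders.

```
$ kubectl -n <namespace> describe pod <pod-name>
$ kubectl -n <namespace> logs <pod-name> --previous

# Restart via the controller, if necessary
$ kubectl -n <namespace> delete pod <pod-name>
```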
Kubelet
Alarm ID | KBL-001 |
---|---|
Severity | warning |
Alarm Name | K8SNodeNotReady |
duration | 1 hour |
occurrence condition | Occurs when the Node state is not Ready |
Action | Check the status and events of that node. Connect to the node via ssh and check the status of the kubelet. |
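A sketch of the node and kubelet checks, assuming the node runs the kubelet as a systemd service; the node name is a placeholder.

```
$ kubectl describe node <node-name>
$ kubectl get events --field-selector involvedObject.name=<node-name>

# On the node itself (via ssh):
$ systemctl status kubelet
$ journalctl -u kubelet --since "1 hour ago"
```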
Alarm ID | KBL-002 |
---|---|
Severity | critical |
Alarm Name | K8SManyNodesNotReady |
duration | 1 minute |
occurrence condition | Occurs when the percentage of nodes that are not Ready is more than 20% across the cluster |
Action | Check the status and events of the nodes. Connect to the node via ssh and check the status of the kubelet. |
Alarm ID | KBL-003 |
---|---|
Severity | warning |
Alarm Name | K8SKubeletDown |
duration | 1 hour |
occurrence condition | Occurs when kubelet metrics cannot be collected from more than 3% of the kubelets in the cluster |
Action | Check the Prometheus log and its node status and events. Connect to the node via ssh and check the status of the kubelet. |
Alarm ID | KBL-004 |
---|---|
Severity | critical |
Alarm Name | K8SKubeletDown |
duration | 1 hour |
occurrence condition | Occurs when kubelet metrics cannot be collected from more than 10% of the kubelets in the cluster |
Action | Check the Prometheus logs and the status and events of the affected nodes. Connect to the node via ssh and check the status of the kubelet. |
Alarm ID | KBL-005 |
---|---|
Severity | warning |
Alarm Name | K8SKubeletTooManyPods |
duration | Immediately |
occurrence condition | Occurs when the number of pods placed in the Node exceeds 100 (the limit is 110) |
Action | Once the limit is reached, no more Pods can be created on the node. Check the state of the other nodes as well, and add a node if necessary. |
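The per-node Pod count can be checked as follows; the 110-Pod ceiling corresponds to the kubelet's default --max-pods setting, and the node name is a placeholder.

```
# Count Pods scheduled on a specific node
$ kubectl get pods --all-namespaces --no-headers --field-selector spec.nodeName=<node-name> | wc -l

# Node capacity and current allocation (includes the "pods" capacity)
$ kubectl describe node <node-name>
```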
Node
Alarm ID | NOD-001 |
---|---|
Severity | warning |
Alarm Name | NodeExporterDown |
duration | 10 minutes |
occurrence condition | Occurs when NodeExporter metrics cannot be collected |
Action | Check the Prometheus logs and NodeExporter logs and events. If necessary, restart the Pod. |
Alarm ID | NOD-002 |
---|---|
Severity | critical |
Alarm Name | K8SNodeOutOfDisk |
duration | Immediately |
occurrence condition | Occurs when the Node state is OutOfDisk. |
Action | Expand the disk of the node. |
Alarm ID | NOD-003 |
---|---|
Severity | warning |
Alarm Name | K8SNodeMemoryPressure |
duration | Immediately |
occurrence condition | Occurs when the Node state is MemoryPressure |
Action | Expand the memory of the node. |
Alarm ID | NOD-004 |
---|---|
Severity | warning |
Alarm Name | K8SNodeDiskPressure |
duration | Immediately |
occurrence condition | Occurs when Node Status is DiskPressure |
Action | Free up disk space on the node by removing logs, unused Docker images, and PV backups. If this continues to occur, add disk capacity to the node. |
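On nodes that use Docker, unused images and old journal logs can be removed roughly as follows; the retention value is only an example.

```
# Remove all Docker images not referenced by any container
$ sudo docker image prune -a

# Trim systemd journal logs older than 7 days (example retention)
$ sudo journalctl --vacuum-time=7d

# Check what is consuming space
$ df -h
$ sudo du -sh /var/lib/docker /var/log
```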
Alarm ID | NOD-005 |
---|---|
Severity | warning |
Alarm Name | NodeCPUUsage |
duration | 30 minutes |
occurrence condition | Occurs when the node's average CPU usage over the last 5 minutes exceeds 90% |
Action | Increase the CPU of the node. |
Alarm ID | NOD-006 |
---|---|
Severity | warning |
Alarm Name | NodeMemoryUsage |
duration | 30 minutes |
occurrence condition | Occurs when the node's memory usage exceeds 90% |
Action | Expand the memory of the node. |
Prometheus
Alarm ID | PRM-001 |
---|---|
Severity | warning |
Alarm Name | PrometheusFailedReload |
duration | 10 minutes |
occurrence condition | Occurs when Prometheus fails to re-read its configuration after a configuration change |
Action | Check the Pod's log and fix the configuration error in the ConfigMap. |
Cocktail
Alarm ID | CKT-001 |
---|---|
Severity | warning |
Alarm Name | PvLowRequestDisk |
duration | 30 minutes |
occurrence condition | Occurs when PV usage exceeds 80% of the requested disk size |
Action | Increase the size of the PV. Note that the server must be redeployed. |
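If the StorageClass supports volume expansion (allowVolumeExpansion: true), the PVC's requested size can be increased in place; otherwise the volume must be recreated. Names and the size below are placeholders.

```
$ kubectl -n <namespace> patch pvc <pvc-name> \
    -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
$ kubectl -n <namespace> get pvc <pvc-name>
```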
Alarm ID | CKT-002 |
---|---|
Severity | warning |
Alarm Name | PvLowTotalDisk |
duration | 30 minutes |
occurrence condition | Occurs when the usage of the disk on which the PV is mounted exceeds 80% |
Action | Check the status of the mounted disk and remove unused PVs. If necessary, expand the disk. |
Alarm ID | CKT-003 |
---|---|
Severity | warning |
Alarm Name | PodCPULimitUsage |
duration | 30 minutes |
occurrence condition | Occurs when a Pod's CPU usage exceeds 90% of its resource limit |
Action | If this continues to occur, change the CPU limit of the Deployment. |
Alarm ID | CKT-004 |
---|---|
Severity | warning |
Alarm Name | PodMemoryLimitUsage |
duration | 30 minutes |
occurrence condition | Occurs when a Pod's memory usage exceeds 90% of its resource limit |
Action | If this continues to occur, change the memory limit of the Deployment. |
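The CPU and memory limits of a Deployment (CKT-003 / CKT-004) can be changed either by editing the manifest or with kubectl set resources; the names and values below are placeholders.

```
$ kubectl -n <namespace> set resources deployment <deployment-name> \
    --limits=cpu=1000m,memory=1Gi
```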