Hello, readers! This article covers debugging Kubernetes cluster-level issues and walks through various debugging scenarios.
So, let us begin!! 🙂
Troubleshooting a Kubernetes cluster
A Kubernetes cluster is home to all the applications that are packaged in containers and run as Pods; it is the heart of the entire system. In the previous blog, we looked at ways to tackle and debug application-level issues.
This blog focuses on issues at the Kubernetes cluster level. Sometimes the applications themselves are healthy, and the problem lies in the underlying infrastructure that supports them, that is, the Kubernetes cluster.
The first step is to check the status of the nodes, using the command below:
kubectl get nodes
In the output, check the STATUS column for every node. Ideally, all nodes should be in the Ready state; a node reported as NotReady needs further investigation.
To fetch a detailed report about the health of the cluster, use the command below:
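On a cluster with many nodes, scanning this output by eye is error-prone. The sketch below is one way to filter it. It assumes bash and awk, and that STATUS is the second column of `kubectl get nodes --no-headers` (the default layout). `not_ready_nodes` is a hypothetical helper name. A cordoned node is reported as `Ready,SchedulingDisabled`, so only the part before the comma is compared.

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints "<name> <status>" for every node that is not Ready.
not_ready_nodes() {
  awk '{
    # STATUS may carry extra flags, e.g. "Ready,SchedulingDisabled";
    # compare only the first comma-separated part.
    split($2, parts, ",")
    if (parts[1] != "Ready") print $1, $2
  }'
}

# Usage (requires access to a cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Any node it prints is worth inspecting further, for example with `kubectl describe node <node-name>`, whose Conditions section explains why the node is not Ready.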
kubectl cluster-info dump
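By default, `kubectl cluster-info dump` prints everything to standard output and covers only the current namespace and `kube-system`. The sketch below uses a hypothetical helper called `dump_cluster_state`. It writes a dump of all namespaces into a timestamped directory using the `--all-namespaces` and `--output-directory` flags, then lists the files that mention "error" as a first place to look. It assumes bash and a working kubeconfig. The keyword search is a rough heuristic, not a health check.

```shell
# Hypothetical helper: dump the state of all namespaces into a timestamped
# directory, then list the dump files that mention "error".
dump_cluster_state() {
  local base_dir="${1:-./cluster-dump}"
  local out_dir="${base_dir}/$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$out_dir" || return 1
  # Without --all-namespaces, only the current namespace and kube-system are dumped.
  kubectl cluster-info dump --all-namespaces --output-directory="$out_dir" || return 1
  # -r: recurse, -i: case-insensitive, -l: print matching file names only.
  grep -ril "error" "$out_dir" || true
  # Last line of output: where the dump was written.
  echo "$out_dir"
}
```

Running `dump_cluster_state ./cluster-dump` prints the matching files followed by the dump directory. That directory can then be archived or shared with whoever is helping to debug the cluster.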
Moving ahead, let us look at some more debugging scenarios in the next section.
Debugging cluster-level issues – Investigating the logs
When an issue has to be traced down to the core components of the cluster, logs are the mechanism that supports the entire debugging process.
On the master (control plane) node, the logs of the core components can be found at the following locations:
/var/log/kube-apiserver.log – API Server
/var/log/kube-scheduler.log – Kube Scheduler
/var/log/kube-controller-manager.log – Controller Manager
The logs of the worker node components can be found at the following locations:
/var/log/kubelet.log – Kubelet
/var/log/kube-proxy.log – Kube Proxy
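Reading a whole component log is rarely necessary at first, because the most recent error lines usually point in the right direction. The hypothetical helper `recent_errors` below prints the last matching lines from each log file given to it and skips files that are missing. It prints 20 lines by default, configurable through an assumed `LINES_PER_FILE` variable. On clusters set up with kubeadm, the control plane components run as static Pods and these files usually do not exist. In that case their logs are read with `kubectl logs -n kube-system <pod-name>` instead.

```shell
# Hypothetical helper: for each log file argument, print the last N lines
# containing "error" or "fail" (case-insensitive). N defaults to 20 and can be
# overridden with LINES_PER_FILE. Missing or unreadable files are reported on
# stderr and skipped.
recent_errors() {
  local lines="${LINES_PER_FILE:-20}"
  local f
  for f in "$@"; do
    if [ ! -r "$f" ]; then
      echo "== $f: not found or not readable" >&2
      continue
    fi
    echo "== $f"
    grep -iE 'error|fail' "$f" | tail -n "$lines"
  done
}
```

For example, on a control plane node, run `recent_errors /var/log/kube-apiserver.log /var/log/kube-scheduler.log` as a user that can read those files.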
On nodes where the Kubernetes components run as systemd services, their logs go to the systemd journal instead, so we may need systemctl or journalctl to inspect them in detail.
For example, the command below shows the logs of the kubelet service on a worker node:
journalctl -u kubelet
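Without options, `journalctl -u kubelet` shows the full history of the service, which can be very long. A narrower starting point is to confirm with `systemctl status kubelet` whether the service is running. Then restrict the journal to a recent time window with `--since` and filter it. The kubelet logs in the klog format, where each message begins with a severity letter (I, W, E or F) followed by the date. The hypothetical filter `kubelet_errors` below therefore keeps only error and fatal entries. It is a sketch that assumes a grep supporting extended regular expressions (`-E`).

```shell
# Hypothetical filter: keep only klog error (E) and fatal (F) entries.
# klog headers look like "E0612 10:15:30.123456 1234 file.go:42] message",
# so the pattern matches a severity letter, four date digits and hh:mm:ss.
kubelet_errors() {
  grep -E '(^|[[:space:]])[EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}'
}

# Usage on the worker node:
#   systemctl status kubelet
#   journalctl -u kubelet --since "1 hour ago" --no-pager | kubelet_errors
```

Adding `-f` to the journalctl command follows the log live, which is handy while restarting the kubelet with `systemctl restart kubelet`.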
Root causes of cluster failures/issues
- One common root cause of a Kubernetes cluster failure is a virtual machine shutdown, for example of the VM that runs the API server.
- A network partition that splits the cluster, so that some nodes can no longer reach the control plane.
- Unavailability of persistent data storage, such as the storage backing Persistent Volumes.
- Configuration errors in the Kubernetes software installation.
- Loss of Services, Pods, etc.
- Users are unable to read from the Kubernetes API.
- Issues with the kubelet process. In this scenario, the kubelet crashes occasionally and, while it is down, new Pods cannot be started on that node.
- Issues with a worker node. In this scenario, the node shuts down and the Pods running on it stop working.
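Several of these causes, such as a node or kubelet shutdown or lost Pods, show up first as Pods that are not running. As a quick cluster-wide check, the hypothetical filter `unhealthy_pods` below reads `kubectl get pods --all-namespaces --no-headers`, whose columns are NAMESPACE, NAME, READY, STATUS, RESTARTS and AGE. It prints Pods whose STATUS is neither Running nor Completed, plus Running Pods whose containers are not all ready. It is a sketch that assumes bash, awk and the default column layout.

```shell
# Hypothetical filter: reads `kubectl get pods --all-namespaces --no-headers`
# on stdin and prints "<namespace>/<name> <ready> <status>" for Pods that are
# not Running/Completed, or Running with fewer ready containers than total.
unhealthy_pods() {
  awk '{
    split($3, ready, "/")   # READY column, e.g. "0/1"
    if (($4 != "Running" && $4 != "Completed") ||
        ($4 == "Running" && ready[1] != ready[2]))
      print $1 "/" $2, $3, $4
  }'
}

# Usage:
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

To see where the affected Pods were scheduled, the same listing with `-o wide` adds a NODE column. If most unhealthy Pods share one node, that node is the first suspect.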
To mitigate these scenarios, we first need to examine the error in detail and then work out an appropriate workaround.
Workarounds for the cluster issues
- For VM shutdowns, we can use the automatic VM restart feature offered by IaaS providers such as GCP, Azure, etc.
- We can use reliable IaaS storage, such as that provided by GCP or Azure, as the backend for the persistent data storage solution in place.
- Provision a high-availability (HA) etcd configuration to mitigate data loss.
- We can use replication controllers (in current Kubernetes, typically Deployments and their ReplicaSets) together with Services, so that Pods lost to a node or kubelet shutdown are recreated on healthy nodes and traffic keeps reaching the surviving replicas.
- Take periodic snapshots of the API server's storage volumes and of persistent data volumes, so that data can be recovered after a disaster.
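The etcd and snapshot mitigations above can be made more concrete. The API server stores all cluster state in etcd, so backing up etcd regularly is the usual complement to an HA etcd setup. Below is a minimal sketch. It assumes a kubeadm-style control plane node where etcd listens on https://127.0.0.1:2379, `etcdctl` with the v3 API is installed, and the etcd certificates sit at the kubeadm default paths under /etc/kubernetes/pki/etcd. `backup_etcd` and `ETCD_PKI_DIR` are hypothetical names, and the paths and endpoint must be adjusted for other setups.

```shell
# Hypothetical helper: save a timestamped etcd v3 snapshot and print its path.
# ASSUMPTIONS: local etcd at https://127.0.0.1:2379 and kubeadm default
# certificate paths (override the directory with ETCD_PKI_DIR).
backup_etcd() {
  local backup_dir="${1:-/var/backups/etcd}"
  local pki="${ETCD_PKI_DIR:-/etc/kubernetes/pki/etcd}"
  local file="${backup_dir}/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
  mkdir -p "$backup_dir" || return 1
  ETCDCTL_API=3 etcdctl snapshot save "$file" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="${pki}/ca.crt" \
    --cert="${pki}/server.crt" \
    --key="${pki}/server.key" || return 1
  echo "$file"
}
```

Scheduling `backup_etcd /var/backups/etcd` from cron, and copying the resulting files off the node, gives periodic snapshots that survive the loss of the VM itself. Restoring from a snapshot is a separate procedure that depends on how etcd is deployed.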
With this, we have come to the end of this topic. Feel free to comment below in case you come across any questions.
For more such posts related to Docker and Kubernetes, stay tuned.
Till then, Happy Learning!! 🙂