Troubleshooting in k8s

Kubernetes troubleshooting is the process of identifying, diagnosing, and resolving issues in Kubernetes clusters, nodes, pods, or containers.

More broadly defined, Kubernetes troubleshooting also includes effective ongoing management of faults and taking measures to prevent issues in Kubernetes components.

Kubernetes troubleshooting can be very complex. This article focuses on the three pillars of effective troubleshooting, the most common Kubernetes errors and how to resolve them, and techniques for digging deeper into pod, container, and cluster issues.

There are three aspects to effective troubleshooting in a Kubernetes cluster: understanding the problem, managing and remediating the problem, and preventing the problem from recurring.

In a Kubernetes environment, it can be very difficult to understand what happened and determine the root cause of the problem. This typically involves:

To achieve the above, teams typically use the following technologies:

In a microservices architecture, it is common for each component to be developed and managed by a separate team. Because production incidents often involve multiple components, collaboration is essential to remediate problems fast.

Once the issue is understood, there are three approaches to remediating it:

To achieve the above, teams typically use the following technologies:

Successful teams make prevention their top priority. Over time, this will reduce the time invested in identifying and troubleshooting new issues. Preventing production issues in Kubernetes involves:

To achieve the above, teams commonly use the following technologies:

Kubernetes is a complex system, and troubleshooting issues that occur somewhere in a Kubernetes cluster is just as complicated.

Even in a small, local Kubernetes cluster, it can be difficult to diagnose and resolve issues, because an issue can represent a problem in an individual container, in one or more pods, in a controller, a control plane component, or more than one of these.

In a large-scale production environment, these issues are exacerbated by the low level of visibility and the large number of moving parts. Teams must use multiple tools to gather the data required for troubleshooting, and may need additional tools to diagnose and resolve the issues they detect.

To make matters worse, Kubernetes is often used to build microservices applications, in which each microservice is developed by a separate team. In other cases, there are DevOps and application development teams collaborating on the same Kubernetes cluster. This creates a lack of clarity about division of responsibility — if there is a problem with a pod, is that a DevOps problem, or something to be resolved by the relevant application team?

In short — Kubernetes troubleshooting can quickly become a mess, waste major resources and impact users and application functionality — unless teams closely coordinate and have the right tools available.

If you are experiencing one of these common Kubernetes errors, here’s a quick guide to identifying and resolving the problem:

This error is usually the result of a missing Secret or ConfigMap. Secrets are Kubernetes objects used to store sensitive information like database credentials. ConfigMaps store data as key-value pairs, and are typically used to hold configuration information used by multiple pods.
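
For reference, here is a rough sketch of how these two objects are typically created; the names and values (db-credentials, app-config, and their keys) are purely illustrative:

kubectl create secret generic db-credentials --from-literal=username=admin --from-literal=password=changeme   # a Secret holding database credentials
kubectl create configmap app-config --from-literal=LOG_LEVEL=debug                                            # a ConfigMap holding configuration data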

Run kubectl get pods.

Check the output to see if the pod’s status is CreateContainerConfigError.

To get more information about the issue, run kubectl describe pod [name] and look for a message indicating which ConfigMap is missing.

Now run kubectl get configmap [name] to see if the ConfigMap exists in the cluster.

For example: $ kubectl get configmap configmap-3

Make sure the ConfigMap is available by running kubectl get configmap [name] again. If you want to view the content of the ConfigMap in YAML format, add the flag -o yaml.

Once you have verified the ConfigMap exists, run kubectl get pods again and verify the pod is in status Running.
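
As a rough sketch of the fix, assuming the missing ConfigMap is the configmap-3 from the example above and that a single key-value pair is enough for the pod (both assumptions are for illustration only):

kubectl create configmap configmap-3 --from-literal=key1=value1   # create the missing ConfigMap with a sample key/value
kubectl get configmap configmap-3 -o yaml                         # confirm it exists and inspect its contents
kubectl get pods                                                  # the pod should now move from CreateContainerConfigError to Running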

This status means that a pod could not run because it attempted to pull a container image from a registry, and failed. The pod refuses to start because it cannot create one or more containers defined in its manifest.

Run the command kubectl get pods.

Check the output to see if the pod status is ImagePullBackOff or ErrImagePull.

Run the kubectl describe pod [name] command for the problematic pod.

The output of this command will indicate the root cause of the issue: typically, the image name or tag is misspelled or does not exist in the registry, or Kubernetes does not have the credentials (imagePullSecrets) it needs to pull from the registry.
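
A minimal sketch of the diagnostic flow, assuming a hypothetical pod named mypod1 whose manifest lives in mypod1.yaml:

kubectl get pods              # the pod shows ImagePullBackOff or ErrImagePull
kubectl describe pod mypod1   # the Events section explains why the image pull failed
# fix the image name, tag, or registry credentials in the manifest, then re-apply it:
kubectl apply -f mypod1.yaml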

This status indicates that a container in the pod is repeatedly crashing and being restarted, with Kubernetes waiting an increasing back-off delay between restarts. This can happen because the application inside the container keeps failing, because the container exceeds its resource limits and is killed, or because of a configuration error such as a missing file or dependency.

Run the command kubectl get pods.

Check the output to see if the pod status is CrashLoopBackOff.

Run the kubectl describe pod [name] command for the problematic pod.

The output will help you identify the cause of the issue. Common causes include insufficient resources (for example, the container exceeds its memory limit and is OOMKilled), errors in the application itself, and misconfiguration such as a missing file, environment variable, or dependency.
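
A minimal sketch of how to narrow this down, again assuming a hypothetical pod named mypod1:

kubectl get pods                 # a climbing RESTARTS count confirms a crash loop
kubectl describe pod mypod1      # check Last State and Events for the exit reason (for example, OOMKilled)
kubectl logs mypod1 --previous   # retrieve logs from the previous, crashed container instance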

When a worker node shuts down or crashes, all stateful pods that reside on it become unavailable, and the node status appears as NotReady.

If a node has a NotReady status for over five minutes (by default), Kubernetes changes the status of pods scheduled on it to Unknown, and attempts to schedule them on other nodes, with status ContainerCreating.

Run the command kubectl get nodes.

Check the output to see if the node status is NotReady.

To check if pods scheduled on your node are being moved to other nodes, run the command kubectl get pods.

Check the output to see if a pod appears twice, once with status Unknown on the failed node and once with status ContainerCreating on a healthy node.

If the failed node is able to recover or is rebooted by the user, the issue will resolve itself. Once the failed node recovers and rejoins the cluster, the pod with Unknown status is deleted and its volumes are detached from the recovered node; the replacement pod can then attach those volumes and move from ContainerCreating to Running on its new node.

If you have no time to wait, or the node does not recover, you’ll need to help Kubernetes reschedule the stateful pods on another, working node. There are two ways to achieve this: remove the failed node from the cluster, or force-delete the stateful pods that are stuck in Unknown status.
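
The commands below sketch both options, assuming a hypothetical failed node named worker0 and a stuck pod named mypod1:

kubectl delete node worker0                          # option 1: remove the failed node from the cluster
kubectl delete pod mypod1 --grace-period=0 --force   # option 2: force-delete a pod stuck in Unknown status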

If you’re experiencing an issue with a Kubernetes pod, and you couldn’t find and quickly resolve the error in the section above, here is how to dig a bit deeper. The first step to diagnosing pod issues is running kubectl describe pod [name].

Pay special attention to the most important sections of the describe pod output: the pod’s current state and, at the end, the Events section.

Continue debugging based on the pod state.

If a pod’s status is Pending for a while, it could mean that it cannot be scheduled onto a node. Look at the describe pod output, in the Events section. Try to identify messages that indicate why the pod could not be scheduled: for example, insufficient CPU or memory on the available nodes, or no node matching the pod’s node selector or tolerating its taints.
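
As a quick sketch of how to compare the pod’s requests with what the nodes can offer (the pod name mypod1 is hypothetical):

kubectl describe pod mypod1   # Events may show a message such as "0/3 nodes are available: insufficient memory"
kubectl describe nodes        # compare each node's Allocatable and Allocated resources with the pod's requests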

If a pod’s status is Waiting, this means it is scheduled on a node, but unable to run. Look at the describe pod output, in the ‘Events’ section, and try to identify reasons the pod is not able to run.

Most often, this will be due to an error when fetching the image. If so, check that the image name is spelled correctly, that the image tag exists in the registry, and that the cluster has the credentials needed to pull from that registry.
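
One quick way to double-check these points from the command line (the pod name mypod1 is hypothetical):

kubectl get pod mypod1 -o jsonpath='{.spec.containers[*].image}'   # print the exact image reference the pod is trying to pull
kubectl get secrets                                                # confirm that any imagePullSecrets referenced by the pod exist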

If a pod is not running as expected, there are two common causes: an error in the pod manifest, or a mismatch between your local pod manifest and the manifest on the API server.

It is common to introduce errors into a pod manifest, for example by nesting sections incorrectly or misspelling a field name.

Try deleting the pod and recreating it with kubectl apply --validate -f mypod1.yaml

This command will return a validation error if you misspelled a field in the pod manifest, for example if you wrote continers instead of containers.
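
As a complementary check (not required, but a standard kubectl feature), a server-side dry run sends the manifest through the API server’s validation and admission chain without creating anything:

kubectl apply --dry-run=server -f mypod1.yaml   # validate the manifest against the API server without creating the pod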

It can happen that the pod manifest, as recorded by the Kubernetes API Server, is not the same as your local manifest — hence the unexpected behavior.

Run this command to retrieve the pod manifest from the API server and save it as a local YAML file:
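
kubectl get pod [pod-name] -o yaml > apiserver-[pod-name].yaml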

You will now have a local file called apiserver-[pod-name].yaml. Open it and compare it with your local YAML. There are three possible cases: the manifests are identical, so the problem must lie elsewhere; the manifests differ because your local changes were never applied, in which case re-apply your local manifest; or the manifests differ because the stored version was modified by something else, in which case you should find out what changed it.
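
A small sketch of the comparison, assuming the local manifest is mypod1.yaml and the pod is named mypod1 (names are illustrative):

kubectl get pod mypod1 -o yaml > apiserver-mypod1.yaml   # save the manifest as seen by the API server
diff apiserver-mypod1.yaml mypod1.yaml                   # any differences explain the unexpected behavior

Alternatively, kubectl diff -f mypod1.yaml compares your local manifest against the live object in a single step.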

If you weren’t able to diagnose your pod issue using the methods above, there are several additional methods to perform deeper debugging of your pod:

You can retrieve logs for a malfunctioning container using this command:
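
kubectl logs [pod-name] [container-name]   # the container name can be omitted if the pod runs only one container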

If the container has crashed, you can use the --previous flag to retrieve its crash log, like so:
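
kubectl logs [pod-name] [container-name] --previous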

Many container images contain debugging utilities; this is true for most images derived from Linux and Windows OS base images. This allows you to run commands in a shell within the malfunctioning container, as follows:
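
kubectl exec -it [pod-name] --container [container-name] -- sh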

There are several cases in which you cannot use the kubectl exec command: for example, when the container has already crashed, or when the container image is minimal (distroless) and does not include a shell or debugging utilities.

The solution, supported in Kubernetes v1.18 and later, is to run an “ephemeral container”. This is a container that runs alongside your production container in the same pod and shares its environment, allowing you to run shell commands as if you were running them in the real container, even after it has crashed.

Create an ephemeral container using kubectl debug -it [pod-name] --image=[image-name] --target=[container-name].

The --target flag is important because it lets the ephemeral container share the process namespace of the target container running in the pod.

After running the debug command, kubectl will show a message with your ephemeral container name; take note of this name so you can work with the container.

You can now run kubectl exec on your new ephemeral container, and use it to debug your production container.
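
A minimal end-to-end sketch, assuming a pod named myapp whose application container is also named myapp, and using busybox as the debug image (all names are illustrative):

kubectl debug -it myapp --image=busybox --target=myapp   # start an ephemeral busybox container sharing myapp's process namespace
ps aux                                                   # inside the ephemeral shell: inspect the target container's processes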

If none of these approaches work, you can create a special pod on the node, running in the host namespace with host privileges. This method is not recommended in production environments for security reasons.

Run a special debug pod on your node using kubectl debug node/[node-name] -it --image=[image-name].

After running the debug command, kubectl will show a message with the name of your new debugging pod; take note of this name so you can work with it.

Note that the new pod runs a container in the host IPC, Network, and PID namespaces. The root filesystem is mounted at /host.

When finished with the debugging pod, delete it using kubectl delete pod [debug-pod-name].
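
A short sketch of the whole flow, assuming a node named worker1 and ubuntu as the debug image (both are illustrative choices):

kubectl debug node/worker1 -it --image=ubuntu   # run a privileged debug pod on the node
ls /host/var/log                                # inspect the node's filesystem, which is mounted at /host
exit                                            # leave the debug shell
kubectl delete pod [debug-pod-name]             # then clean up, using the pod name kubectl printed earlier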

The first step to troubleshooting container issues is to get basic information on the Kubernetes worker nodes and Services running on the cluster.

To see a list of worker nodes and their status, run kubectl get nodes --show-labels. The output will be something like this:

NAME      STATUS   ROLES    AGE   VERSION   LABELS
worker0   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker0
worker1   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker1
worker2   Ready    <none>   1d    v1.13.0   ...,kubernetes.io/hostname=worker2

To get information about Services running on the cluster, run:

kubectl cluster-info

The output lists the addresses at which the control plane and core cluster services, such as CoreDNS, are running.
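
If you need more detail than this summary, kubectl cluster-info dump prints extensive diagnostic information about the cluster and its workloads; because the output is very large, it is usually written to a directory:

kubectl cluster-info dump --output-directory=/path/to/cluster-state   # dump detailed cluster state to files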

Let’s look at several common cluster failure scenarios, their impact, and how they can typically be resolved. This is not a complete guide to cluster troubleshooting, but can help you resolve the most common issues.

The troubleshooting process in Kubernetes is complex and, without the right tools, can be stressful, ineffective and time-consuming. Some best practices can help minimize the chances of things breaking down, but eventually, something will go wrong — simply because it can.

This is the reason why we created Komodor, a tool that helps dev and ops teams stop wasting their precious time looking for needles in (hay)stacks every time things go wrong.

Acting as a single source of truth (SSOT) for all of your k8s troubleshooting needs, Komodor offers:
