A few days ago a friend's Kubernetes cluster got stuck and they asked me to help recover it. Since a large part of their site was down as a result, here is a write-up of the troubleshooting process and the solution.
The environment and the problem
The environment has a single master node, i.e., each control-plane component (etcd, kube-scheduler, etc.) runs as a single pod.
The trouble started when their service became inaccessible. To fix it, my friend:
- Backed up the etcd data
- Restarted docker
- Restored the etcd data from a backup that was 3 days old
After that, the service was still inaccessible.
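For context, this kind of single-node etcd backup and restore is usually done with etcdctl snapshots. A minimal sketch, assuming the kubeadm default certificate paths and an illustrative snapshot path (the actual commands used here may have differed):

# Take a snapshot of the running etcd (certificate paths are the kubeadm defaults)
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Restore the snapshot into a fresh data directory, then point the etcd
# static pod manifest at that directory
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored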
Problem diagnosis
Deployment revision mismatch
The first thing noticed was that the pods were not in the Running state. Deleting a pod directly so it would be rebuilt and watching the creation process showed that the new pod was never assigned to a node.
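A minimal way to confirm this kind of scheduling problem, with an illustrative pod name:

# A Pending pod whose NODE column shows <none> was never scheduled
kubectl get pods -o wide
# The Events section at the end of describe explains why scheduling did not happen
kubectl describe pod my-app-xxxx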
Problem analysis
The first suspicion was that there might be a problem with kube-scheduler:
- Deleting the kube-scheduler pod showed that it could not be rebuilt.
- The scheduler pod was eventually recreated by moving its static pod manifest out of /etc/kubernetes/manifests/ and then back in (see the sketch below).
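This works because the kubelet watches /etc/kubernetes/manifests/ for static pod manifests: removing a manifest tears the mirror pod down, and putting it back recreates it. Roughly (the file name is the kubeadm default):

# Move the scheduler manifest out so the kubelet tears the pod down...
mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/
# ...wait for the mirror pod to disappear, then move it back to recreate it
sleep 20
mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/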
Pods still could not be scheduled at this point, which pointed to a problem upstream of the scheduler. Checking the api-server logs showed a large number of revision-mismatch errors, apparently because the revisions of the resources in the cluster no longer matched the revisions of the resources stored in etcd:
- Using etcdctl to check the status of etcd showed that etcd itself was fine:

etcdctl endpoint health
etcdctl endpoint status --write-out=table

- Using kubectl rollout history deployment/<deployment_name> shows the revisions of the deployment saved in etcd; running kubectl rollout undo deployment/<deployment_name> --to-revision=<version> then rolls the deployment back to a revision that matches etcd. Before rolling back, kubectl rollout history deployment/<deployment_name> --revision=<version> can be used to compare the configuration stored in etcd with the one in the environment (see the sketch after this list).
- After rolling back, the pods could be created normally.
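Put together, the rollback workflow looks roughly like this, with a hypothetical deployment named my-app and revision 3:

# List the recorded revisions of the deployment
kubectl rollout history deployment/my-app
# Inspect one revision in detail to compare it with what is currently applied
kubectl rollout history deployment/my-app --revision=3
# Roll back to that revision
kubectl rollout undo deployment/my-app --to-revision=3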
Missing iptables rules
After the pods were up, the services were still not accessible. Looking at a service with the kubectl describe command showed that it had no endpoints. At first this seemed to be a problem with that service's yaml, but after half a day of debugging it turned out that the vast majority of services had no endpoints.
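The quickest way to see this, with a hypothetical service named my-service:

# The Endpoints field in describe is empty for a broken service
kubectl describe service my-service
# A healthy service lists pod IP:port pairs here
kubectl get endpoints my-service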
Problem analysis
A service with no endpoints suggested, at the system level, that the iptables rules for it might never have been created:
- Running iptables-save confirmed that, sure enough, there were no kubernetes iptables rules at all.
- The environment runs kube-proxy in ipvs mode; ipvsadm -l -n likewise showed no pod IPs behind the services' cluster IPs.
- Checking the kube-proxy logs did not reveal any anomalies.
The options that came to mind at this point were:
- Re-create the pods and corresponding services to refresh iptables: attempted and failed, no iptables rules were generated after the rebuild.
- Rebuild the nodes: all nodes had the problem, so there was nowhere to migrate the pods to with kubectl drain.
- Add the iptables rules manually: too complex, and even if it succeeded it would pollute the node's iptables rules.
- Re-create the kube-proxy pod: the iptables rules were still not created after restarting the kube-proxy pod.
Finally, kube-proxy itself was suspected of also having a problem and needing to be reinitialized, and kubeadm happens to provide a command to reinitialize kube-proxy:
kubeadm init phase addon kube-proxy --kubeconfig ~/.kube/config --apiserver-advertise-address <api-server-ip>
After reinitializing kube-proxy, the iptables rules were created successfully. After deleting and recreating the pods and services, the corresponding iptables rules were created correctly, and the services now had endpoints.
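A quick way to verify the result (assuming ipvs mode, as in this environment):

# kube-proxy's nat chains should be back
iptables-save | grep KUBE-SERVICES | head
# each cluster IP should list pod IPs as real servers again
ipvsadm -Ln
# and the recreated services should have endpoints
kubectl get endpoints --all-namespaces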
CNI connection error
After the pods were restarted in the previous step, one pod belonging to a webhook failed to come back up. Running kubectl describe on that pod showed the following error:
networkPlugin cni failed to set up pod "webhook-1" network: Get "https://[10.233.0.1]:443/api/v1/namespaces/volcano-system": dial tcp 10.233.0.1:443: i/o timeout
The cluster uses Calico as its CNI, and looking at the daemonset for the CNI showed that only 5 of its pods were READY.
Deleting the calico-node pod on the node hosting the webhook-1 pod did not help: the new calico-node pod also failed to start.
Problem analysis
In the error above, 10.233.0.1 is the cluster IP of the kubernetes apiserver service. Since the calico-node pod uses hostNetwork, connectivity can be tested directly on the node; testing with telnet 10.233.0.1 443 showed that the address was indeed unreachable.
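Because the pod is on the host network, the broken cluster IP path and the apiserver's real listening address can be compared directly from the node. A small sketch, assuming the kubeadm default secure port 6443 (a TLS or auth error from curl is fine here; it still proves the connection itself works):

# goes through the kube-proxy rules; in this environment it timed out
telnet 10.233.0.1 443
# talks to the apiserver directly on its own address and port
curl -k https://<api-server-pod-ip>:6443/healthz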
Calico's CNI configuration file under /etc/cni// defines the kubeconfig file used to connect to the apiserver:
{
  "name": "cni0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      ...
      "kubernetes": {
        "kubeconfig": "/etc/cni//calico-kubeconfig"
      }
    },
    ...
  ]
}
The address and port used to connect to the apiserver are in turn defined in /etc/cni//calico-kubeconfig, so simply replacing that address and port with the address and port of the apiserver pod should solve the problem:
# cat /etc/cni//calico-kubeconfig
# Kubeconfig file for Calico CNI plugin.
apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://[10.233.0.1]:443
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0...
users:
- name: calico
  user:
    token: eyJhbGciOiJSUzI1NiIsImtpZC...
contexts:
- name: calico-context
  context:
    cluster: local
    user: calico
Calico provides the following two environment variables to override the apiserver address and port in the generated kubeconfig. Adding them to Calico's daemonset and recreating the calico-node pods is enough:
- name: KUBERNETES_SERVICE_HOST
  value: <api-server-pod-ip>
- name: KUBERNETES_SERVICE_PORT
  value: "6443"
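One way to apply this without editing the manifest by hand, assuming the daemonset is named calico-node in the kube-system namespace (the usual Calico defaults):

# inject the two variables; this rolls the calico-node pods automatically
kubectl set env daemonset/calico-node -n kube-system \
  KUBERNETES_SERVICE_HOST=<api-server-pod-ip> \
  KUBERNETES_SERVICE_PORT=6443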
At this point the problem was basically solved. The earlier mis-operation left the cluster with a large number of lingering issues; as a follow-up, the pods on each node can be evicted, the node reinitialized, and the cluster's node configuration gradually rebuilt.
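For one node, that follow-up looks roughly like this (node name and join parameters are illustrative; a kubeadm-managed cluster is assumed):

# evict the workloads from the node
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# on the node itself: wipe the kubeadm state and rejoin the cluster
kubeadm reset
kubeadm join <api-server-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>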