A few days ago a friend's Kubernetes cluster got stuck and they asked me to help recover it. Since a large part of their site was down as a result, here is a write-up of the troubleshooting process and the solution.
The environment and the problem
The environment has a single master node, i.e., each control-plane component (etcd, kube-scheduler, etc.) runs as a single pod.
The trouble started when their service became inaccessible. To fix it, my friend:
- Backed up the etcd data
- Restarted docker
- Restored the etcd data from a backup that was 3 days old
After that, the service was still inaccessible.
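For context, this kind of single-node etcd backup and restore is usually done with etcdctl snapshots. A minimal sketch, assuming the kubeadm default certificate paths and an illustrative snapshot path (the actual commands used here may have differed):

# Take a snapshot of the running etcd (certificate paths are the kubeadm defaults)
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Restore the snapshot into a fresh data directory, then point the etcd
# static pod manifest at that directory
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored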
Problem diagnosis
Deployment revision mismatch
The first thing noticed was that the pods were not in the Running state. Deleting a pod directly so it would be rebuilt and watching the creation process showed that the new pod was never assigned to a node.
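A minimal way to confirm this kind of scheduling problem, with an illustrative pod name:

# A Pending pod whose NODE column shows <none> was never scheduled
kubectl get pods -o wide
# The Events section at the end of describe explains why scheduling did not happen
kubectl describe pod my-app-xxxx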
Problem analysis
The first suspicion was that there might be a problem with kube-scheduler:
- Deleting the kube-scheduler pod showed that it could not be rebuilt.
- The scheduler pod was eventually recreated by moving its static pod manifest out of /etc/kubernetes/manifests/ and then back in (see the sketch below).
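This works because the kubelet watches /etc/kubernetes/manifests/ for static pod manifests: removing a manifest tears the mirror pod down, and putting it back recreates it. Roughly (the file name is the kubeadm default):

# Move the scheduler manifest out so the kubelet tears the pod down...
mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/
# ...wait for the mirror pod to disappear, then move it back to recreate it
sleep 20
mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/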
Pods still could not be scheduled at this point, which pointed to a problem upstream of the scheduler. Checking the api-server logs showed a large number of revision-mismatch errors, apparently because the revisions of the resources in the cluster no longer matched the revisions of the resources stored in etcd:
- Using etcdctl to check the status of etcd showed that etcd itself was fine:

etcdctl endpoint health
etcdctl endpoint status --write-out=table

- Using kubectl rollout history deployment/<deployment_name> shows the revisions of the deployment saved in etcd; running kubectl rollout undo deployment/<deployment_name> --to-revision=<version> then rolls the deployment back to a revision that matches etcd. Before rolling back, kubectl rollout history deployment/<deployment_name> --revision=<version> can be used to compare the configuration stored in etcd with the one in the environment (see the sketch after this list).
- After rolling back, the pods could be created normally.
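Put together, the rollback workflow looks roughly like this, with a hypothetical deployment named my-app and revision 3:

# List the recorded revisions of the deployment
kubectl rollout history deployment/my-app
# Inspect one revision in detail to compare it with what is currently applied
kubectl rollout history deployment/my-app --revision=3
# Roll back to that revision
kubectl rollout undo deployment/my-app --to-revision=3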
Missing iptables rules
After the pods were up, the services were still not accessible. Looking at a service with the kubectl describe command showed that it had no endpoints. At first this seemed to be a problem with that service's yaml, but after half a day of debugging it turned out that the vast majority of services had no endpoints.
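The quickest way to see this, with a hypothetical service named my-service:

# The Endpoints field in describe is empty for a broken service
kubectl describe service my-service
# A healthy service lists pod IP:port pairs here
kubectl get endpoints my-service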
Problem analysis
A service with no endpoints suggested, at the system level, that the iptables rules for it might never have been created:
- Running iptables-save confirmed that, sure enough, there were no kubernetes iptables rules at all.
- The environment runs kube-proxy in ipvs mode; ipvsadm -l -n likewise showed no pod IPs behind the services' cluster IPs.
- Checking the kube-proxy logs did not reveal any anomalies.
The options that came to mind at this point were:
- Re-create the pods and corresponding services to refresh iptables: attempted and failed, no iptables rules were generated after the rebuild.
- Rebuild the nodes: all nodes had the problem, so there was nowhere to migrate the pods to with kubectl drain.
- Add the iptables rules manually: too complex, and even if it succeeded it would pollute the node's iptables rules.
- Re-create the kube-proxy pod: the iptables rules were still not created after restarting the kube-proxy pod.
Finally, kube-proxy itself was suspected of also having a problem and needing to be reinitialized, and kubeadm happens to provide a command to reinitialize kube-proxy:
kubeadm init phase addon kube-proxy --kubeconfig ~/.kube/config --apiserver-advertise-address <api-server-ip>
After reinitializing kube-proxy, the iptables rules were created successfully. After deleting and recreating the pods and services, the corresponding iptables rules were created correctly, and the services now had endpoints.
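A quick way to verify the result (assuming ipvs mode, as in this environment):

# kube-proxy's nat chains should be back
iptables-save | grep KUBE-SERVICES | head
# each cluster IP should list pod IPs as real servers again
ipvsadm -Ln
# and the recreated services should have endpoints
kubectl get endpoints --all-namespaces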
CNI connection error
After the pods were restarted in the previous step, one pod belonging to a webhook failed to come back up. Running kubectl describe on that pod showed the following error:
networkPlugin cni failed to set up pod "webhook-1" network: Get "https://[10.233.0.1]:443/api/v1/namespaces/volcano-system": dial tcp 10.233.0.1:443: i/o timeout
The cluster uses Calico as its CNI, and looking at the daemonset for the CNI showed that only 5 of its pods were READY.
Deleting the calico-node pod on the node hosting the webhook-1 pod did not help: the new calico-node pod also failed to start.
Problem analysis
In the error above, 10.233.0.1 is the cluster IP of the kubernetes apiserver service. Since the calico-node pod uses hostNetwork, connectivity can be tested directly on the node; testing with telnet 10.233.0.1 443 showed that the address was indeed unreachable.
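Because the pod is on the host network, the broken cluster IP path and the apiserver's real listening address can be compared directly from the node. A small sketch, assuming the kubeadm default secure port 6443 (a TLS or auth error from curl is fine here; it still proves the connection itself works):

# goes through the kube-proxy rules; in this environment it timed out
telnet 10.233.0.1 443
# talks to the apiserver directly on its own address and port
curl -k https://<api-server-pod-ip>:6443/healthz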
Calico's CNI configuration file under /etc/cni// defines the kubeconfig file used to connect to the apiserver:
{
  "name": "cni0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      ...
      "kubernetes": {
        "kubeconfig": "/etc/cni//calico-kubeconfig"
      }
    },
    ...
  ]
}
The address and port used to connect to the apiserver are in turn defined in /etc/cni//calico-kubeconfig, so simply replacing that address and port with the address and port of the apiserver pod should solve the problem:
# cat /etc/cni//calico-kubeconfig
# Kubeconfig file for Calico CNI plugin.
apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: https://[10.233.0.1]:443
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0...
users:
- name: calico
  user:
    token: eyJhbGciOiJSUzI1NiIsImtpZC...
contexts:
- name: calico-context
  context:
    cluster: local
    user: calico
Calico provides the following two environment variables to override the apiserver address and port in the generated kubeconfig. Adding them to Calico's daemonset and recreating the calico-node pods is enough:
- name: KUBERNETES_SERVICE_HOST
  value: <api-server-pod-ip>
- name: KUBERNETES_SERVICE_PORT
  value: "6443"
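One way to apply this without editing the manifest by hand, assuming the daemonset is named calico-node in the kube-system namespace (the usual Calico defaults):

# inject the two variables; this rolls the calico-node pods automatically
kubectl set env daemonset/calico-node -n kube-system \
  KUBERNETES_SERVICE_HOST=<api-server-pod-ip> \
  KUBERNETES_SERVICE_PORT=6443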
At this point the problem was basically solved. The earlier mis-operation left the cluster with a large number of lingering issues; as a follow-up, the pods on each node can be evicted, the node reinitialized, and the cluster's node configuration gradually rebuilt.
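For one node, that follow-up looks roughly like this (node name and join parameters are illustrative; a kubeadm-managed cluster is assumed):

# evict the workloads from the node
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# on the node itself: wipe the kubeadm state and rejoin the cluster
kubeadm reset
kubeadm join <api-server-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>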