In a production cluster, while the business is running, you suddenly find that one Pod is emitting a large number of error logs while the other Pods are healthy. How should you handle it?
- Delete the Pod directly?
That destroys the evidence and may make it harder to determine the root cause of the problem.
- Ask the business side to hold on for a while and troubleshoot first?
You would get chewed out for that.
The best solution is to stop the Pod from receiving traffic while keeping the Pod itself alive.
Approach:
- Stop receiving traffic
Traffic is stopped by modifying the Pod's labels. In essence this removes the Pod from the Service's endpoints, so both the service framework and HTTP routing drop this instance and stop forwarding traffic to it.
The premise, of course, is that instance discovery for both the service framework and HTTP is based on k8s endpoints (in theory everyone does this, though hacks cannot be ruled out).
Before touching the labels, first proactively call the service's offline method. In theory the same method should be wired into the Pod's preStop hook, so that when a Pod is deleted the hook is called first and only then is the Pod removed.
preStop:
  exec:
    command:
      - /bin/sh
      - -c
      - /bin/
- Remove the Pod from the Workload
After the offline call has completed, modify the Pod's labels. Changing the labels takes the Pod out of the Workload's control and turns it into an orphan Pod. Make sure the label change also stops the Service's selector from matching the Pod, so that the Pod is removed from the endpoints as well and service discovery no longer sees this instance.
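The label change described above works because both the Workload and the Service pick their Pods with label selectors. The following is a minimal sketch in Python of equality-based selector matching, not the author's tooling: the label keys and values (`app`, `pod-template-hash`), the `-debug` suffix, the Pod name `demo-pod`, and the helper names are all hypothetical. It computes a label set that no selector matches any more and builds the corresponding `kubectl label ... --overwrite` command.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Equality-based matching: every key/value in the selector must be on the Pod."""
    return all(labels.get(key) == value for key, value in selector.items())


def isolate_labels(labels: dict, selectors: list, suffix: str = "-debug") -> dict:
    """Return a copy of the labels that none of the given selectors match."""
    new_labels = dict(labels)
    for selector in selectors:
        # An empty selector has no key to rewrite, so it is skipped here.
        if selector and selector_matches(selector, new_labels):
            key = sorted(selector)[0]  # changing one selected key is enough
            new_labels[key] = new_labels[key] + suffix
    return new_labels


def label_command(namespace: str, pod: str, old: dict, new: dict) -> list:
    """Build the kubectl argv that applies only the changed labels."""
    changed = [f"{k}={v}" for k, v in sorted(new.items()) if old.get(k) != v]
    return ["kubectl", "label", "pod", pod, "-n", namespace, *changed, "--overwrite"]


# Hypothetical labels and selectors, for illustration only.
pod_labels = {"app": "web", "pod-template-hash": "5d4f8c7b9"}
service_selector = {"app": "web"}
workload_selector = {"app": "web", "pod-template-hash": "5d4f8c7b9"}

isolated = isolate_labels(pod_labels, [service_selector, workload_selector])
command = label_command("default", "demo-pod", pod_labels, isolated)
print(" ".join(command))
```

With these example values a single changed label (`app=web-debug`) is enough to detach the Pod from both selectors, which is why the Service selector and the Workload selector should be checked together before picking the label to change.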
- What if the Pod is a consumer-type workload, such as an nsq worker, that has no way to take itself offline?
In this case you can cut off the Pod's network directly so that it can no longer receive traffic. This is also simple: add iptables rules inside the Pod that reject all TCP traffic except traffic to and from the node and loopback.
# Allow traffic to and from the node so the kubelet liveness check does not fail
/sbin/iptables -A INPUT -s {node_ip}/32 -j ACCEPT &&
/sbin/iptables -A OUTPUT -d {node_ip}/32 -j ACCEPT &&
# Allow loopback traffic
/sbin/iptables -A OUTPUT -s localhost -d localhost -j ACCEPT &&
/sbin/iptables -A INPUT -s localhost -d localhost -j ACCEPT &&
# Let TCP RST packets through
/sbin/iptables -A INPUT -p tcp --tcp-flags RST RST -j ACCEPT &&
/sbin/iptables -A OUTPUT -p tcp --tcp-flags RST RST -j ACCEPT &&
# Reject all remaining TCP traffic with a reset
/sbin/iptables -A INPUT -p tcp -j REJECT --reject-with tcp-reset &&
/sbin/iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset
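The `{node_ip}` placeholder in the rules above has to be filled in with the IP of the node hosting the Pod. A minimal sketch in Python of rendering the rules follows. The helper names are hypothetical, and running the result through `kubectl exec` is an assumption about delivery, not something the original states; it would also require the container to ship `iptables` and to be allowed to modify its network rules.

```python
import ipaddress

# Rules copied from the text above (comment dropped); {node_ip} is the only placeholder.
RULE_TEMPLATE = """\
/sbin/iptables -A INPUT -s {node_ip}/32 -j ACCEPT &&
/sbin/iptables -A OUTPUT -d {node_ip}/32 -j ACCEPT &&
/sbin/iptables -A OUTPUT -s localhost -d localhost -j ACCEPT &&
/sbin/iptables -A INPUT -s localhost -d localhost -j ACCEPT &&
/sbin/iptables -A INPUT -p tcp --tcp-flags RST RST -j ACCEPT &&
/sbin/iptables -A OUTPUT -p tcp --tcp-flags RST RST -j ACCEPT &&
/sbin/iptables -A INPUT -p tcp -j REJECT --reject-with tcp-reset &&
/sbin/iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset"""


def render_isolation_rules(node_ip: str) -> str:
    """Return the shell command string with the node IP filled in."""
    # Validate first so nothing but a plain IPv4 address reaches the shell.
    address = ipaddress.IPv4Address(node_ip)
    return RULE_TEMPLATE.format(node_ip=str(address))


def build_exec_command(namespace: str, pod: str, node_ip: str) -> list:
    """Assumed delivery: run the rendered rules inside the Pod via kubectl exec."""
    rules = render_isolation_rules(node_ip)
    return ["kubectl", "exec", "-n", namespace, pod, "--", "/bin/sh", "-c", rules]


# Hypothetical values, for illustration only.
argv = build_exec_command("default", "demo-pod", "10.0.0.5")
print(argv[-1])
```

Validating the address with `ipaddress.IPv4Address` raises a `ValueError` on malformed input, which keeps a bad value from being interpolated into a shell command.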