k8s practice: namespace isolation + the request-key mechanism to solve kernel-space domain name resolution for CSI

0x01 Background

Pods need to mount PVs backed by remote storage, and the storage service is itself exposed through a Service in the same k8s cluster. Initially the approach was:

Resolve the Service's clusterIP in the CSI driver.
Mount the PV using that clusterIP.

But traffic sent to a clusterIP goes through several translations:

ClusterIP to Pod IP: one NAT.
Pod IP to the final service: one more forwarding hop, whose exact cost depends on the CNI implementation.

This caused a severe performance loss when the client wrote to the PV.

0x02 Solution

Since going through the container network performs poorly, we changed the server-side deployment to hostNetwork, bypassing the container network. However, the storage service may move to another node, after which the client cannot reconnect properly, and that is unacceptable. (The data inconsistency caused by switching nodes can be handled.)

New approach:
Create a Headless Service for the server. For Deployment-type workloads, a Headless Service resolves to the list of IP addresses of all the Pods; see the official documentation for details. So the only remaining question is how the domain name gets resolved when the client reconnects. Because the driver is provided by the kernel, glibc's name resolution functions cannot be used directly in the kernel, i.e. an external DNS server cannot be used, even if it is specified in /etc/.
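For reference, this is what the same lookup looks like from user space, where glibc is available. It is only a minimal sketch: the function name resolve_all is made up for illustration, and it uses the standard socket.getaddrinfo call to collect every IPv4 address a name resolves to, which for a Headless Service would be the Pod IPs.

```python
import socket


def resolve_all(hostname, port=0):
    """Return the sorted, de-duplicated IPv4 addresses that hostname resolves to."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET the sockaddr is an (ip, port) tuple.
    return sorted({info[4][0] for info in infos})
```

Inside a Pod, calling resolve_all with the Headless Service's cluster DNS name would be expected to return one address per backing Pod. The kernel driver has no equivalent of this call, which is the gap the rest of the article deals with.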

0x03 request-key mechanism

Investigation showed that the kernel provides the request-key mechanism. request-key was originally intended for managing security tokens between the kernel and user space, but has since been extended to other uses. When the kernel resolves a domain name, the process is roughly as follows:

The kernel sends a domain name resolution request to the dns_resolver module [kernel space].
dns_resolver issues a request-key request to the key management module [kernel space].
The key management module calls out to /sbin/request-key in user space [kernel space].
/sbin/request-key dispatches to the matching command based on the configuration in /etc/, for example /sbin/key.dns_resolver [user space].
/sbin/key.dns_resolver resolves the name through glibc, then uses the request-key related system calls to set the key's payload, i.e. the IP address corresponding to the domain name [user space].
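The dispatch step in the list above can be modeled in a few lines. This is a simplified sketch, not the real parser: it assumes rules of the form "OP TYPE DESCRIPTION CALLOUT-INFO PROGRAM ARG...", wildcard matching on the middle fields, first match wins, and it only substitutes the %k (key ID) macro. SAMPLE_RULES and pick_handler are illustrative names.

```python
import fnmatch
import shlex

# Assumed rule layout: OP TYPE DESCRIPTION CALLOUT-INFO PROGRAM ARG...
SAMPLE_RULES = """
# op    type          desc  callout  program                 args
create  dns_resolver  *     *        /sbin/key.dns_resolver  %k
"""


def pick_handler(rules, op, key_type, desc, callout, key_id):
    """Return the argv of the first matching rule, or None if nothing matches."""
    for line in rules.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = shlex.split(line)
        if len(fields) < 5:
            continue  # not a complete rule in this simplified model
        r_op, r_type, r_desc, r_callout, *argv = fields
        if r_op != op or not fnmatch.fnmatchcase(key_type, r_type):
            continue
        if fnmatch.fnmatchcase(desc, r_desc) and fnmatch.fnmatchcase(callout, r_callout):
            # Only the %k macro (key serial number) is modeled here.
            return [arg.replace("%k", str(key_id)) for arg in argv]
    return None
```

With the sample rule, a "create" request for a dns_resolver key with ID 42 maps to the command /sbin/key.dns_resolver 42. Pointing the program field at a different executable is how resolution can be handed to a custom program.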

But there is a new problem: key.dns_resolver can only resolve names using /etc/hosts and /etc/, and does not support resolving names through additional DNS servers.

0x04 Concrete options

Every option requires modifying the /etc/ configuration file so that resolution is handled by our own program.

The following options are available for the rest of the process:

Write your own script, register it via the /etc/ configuration file, query the Pod IP address with kubectl, and call /sbin/request-key to write the result back.

Problem: the dns_resolver and request-key modules handle C strings inconsistently, so an IP address written back with /sbin/request-key is rejected as invalid by the dns_resolver kernel module. This option does not work, as explained in the QA section.

Calling the key-utils SDK from C achieves the same result, but it basically amounts to copying the implementation of key.dns_resolver. It then occurred to me that Python could call the shared library (.so) instead, and I verified that this basically works. But there is a new problem:

Name resolution in Python's standard library also does not support specifying a DNS server, so supporting that requires a third-party DNS module.

Final comparison of options:

Option	Advantages	Drawbacks
A Python script calls the key-utils shared library (.so) to write the IP back to the kernel	Flexible control over access to coreDNS.	Requires a third-party DNS resolution module, or direct access to kube-apiserver to get the IP, which increases the load on kube-apiserver.
A shell script uses unshare for mount namespace isolation, generates a temporary /etc/ inside it, and calls /sbin/key.dns_resolver to do the resolution	No access to kube-apiserver is needed; the coreDNS address can be obtained from the kubelet configuration, and the script does not need to know the details of DNS resolution. More general: other Headless Services can use it as well.	No control over the frequency of calls.

Considering that this kind of failover resolution does not happen often, we finally chose the second option. The mount namespace can easily be created with unshare -m.
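The article's implementation is a shell script; the sketch below only shows the shape of the command such a wrapper would run, written in Python for illustration. It assumes the temporary resolver configuration has already been written to resolver_conf and that key.dns_resolver is invoked with the key ID as its argument. target_path is left as a parameter because it is whichever file under /etc/ the resolver reads; build_isolated_command is a hypothetical helper name.

```python
import shlex


def build_isolated_command(resolver_conf, target_path, key_id,
                           helper="/sbin/key.dns_resolver"):
    """Build an argv that runs the helper inside a private mount namespace.

    Inside the new namespace, resolver_conf is bind-mounted over target_path,
    so only the helper sees the temporary configuration; the host's file is
    left untouched.
    """
    inner = "mount --bind {src} {dst} && exec {helper} {key}".format(
        src=shlex.quote(resolver_conf),
        dst=shlex.quote(target_path),
        helper=shlex.quote(helper),
        key=shlex.quote(str(key_id)),
    )
    # unshare -m creates the mount namespace; sh -c runs both steps inside it.
    return ["unshare", "-m", "sh", "-c", inner]
```

The returned list can be passed to subprocess.run as-is. Because the bind mount lives only in the new mount namespace, it disappears when the helper exits and nothing needs to be cleaned up on the host apart from the temporary file.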

0x05 Supplementary QA

Q: /sbin/key.dns_resolver supports resolving domain names from /etc/hosts, so why not modify /etc/hosts?
A: /etc/hosts is a global configuration. Conflicting modifications are hard to control, and the impact when a conflict occurs is unpredictable.

Q: Why not change the /etc/ configuration to point to coreDNS?
A: Although coreDNS also supports forwarding non-k8s domains to the DNS servers specified in the host's /etc/, this would make the host depend on coreDNS, which has a disproportionate impact on the overall system.

Q: Why not use /sbin/request-key to write back the resolved IP address?
A: This approach was tested, and it turned out that request-key and dns_resolver handle C strings inconsistently: the former passes a payload length that does not include the trailing \0, while the latter requires it to be included. This was confirmed with a bpf hook.
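The length mismatch described in the last answer is easy to see in Python, and the same few lines show roughly what writing the payload back through the key-utils shared library could look like. This is a hedged sketch: build_payload and instantiate_key are illustrative names, the library is assumed to be discoverable as "keyutils", and keyctl_instantiate is assumed to have the signature (key ID, payload pointer, payload length, keyring ID).

```python
import ctypes
import ctypes.util


def build_payload(address, include_nul=True):
    """Encode an IP address as a key payload, optionally with a trailing NUL."""
    raw = address.encode("ascii")
    return raw + b"\0" if include_nul else raw


def instantiate_key(key_id, address, keyring=0):
    """Write the address into the key under construction (needs privileges)."""
    path = ctypes.util.find_library("keyutils")
    if path is None:
        raise OSError("libkeyutils not found")
    lib = ctypes.CDLL(path, use_errno=True)
    lib.keyctl_instantiate.argtypes = [
        ctypes.c_int32, ctypes.c_char_p, ctypes.c_size_t, ctypes.c_int32,
    ]
    lib.keyctl_instantiate.restype = ctypes.c_long
    payload = build_payload(address)  # length counts the trailing NUL
    if lib.keyctl_instantiate(key_id, payload, len(payload), keyring) < 0:
        raise OSError(ctypes.get_errno(), "keyctl_instantiate failed")
```

For "10.0.0.1", the payload is 8 bytes without the terminator and 9 bytes with it; that one-byte difference is the inconsistency described in the answer above.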

0x06 Summary

The problem was solved after trying several approaches. The most suitable one turned out to be a clever use of namespace isolation, which shows the benefit of understanding the underlying principles of containers.
As a brief review, namespaces serve two purposes:

Keeping the container from being affected by the host.
Keeping the container from affecting the host (the scenario in this article), so that it is free to set /etc/.