Networking is a crucial part of Kubernetes. Behind the Kubernetes network there is a component that works under the hood, translating your Services into usable networking rules. This component is called kube-proxy. kube-proxy is another per-node daemon in Kubernetes, like the kubelet.

kube-proxy provides basic load balancing functionality within the cluster. It implements Services and relies on Endpoints/EndpointSlices. It may help to reference that section, but the following is the relevant and quick explanation:
- Services define a load balancer for a set of pods.
- Endpoints (and EndpointSlices) list a set of ready pod IPs. They are created automatically from a Service, with the same pod selector as the Service.
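For example, you can see a Service and the Endpoints/EndpointSlices generated from it side by side (my-svc below is just a placeholder name):

```
# A Service and its cluster IP (my-svc is a placeholder name)
kubectl get service my-svc

# The Endpoints object of the same name lists the ready pod IPs
kubectl get endpoints my-svc

# EndpointSlices carry a label pointing back to their Service
kubectl get endpointslices -l kubernetes.io/service-name=my-svc
```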
Most types of Services have an IP address for the Service, called the cluster IP address, which is not routable outside the cluster. kube-proxy is responsible for routing requests for a Service's cluster IP address to healthy pods. kube-proxy is by far the most common implementation of Kubernetes Services, but there are alternatives, such as Cilium's kube-proxy replacement mode.
kube-proxy has four modes, which change its runtime mode and exact feature set:
- userspace
- iptables
- ipvs
- kernelspace
You can specify the mode using --proxy-mode <mode>. It's worth noting that all modes rely on iptables to some extent.

You can check which mode a running kube-proxy is using with the commands below.
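One option, assuming kube-proxy's metrics endpoint listens on the default 127.0.0.1:10249, is to query its proxyMode endpoint from the node; on kubeadm clusters the configured mode is also visible in the kube-proxy ConfigMap:

```
# Ask the running kube-proxy directly (run on the node; default metrics address is 127.0.0.1:10249)
curl http://localhost:10249/proxyMode

# On kubeadm-based clusters, the configured mode is stored in the kube-proxy ConfigMap
kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"
```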
userspace Mode
Chains
- KUBE-PORTALS-CONTAINER
- KUBE-NODEPORT-CONTAINER
- KUBE-PORTALS-HOST
- KUBE-NODEPORT-HOST
The first and oldest mode is userspace mode. In userspace mode, kube-proxy runs a web server and routes all Service IP addresses to the web server, using iptables. The web server terminates connections and proxies the request to a pod in the Service's endpoints.
userspace Mode's Disadvantage
kube-proxy runs as a process in user space, while Netfilter, which is in charge of the Host's networking, lives in the kernel.

Anything a user-space process does ultimately goes through the kernel: it has to ask the kernel for CPU time, for disk I/O, and for memory, so a user-space program is much slower than a service implemented in the kernel itself.

kube-proxy in userspace mode requires a lot of crossings between user space and the kernel, because most of the networking work, such as load balancing and applying packet rules, is done by kube-proxy itself, which is a process. Because of this, userspace mode noticeably slows down networking. It is no longer commonly used, and we suggest avoiding it unless you have a clear reason to use it.
iptables Mode
- PREROUTING, OUTPUT Tables
  - Since the request packet sent from most Pods reaches the Host's Network Namespace through the Pod's veth, it is delivered to the KUBE-SERVICES table by the PREROUTING table.
  - A request packet sent by a Pod or a Host Process that uses the Host's Network Namespace is delivered to the KUBE-SERVICES table by the OUTPUT table.
- KUBE-SERVICES Table
  - If the Dest IP and Dest Port of the request packet match the IP and Port of a ClusterIP Service, the packet is forwarded to KUBE-SVC-XXX, the NAT table of the matching ClusterIP Service.
  - If the Dest IP of the request packet is the Node's own IP, the packet is delivered to the KUBE-NODEPORTS table.
  - If the Dest IP and Dest Port of the request packet match the External IP and Port of a LoadBalancer Service, the packet is delivered to KUBE-FW-XXX, the NAT table of the matching LoadBalancer Service, and then on to the KUBE-SVC-XXX table.
- KUBE-NODEPORTS Table
  - If the Dest Port of the request packet matches the Port of a NodePort Service, the packet is forwarded to KUBE-SVC-XXX, the NAT table of that NodePort Service.
Chains
- KUBE-SERVICES
- KUBE-NODEPORTS
- KUBE-FW-XXX
- KUBE-SVC-XXX
- KUBE-SEP-XXX
- KUBE-POSTROUTING
- KUBE-MARK-MASQ
- KUBE-MARK-DROP
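Each KUBE-SVC-XXX table load balances across per-endpoint KUBE-SEP-XXX tables, which DNAT the packet to a single Pod IP and Port. To see these chains on a node running in iptables mode, you can dump the nat table; the XXX hash suffixes and the addresses will differ in every cluster:

```
# Entry points that match Service cluster IPs and node ports
iptables -t nat -L KUBE-SERVICES -n | head
iptables -t nat -L KUBE-NODEPORTS -n | head

# Per-service (KUBE-SVC-XXX) and per-endpoint (KUBE-SEP-XXX) chains
iptables-save -t nat | grep -E 'KUBE-(SVC|SEP)-' | head
```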
Source IP

The Src IP of a Service request packet is either preserved or SNATed to the Host's IP through Masquerade. The KUBE-MARK-MASQ table marks the request packets that need to be Masqueraded. Marked packets are Masqueraded in the KUBE-POSTROUTING table, where the Src IP is SNATed to the Host's IP. If you look at the iptables tables, you can see that packets to be Masqueraded are marked through the KUBE-MARK-MASQ table.
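The marking and masquerading rules can be inspected directly; the exact output depends on the kube-proxy version, but it looks roughly like the comments below:

```
# KUBE-MARK-MASQ only sets a firewall mark on the packet
iptables -t nat -S KUBE-MARK-MASQ
# roughly: -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000

# KUBE-POSTROUTING then MASQUERADEs (SNATs) packets carrying that mark
iptables -t nat -S KUBE-POSTROUTING
```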
If the externalTrafficPolicy value is set to Local, the KUBE-NODEPORTS table no longer jumps to the KUBE-MARK-MASQ table, so Masquerade is not performed and the Src IP of the request packet is maintained as it is. In addition, the request packet is not load balanced by the Host, but is delivered only to a Target Pod running on the Host that received it. If the request packet arrives at a Host without a Target Pod, it is dropped.

The left side of the figure below shows the case where Masquerade is not performed because externalTrafficPolicy is set to Local.
externalTrafficPolicy: Local is mainly used with LoadBalancer Services. The Src IP of the request packet can be maintained, and the Cloud Provider's load balancer already performs load balancing, so load balancing on the Host is unnecessary.

If the externalTrafficPolicy value is Local, packets are dropped on Hosts that have no Target Pod, so during the Host health check performed by the Cloud Provider's load balancer, Hosts without a Target Pod are excluded from the load balancing targets. Therefore the Cloud Provider's load balancer only sends request packets to Hosts that have a Target Pod.
Masquerade is also necessary when a Pod sends a request packet to the IP of a Service it belongs to and the packet is load balanced back to the Pod itself.
- The left side of the figure above shows this case. The request packet is DNATed, and both the Src IP and Dest IP of the packet become the Pod's IP.
- Therefore, when the Pod sends a response to a request packet that was returned to itself, the response is processed inside the Pod without going through the Host's NAT table, so SNAT is not performed.
- Masquerade forces the request packet that is returned to the Pod to go by way of the Host, so that SNAT can be performed. This technique of deliberately sending a packet on a detour and receiving it back is called hairpinning.
- The right side of the figure above shows the case of applying hairpinning using Masquerade.
- In the KUBE-SEP-XXX table, if the Src IP of the request packet is the same as the IP it is about to be DNATed to, that is, if a packet the Pod sent to the Service is received by the Pod itself, the request packet is marked through the KUBE-MARK-MASQ table and Masqueraded in the KUBE-POSTROUTING table.
- Since the Src IP of the packet the Pod receives is then set to the Host's IP, the Pod's response is delivered to the Host's NAT table, where it is SNATed and DNATed and then delivered to the Pod.
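The hairpin rule is visible in the per-endpoint chains: a KUBE-SEP-XXX table first jumps to KUBE-MARK-MASQ when the Src IP equals its own endpoint's Pod IP, and only then performs the DNAT. A rough sketch of what to look for (the chain hashes and the Pod IP 10.244.1.5 are illustrative):

```
iptables-save -t nat | grep 'KUBE-SEP-' | head
# roughly:
# -A KUBE-SEP-XXXXXXXXXXXXXXXX -s 10.244.1.5/32 -j KUBE-MARK-MASQ
# -A KUBE-SEP-XXXXXXXXXXXXXXXX -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:8080
```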
ipvs Mode
- PREROUTING, OUTPUT Tables
  - Since the request packet sent from most Pods reaches the Host's Network Namespace through the Pod's veth, it is delivered to the KUBE-SERVICES table by the PREROUTING table.
  - A request packet sent by a Pod or a Host Process that uses the Host's Network Namespace is delivered to the KUBE-SERVICES table by the OUTPUT table. (Same as in iptables mode.)
- KUBE-SERVICES Table
  - If the Dest IP and Dest Port of the request packet match the IP and Port of a ClusterIP Service, the packet is delivered to IPVS.
  - If the Dest IP of the request packet is the Node's own IP, the packet is delivered to IPVS via the KUBE-NODE-PORT table.
- KUBE-NODE-PORT Table
  - If the Dest Port of the request packet matches the Port of a NodePort Service, the packet is delivered to IPVS via the KUBE-NODE-PORT table.
  - If the default rule of the PREROUTING and OUTPUT tables is Accept, packets destined for a Service reach IPVS even without the KUBE-SERVICES table, so the Service is not affected.
- In the following situations, IPVS load balances the request packet and DNATs it to a Pod's IP and the Port set by the Service:
  - if the Dest IP and Dest Port of the request packet match the ClusterIP and Port of a Service,
  - if the Dest IP of the request packet is the Node's own IP and the Dest Port matches the NodePort of a NodePort Service,
  - if the Dest IP and Dest Port of the request packet match the External IP and Port of a LoadBalancer Service.
- The request packet DNATed to a Pod's IP is delivered to the Pod through the container network built by the CNI plugin. [IPVS List] shows that load balancing and DNAT are performed for all IPs associated with Services.
- Like iptables, IPVS also uses the TCP connection information in the Linux kernel's conntrack. Therefore, the response to a service packet that was DNATed by IPVS is SNATed back by IPVS and delivered to the Pod or Host Process that requested the Service. (The conntrack example after this list shows how to inspect these entries.)
- In IPVS mode, as in iptables mode, hairpinning is applied to solve the SNAT problem for service response packets. In the KUBE-POSTROUTING table, if the KUBE-LOOP-BACK IPset rule is matched, Masquerade is performed.
  - The KUBE-LOOP-BACK IPset contains every combination in which the Src IP and Dest IP of a packet can be the IP of the same Pod.
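For the conntrack item above: assuming the conntrack tool is installed on the node and a hypothetical ClusterIP of 10.96.0.10, the tracked connections (and the Pod IP that IPVS chose for each) can be listed like this:

```
# Connections whose original destination is the (hypothetical) ClusterIP 10.96.0.10;
# the reply direction shows the Pod IP selected by IPVS
conntrack -L -d 10.96.0.10
```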
Chains
- KUBE-SERVICES
- KUBE-NODE-PORT
- KUBE-LOAD-BALANCER
- KUBE-POSTROUTING
- KUBE-MARK-MASQ
IPVS List
IPset List
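The [IPVS List] and [IPset List] referenced above correspond to what ipvsadm and ipset print on an ipvs-mode node (assuming both tools are installed there):

```
# Virtual servers (one entry per Service IP/Port) and the real servers (Pod IPs) behind them
ipvsadm -Ln

# The IPset that drives hairpin masquerading in the KUBE-POSTROUTING table
ipset list KUBE-LOOP-BACK
```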
ipvs mode uses IPVS for connection load balancing. ipvs mode supports six load balancing modes, specified with --ipvs-scheduler:

- rr: Round-robin
- lc: Least connection
- dh: Destination hashing
- sh: Source hashing
- sed: Shortest expected delay
- nq: Never queue

Round-robin (rr) is the default load balancing mode. It is the closest analog to iptables mode's behavior (in that connections are made fairly evenly regardless of pod state), though iptables mode does not actually perform round-robin routing.
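For reference, the mode and scheduler are ordinary kube-proxy flags. In a kubeadm cluster they are normally set through the kube-proxy ConfigMap rather than on the command line, but as flags they look like this:

```
# Run kube-proxy in ipvs mode with the least-connection scheduler
kube-proxy --proxy-mode=ipvs --ipvs-scheduler=lc
```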
kernelspace Mode
kernelspace is the newest, Windows-only mode. It provides an alternative to userspace mode for Kubernetes on Windows, as iptables and ipvs are specific to Linux.