
    Network Analysis in Kubernetes: Using flannel as an Example

    We previously used kubernetes-vagrant-centos-cluster to install a three-node Kubernetes cluster. The node status is shown below.

    [root@node1 ~]# kubectl get nodes -o wide
    NAME      STATUS    ROLES     AGE       VERSION   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
    node1     Ready     <none>    2d        v1.9.1    <none>        CentOS Linux 7 (Core)   3.10.0-693.11.6.el7.x86_64   docker://1.12.6
    node2     Ready     <none>    2d        v1.9.1    <none>        CentOS Linux 7 (Core)   3.10.0-693.11.6.el7.x86_64   docker://1.12.6
    node3     Ready     <none>    2d        v1.9.1    <none>        CentOS Linux 7 (Core)   3.10.0-693.11.6.el7.x86_64   docker://1.12.6

    All Pods currently running in the Kubernetes cluster:

    [root@node1 ~]# kubectl get pods --all-namespaces -o wide
    NAMESPACE     NAME                                              READY     STATUS    RESTARTS   AGE       IP            NODE
    kube-system   coredns-5984fb8cbb-sjqv9                          1/1       Running   0          1h        172.33.68.2   node1
    kube-system   coredns-5984fb8cbb-tkfrc                          1/1       Running   1          1h        172.33.96.3   node3
    kube-system   heapster-v1.5.0-684c7f9488-z6sdz                  4/4       Running   0          1h        172.33.31.3   node2
    kube-system   kubernetes-dashboard-6b66b8b96c-mnm2c             1/1       Running   0          1h        172.33.31.2   node2
    kube-system   monitoring-influxdb-grafana-v4-54b7854697-tw9cd   2/2       Running   2          1h        172.33.96.2   node3

    The Pod subnets currently registered in etcd for each host:

    [root@node1 ~]# etcdctl ls /kube-centos/network/subnets
    /kube-centos/network/subnets/172.33.68.0-24
    /kube-centos/network/subnets/172.33.31.0-24
    /kube-centos/network/subnets/172.33.96.0-24

    The Pod subnet on each node is carved out of the network we configured when installing flannel. That configuration can be viewed in etcd:

    [root@node1 ~]# etcdctl get /kube-centos/network/config
    {"Network":"172.33.0.0/16","SubnetLen":24,"Backend":{"Type":"host-gw"}}
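
    This configuration has to exist in etcd before flanneld starts, and in a setup like this it is normally written once during installation. A sketch of how it could be written, using the same etcd v2 etcdctl syntax as the ls/get commands above and run on the node hosting etcd, is:

    [root@node1 ~]# etcdctl set /kube-centos/network/config \
        '{"Network":"172.33.0.0/16","SubnetLen":24,"Backend":{"Type":"host-gw"}}'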

    As we know, there are three kinds of IPs inside a Kubernetes cluster:

    • Node IP: the IP address of the host machine
    • Pod IP: the IP created by a network plugin (such as flannel), which makes Pods on different hosts reachable from each other
    • Cluster IP: a virtual IP; Services are reached through iptables rules

    When a node is provisioned, the processes on it start in the order flannel -> docker -> kubelet -> kube-proxy. We will follow the same order below and explain how flannel divides up the network, how it interacts with docker, and how Services are accessed through iptables.
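
    If you want to see where each of these three kinds of IPs shows up, the following commands are a quick way to check (a minimal sketch; the Service listing simply assumes some Services exist in your cluster):

    # Node IP: the address of each host (INTERNAL-IP / EXTERNAL-IP columns)
    kubectl get nodes -o wide

    # Pod IP: allocated by flannel from the node's subnet (IP column)
    kubectl get pods --all-namespaces -o wide

    # Cluster IP: the virtual IP of a Service, reachable only via iptables rules
    kubectl get svc --all-namespaces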

    Flannel

    Flannel is deployed on every node as a single binary and mainly does two things:

    • Allocate a subnet to each node; containers on that node automatically get their IP addresses from this subnet
    • When a node joins the network, add the corresponding route entries on every node (see the sketch right after this list)
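
    Below is a rough shell sketch of what these two functions amount to for the host-gw backend used in this article: read every subnet lease registered in etcd and install one route per remote subnet via the owning node's host IP. The loop, the jq parsing and the PublicIP field are illustrative assumptions based on flannel's etcd layout shown above, not flannel's actual implementation:

    #!/bin/bash
    # Illustrative sketch only -- not flannel's actual code.
    # Assumes the etcd v2 layout used in this article and that each subnet lease
    # stores the owning node's host IP in a "PublicIP" field (hypothetical parsing).
    PREFIX=/kube-centos/network
    MY_IP=172.30.113.231   # this node's eth2 address, hard-coded for the sketch

    for key in $(etcdctl ls ${PREFIX}/subnets); do
        # Key looks like /kube-centos/network/subnets/172.33.96.0-24
        subnet=$(basename "$key" | tr '-' '/')          # -> 172.33.96.0/24
        node_ip=$(etcdctl get "$key" | jq -r '.PublicIP')
        [ "$node_ip" = "$MY_IP" ] && continue           # our own subnet is handled by docker0
        # Send traffic for every remote Pod subnet to that node's host IP
        # over the flannel interface (eth2 in this setup, see -iface=eth2 later).
        ip route replace "$subnet" via "$node_ip" dev eth2
    done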

    The diagram below shows the flannel network architecture when the host-gw backend is used:

    flannel network architecture (image from openshift)

    Note: the IPs in the diagram are not the ones used in this example, but that does not affect understanding.

    The flannel configuration on node1 is as follows:

    [root@node1 ~]# cat /usr/lib/systemd/system/flanneld.service
    [Unit]
    Description=Flanneld overlay address etcd agent
    After=network.target
    After=network-online.target
    Wants=network-online.target
    After=etcd.service
    Before=docker.service

    [Service]
    Type=notify
    EnvironmentFile=/etc/sysconfig/flanneld
    EnvironmentFile=-/etc/sysconfig/docker-network
    ExecStart=/usr/bin/flanneld-start $FLANNEL_OPTIONS
    ExecStartPost=/usr/libexec/flannel/mk-docker-opts.sh -k DOCKER_NETWORK_OPTIONS -d /run/flannel/docker
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    RequiredBy=docker.service

    The two environment files referenced there are configured as follows:

    [root@node1 ~]# cat /etc/sysconfig/flanneld
    # Flanneld configuration options
    FLANNEL_ETCD_ENDPOINTS="http://172.17.8.101:2379"
    FLANNEL_ETCD_PREFIX="/kube-centos/network"
    FLANNEL_OPTIONS="-iface=eth2"

    The configuration file above is used only by flanneld.

    [root@node1 ~]# cat /etc/sysconfig/docker-network
    # /etc/sysconfig/docker-network
    DOCKER_NETWORK_OPTIONS=

    There is also ExecStartPost=/usr/libexec/flannel/mk-docker-opts.sh -k DOCKER_NETWORK_OPTIONS -d /run/flannel/docker: the /usr/libexec/flannel/mk-docker-opts.sh script runs after flanneld has started and generates two environment files:

    • /run/flannel/docker
    • /run/flannel/subnet.env

    Let's look at the contents of /run/flannel/docker.

    [root@node1 ~]# cat /run/flannel/docker
    DOCKER_OPT_BIP="--bip=172.33.68.1/24"
    DOCKER_OPT_IPMASQ="--ip-masq=true"
    DOCKER_OPT_MTU="--mtu=1500"
    DOCKER_NETWORK_OPTIONS=" --bip=172.33.68.1/24 --ip-masq=true --mtu=1500"

    If you use systemctl to start flannel first and then docker, docker will pick up the environment variables above.
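
    The way docker actually picks these variables up in this setup is through a systemd drop-in (a flannel.conf drop-in appears in the systemctl status output further below). A minimal version of such a drop-in, assuming it does nothing more than source the file generated by mk-docker-opts.sh, looks roughly like this:

    # /usr/lib/systemd/system/docker.service.d/flannel.conf (illustrative)
    [Service]
    EnvironmentFile=-/run/flannel/docker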

    Now let's look at /run/flannel/subnet.env.

    [root@node1 ~]# cat /run/flannel/subnet.env
    FLANNEL_NETWORK=172.33.0.0/16
    FLANNEL_SUBNET=172.33.68.1/24
    FLANNEL_MTU=1500
    FLANNEL_IPMASQ=false

    These values describe the subnet lease that flannel acquired for this node and registered in etcd.
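
    You can also read node1's lease straight from etcd. The key comes from the subnets listing earlier; the JSON below shows the shape of a host-gw lease (PublicIP pointing at node1's eth2 address) and is an expected example rather than captured output:

    [root@node1 ~]# etcdctl get /kube-centos/network/subnets/172.33.68.0-24
    {"PublicIP":"172.30.113.231","BackendType":"host-gw"}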

    Docker

    The docker configuration on node1 is as follows:

    [root@node1 ~]# cat /usr/lib/systemd/system/docker.service
    [Unit]
    Description=Docker Application Container Engine
    Documentation=http://docs.docker.com
    After=network.target rhel-push-plugin.socket registries.service
    Wants=docker-storage-setup.service
    Requires=docker-cleanup.timer

    [Service]
    Type=notify
    NotifyAccess=all
    EnvironmentFile=-/run/containers/registries.conf
    EnvironmentFile=-/etc/sysconfig/docker
    EnvironmentFile=-/etc/sysconfig/docker-storage
    EnvironmentFile=-/etc/sysconfig/docker-network
    Environment=GOTRACEBACK=crash
    Environment=DOCKER_HTTP_HOST_COMPAT=1
    Environment=PATH=/usr/libexec/docker:/usr/bin:/usr/sbin
    ExecStart=/usr/bin/dockerd-current \
              --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current \
              --default-runtime=docker-runc \
              --exec-opt native.cgroupdriver=systemd \
              --userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
              $OPTIONS \
              $DOCKER_STORAGE_OPTIONS \
              $DOCKER_NETWORK_OPTIONS \
              $ADD_REGISTRY \
              $BLOCK_REGISTRY \
              $INSECURE_REGISTRY \
              $REGISTRIES
    ExecReload=/bin/kill -s HUP $MAINPID
    LimitNOFILE=1048576
    LimitNPROC=1048576
    LimitCORE=infinity
    TimeoutStartSec=0
    Restart=on-abnormal
    MountFlags=slave
    KillMode=process

    [Install]
    WantedBy=multi-user.target

    Check the docker startup parameters on node1:

    [root@node1 ~]# systemctl status -l docker
    docker.service - Docker Application Container Engine
       Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
      Drop-In: /usr/lib/systemd/system/docker.service.d
               └─flannel.conf
       Active: active (running) since Fri 2018-02-02 22:52:43 CST; 2h 28min ago
         Docs: http://docs.docker.com
     Main PID: 4334 (dockerd-current)
       CGroup: /system.slice/docker.service
               4334 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --selinux-enabled --log-driver=journald --signature-verification=false --bip=172.33.68.1/24 --ip-masq=true --mtu=1500

    We can see that docker was started with the parameters --bip=172.33.68.1/24 --ip-masq=true --mtu=1500. These parameters were generated by the mk-docker-opts.sh script that runs after flanneld starts and were passed to docker through environment variables.

    Let's look at the network interfaces on the node1 host:

    [root@node1 ~]# ip addr
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
        link/ether 52:54:00:00:57:32 brd ff:ff:ff:ff:ff:ff
        inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic eth0
           valid_lft 85095sec preferred_lft 85095sec
        inet6 fe80::5054:ff:fe00:5732/64 scope link
           valid_lft forever preferred_lft forever
    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
        link/ether 08:00:27:7b:0f:b1 brd ff:ff:ff:ff:ff:ff
        inet 172.17.8.101/24 brd 172.17.8.255 scope global eth1
           valid_lft forever preferred_lft forever
    4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
        link/ether 08:00:27:ef:25:06 brd ff:ff:ff:ff:ff:ff
        inet 172.30.113.231/21 brd 172.30.119.255 scope global dynamic eth2
           valid_lft 85096sec preferred_lft 85096sec
        inet6 fe80::a00:27ff:feef:2506/64 scope link
           valid_lft forever preferred_lft forever
    5: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
        link/ether 02:42:d0:ae:80:ea brd ff:ff:ff:ff:ff:ff
        inet 172.33.68.1/24 scope global docker0
           valid_lft forever preferred_lft forever
        inet6 fe80::42:d0ff:feae:80ea/64 scope link
           valid_lft forever preferred_lft forever
    7: veth295bef2@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP
        link/ether 6a:72:d7:9f:29:19 brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet6 fe80::6872:d7ff:fe9f:2919/64 scope link
           valid_lft forever preferred_lft forever

    Let's go through the network interfaces in this VM one by one.

    • lo: loopback interface, 127.0.0.1
    • eth0: NAT network, assigned automatically when the VM is created; only reachable among the VMs
    • eth1: bridge network, using the address Vagrant assigns to the VM; reachable between the VMs and from the local machine
    • eth2: bridge network, address assigned via DHCP; the NIC used to access the internet
    • docker0: bridge network, the default bridge used by docker, acting as the virtual switch for all containers on this node
    • veth295bef2@if6: one end of a veth pair connecting docker0 with the container in a Pod. A veth pair can be thought of as two interfaces joined by a network cable: put one end in each of two namespaces and the two namespaces can talk to each other (see the sketch right after this list). For more background, refer to linux 网络虚拟化: network namespace 简介.
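
    To make the veth-pair idea concrete, here is a small self-contained sketch (the namespace name ns1 and the 10.1.1.0/24 addresses are made up for illustration and unrelated to the cluster above) that connects a new network namespace to the host in exactly the way docker0 and the veth interfaces connect a Pod:

    #!/bin/bash
    # Create a namespace and a veth pair, put one end inside the namespace,
    # assign addresses, and ping across -- the same mechanism docker0/veth use.
    ip netns add ns1
    ip link add veth-host type veth peer name veth-ns1
    ip link set veth-ns1 netns ns1

    ip addr add 10.1.1.1/24 dev veth-host
    ip link set veth-host up
    ip netns exec ns1 ip addr add 10.1.1.2/24 dev veth-ns1
    ip netns exec ns1 ip link set veth-ns1 up

    ip netns exec ns1 ping -c 1 10.1.1.1   # the two ends can now reach each other

    # Clean up: deleting the namespace removes both ends of the pair
    ip netns del ns1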

    Next, let's see which networks exist in docker on this node.

    [root@node1 ~]# docker network ls
    NETWORK ID          NAME                DRIVER              SCOPE
    940bb75e653b        bridge              bridge              local
    d94c046e105d        host                host                local
    2db7597fd546        none                null                local

    Then inspect the bridge network 940bb75e653b.

    [root@node1 ~]# docker network inspect 940bb75e653b
    [
        {
            "Name": "bridge",
            "Id": "940bb75e653bfa10dab4cce8813c2b3ce17501e4e4935f7dc13805a61b732d2c",
            "Scope": "local",
            "Driver": "bridge",
            "EnableIPv6": false,
            "IPAM": {
                "Driver": "default",
                "Options": null,
                "Config": [
                    {
                        "Subnet": "172.33.68.1/24",
                        "Gateway": "172.33.68.1"
                    }
                ]
            },
            "Internal": false,
            "Containers": {
                "944d4aa660e30e1be9a18d30c9dcfa3b0504d1e5dbd00f3004b76582f1c9a85b": {
                    "Name": "k8s_POD_coredns-5984fb8cbb-sjqv9_kube-system_c5a2e959-082a-11e8-b4cd-525400005732_0",
                    "EndpointID": "7397d7282e464fc4ec5756d6b328df889cdf46134dbbe3753517e175d3844a85",
                    "MacAddress": "02:42:ac:21:44:02",
                    "IPv4Address": "172.33.68.2/24",
                    "IPv6Address": ""
                }
            },
            "Options": {
                "com.docker.network.bridge.default_bridge": "true",
                "com.docker.network.bridge.enable_icc": "true",
                "com.docker.network.bridge.enable_ip_masquerade": "true",
                "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
                "com.docker.network.bridge.name": "docker0",
                "com.docker.network.driver.mtu": "1500"
            },
            "Labels": {}
        }
    ]

    We can see that the Config of this network matches the startup options passed to docker.

    Containers running on node1:

    [root@node1 ~]# docker ps
    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    a37407a234dd docker.io/coredns/coredns@sha256:adf2e5b4504ef9ffa43f16010bd064273338759e92f6f616dd159115748799bc "/coredns -conf /etc/" About an hour ago Up About an hour k8s_coredns_coredns-5984fb8cbb-sjqv9_kube-system_c5a2e959-082a-11e8-b4cd-525400005732_0
    944d4aa660e3 docker.io/openshift/origin-pod "/usr/bin/pod" About an hour ago Up About an hour k8s_POD_coredns-5984fb8cbb-sjqv9_kube-system_c5a2e959-082a-11e8-b4cd-525400005732_0

    We can see that two containers are currently running: the coredns container itself and its pod infrastructure ("pause") container, k8s_POD_coredns-..., which holds the Pod's network namespace.
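
    One way to see the relationship between the two containers is to check the coredns container's network mode: it should point at the pod infrastructure container, meaning they share a single network namespace and a single Pod IP. The command is standard docker CLI; the output shown below is what we would expect here, not captured from the node:

    [root@node1 ~]# docker inspect -f '{{ .HostConfig.NetworkMode }}' a37407a234dd
    container:944d4aa660e30e1be9a18d30c9dcfa3b0504d1e5dbd00f3004b76582f1c9a85b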

    Routing table on node1:

    [root@node1 ~]# route -n
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    0.0.0.0         10.0.2.2        0.0.0.0         UG    100    0        0 eth0
    0.0.0.0         172.30.116.1    0.0.0.0         UG    101    0        0 eth2
    10.0.2.0        0.0.0.0         255.255.255.0   U     100    0        0 eth0
    172.17.8.0      0.0.0.0         255.255.255.0   U     100    0        0 eth1
    172.30.112.0    0.0.0.0         255.255.248.0   U     100    0        0 eth2
    172.33.68.0     0.0.0.0         255.255.255.0   U     0      0        0 docker0
    172.33.96.0     172.30.118.65   255.255.255.0   UG    0      0        0 eth2

    The last route above (172.33.96.0/24 via 172.30.118.65) was added by flannel. Whenever a new node joins the Kubernetes cluster, flannel adds another route like this to the routing table of every node.
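
    You can ask the kernel which route it would use for a Pod on node3, for example the coredns Pod at 172.33.96.3. The output below is what the routing table above implies rather than captured output:

    [root@node1 ~]# ip route get 172.33.96.3
    172.33.96.3 via 172.30.118.65 dev eth2 src 172.30.113.231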

    From node1, let's traceroute to the coredns-5984fb8cbb-tkfrc container on node3, whose IP address is 172.33.96.3, and look at the path the traffic takes.

    [root@node1 ~]# traceroute 172.33.96.3
    traceroute to 172.33.96.3 (172.33.96.3), 30 hops max, 60 byte packets
     1  172.30.118.65 (172.30.118.65)  0.518 ms  0.367 ms  0.398 ms
     2  172.33.96.3 (172.33.96.3)  0.451 ms  0.352 ms  0.223 ms

    We can see that after a single hop through node3's host IP (172.30.118.65, flannel's PublicIP for node3), the traffic reaches the Pod on node3 directly.

    iptables rules on node1:

    [root@node1 ~]# iptables -L
    Chain INPUT (policy ACCEPT)
    target     prot opt source               destination
    KUBE-FIREWALL  all  --  anywhere             anywhere
    KUBE-SERVICES  all  --  anywhere             anywhere             /* kubernetes service portals */

    Chain FORWARD (policy ACCEPT)
    target     prot opt source               destination
    KUBE-FORWARD  all  --  anywhere             anywhere             /* kubernetes forward rules */
    DOCKER-ISOLATION  all  --  anywhere             anywhere
    DOCKER     all  --  anywhere             anywhere
    ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
    ACCEPT     all  --  anywhere             anywhere
    ACCEPT     all  --  anywhere             anywhere

    Chain OUTPUT (policy ACCEPT)
    target     prot opt source               destination
    KUBE-FIREWALL  all  --  anywhere             anywhere
    KUBE-SERVICES  all  --  anywhere             anywhere             /* kubernetes service portals */

    Chain DOCKER (1 references)
    target     prot opt source               destination

    Chain DOCKER-ISOLATION (1 references)
    target     prot opt source               destination
    RETURN     all  --  anywhere             anywhere

    Chain KUBE-FIREWALL (2 references)
    target     prot opt source               destination
    DROP       all  --  anywhere             anywhere             /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000

    Chain KUBE-FORWARD (1 references)
    target     prot opt source               destination
    ACCEPT     all  --  anywhere             anywhere             /* kubernetes forwarding rules */ mark match 0x4000/0x4000
    ACCEPT     all  --  10.254.0.0/16        anywhere             /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
    ACCEPT     all  --  anywhere             10.254.0.0/16        /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED

    Chain KUBE-SERVICES (2 references)
    target     prot opt source               destination

    From the iptables output above we can see that many rules for Kubernetes Services have been injected.
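
    Note that iptables -L only shows the filter table; the DNAT rules that actually translate a Cluster IP into Pod IPs live in the nat table maintained by kube-proxy. To inspect them, you can run the following (which chains and rules you see depends on the Services deployed in your cluster):

    [root@node1 ~]# iptables -t nat -L KUBE-SERVICES -n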

    References

    • coreos/flannel - github.com
    • linux 网络虚拟化: network namespace 简介
    • Linux虚拟网络设备之veth
    • flannel host-gw network
    • flannel - openshift.com