Search Domain Resolution Failure by Public Nameservers in Onprem Setup
Problem
Setting public nameservers like [1.1.1.1] and [8.8.8.8] in the /etc/resolv.conf
file of the Management Plane master nodes of the airgapped environment, affecting pods to be in timeout errors:
[ERROR] plugin/errors: 2 keystone. [namespace].svc.cluster.local.abc.com. AAAA: read udp [Pod_IP]:55269→1.1.1.1:53: i/o timeout
[ERROR] plugin/errors: 2 percona-db-pxc-db-haproxy. [namespace].svc.cluster.local.abc.com. AAAA: read udp [Pod_IP]:57739->1.1.1.1:53: i/o timeout
[ERROR] plugin/errors: 2 decco-consul-consul-ui.default.svc.cluster.local.abc.com. A: read udp
• [Pod_IP]:60181→1.1.1.1:53: 1/o timeout
[ERROR] plugin/errors: 2 keystone. [namespace].svc.cluster.local.abc.com. A: read udp
[Pod_IP]:50706→8.8.8.8:53: 1/o timeout
Observing Liveness/Readiness probe failures on pods like Glance-Api, Nova-Api-Osapi, Neutron-Server, etc,
Warning Unhealthy 30m (x830 over 26h) kubelet Liveness probe failed: Get "http://[Pod_IP]:9292/": dial tcp [Pod_IP]:9292: connect: connection refused
Warning Unhealthy 10m (x2745 over 26h) kubelet Readiness probe failed: Get "http://[Pod_IP]:9292/": dial tcp [Pod_IP]:9292: connect: connection refused
Environment
- Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
- Self-Hosted Private Cloud Director Kubernetes - v2025.4 and Higher
- Component: DNS
Cause
CoreDNS demonstrates a behavior in which any error encountered by its plugins can prevent DNS name resolution. In airgapped environments, this results in public nameservers being unreachable, even if they remain listed in /etc/resolv.conf
(via /etc/netplan/50-cloud-init.yaml
file in this case)
sudo cat /etc/netplan/50-cloud-init.yaml •...
vlans
production.91
addresses
Adrdress
id91
gateway4 Gateway address
link LInk Name
mtu9000
nameservers
addresses
8.8.8.8
1.1.1.1
search
abc.com
Here, when the PCD pods attempt to communicate with other pods via DNS, the presence of an additional search domain (e.g., abc.com
) and the absence of an internal nameserver cause DNS queries to be forwarded to upstream nameservers (such as 8.8.8.8 and 1.1.1.1) configured on the nodes. Since internet connectivity is disabled, these queries time out, leading to I/O timeout errors.
This situation generates a high volume of failed DNS requests—approximately 2,800 per 166 seconds -- as pods continuously attempt to resolve names unsuccessfully. The resulting delays cause further instability within the environment, including failures in PCD pod DNS resolution, readiness and liveness probe failures, and overall management plane instability.
Diagnostics
- Timeout errors mentioning the failure at the DNS servers [1.1.1.1:53, and 8.8.8.8:53]
[ERROR] plugin/errors: 2 keystone. namespace.svc.cluster.local.abc.com. AAAA: read udp [Pod_IP]:55269→1.1.1.1:53: i/o timeout
[ERROR] plugin/errors: 2 percona-db-pxc-db-haproxy. namespace.svc.cluster.local.abc.com. AAAA: read udp [Pod_IP]:57739->1.1.1.1:53: i/o timeout
[ERROR] plugin/errors: 2 decco-consul-consul-ui.default.svc.cluster.local.abc.com. A: read udp
[Pod_IP]:60181→1.1.1.1:53: 1/o timeout
[ERROR] plugin/errors: 2 keystone. namespace.svc.cluster.local.abc.com. A: read udp
[Pod_IP]:50706→8.8.8.8:53: 1/o timeout
Check nameserver values in the below configuration files:
$ cat /etc/netplan/50-cloud-init.yaml
...
nameserver 8.8.8.8
nameserver 1.1.1.1
search abc.com
$ cat /etc/resolv.conf
...
nameserver 8.8.8.8
nameserver 1.1.1.1
search abc.com
Resolution
- Remove any unnecessary search domains from your DNS configuration.
- Configure custom or internal nameservers that are accessible within your environment and capable of resolving the required DNS queries.
Validation
- All the pods in the Management Plane cluster are in the running state.
- No Readiness/Liveness probe failures in the OpenStack component pods like Nova, Neutron, Glance etc.