Nodelet Fails to Cleanup Data On The Node After it is Removed From The Qbert Cluster

Problem

When a previously used node is added to a cluster, the PMK stack fails to come up at the etcd phase, and the etcd logs show the error below.

Etcd error while bringing up the stack

{"log":"{\"level\":\"warn\",\"ts\":\"2023-01-13T05:22:18.859Z\",\"caller\":\"etcdserver/server.go:1095\",\"msg\":\"server error\",\"error\":\"the member has been permanently removed from the cluster\"}\n","stream":"stderr","time":"2023-01-13T05:22:18.860123163Z"}
{"log":"{\"level\":\"warn\",\"ts\":\"2023-01-13T05:22:18.860Z\",\"caller\":\"etcdserver/server.go:1096\",\"msg\":\"data-dir used by this member must be removed\"}\n","stream":"stderr","time":"2023-01-13T05:22:18.860139609Z"}
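To check a node's etcd container log for this condition without reading it by eye, something like the following Python sketch may help. It assumes the log is in the container-runtime JSON format shown above (an outer object whose `log` field holds etcd's own JSON line), and it tolerates several objects fused onto one line. The function names are illustrative and are not part of any Platform9 tooling.

```python
import json

REMOVED_MSG = "the member has been permanently removed from the cluster"


def iter_json_objects(text):
    """Yield JSON values from text, even when several are fused on one line."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def member_was_removed(log_text):
    """Return True if any etcd log entry reports the 'permanently removed' error."""
    for outer in iter_json_objects(log_text):
        if not isinstance(outer, dict):
            continue
        try:
            inner = json.loads(outer.get("log", ""))
        except json.JSONDecodeError:
            continue  # the wrapped line is not JSON; ignore it
        if isinstance(inner, dict) and inner.get("error") == REMOVED_MSG:
            return True
    return False


# The two entries from this article, fused exactly as they appear in the raw log.
SAMPLE = (
    r'{"log":"{\"level\":\"warn\",\"ts\":\"2023-01-13T05:22:18.859Z\",'
    r'\"caller\":\"etcdserver/server.go:1095\",\"msg\":\"server error\",'
    r'\"error\":\"the member has been permanently removed from the cluster\"}\n",'
    r'"stream":"stderr","time":"2023-01-13T05:22:18.860123163Z"}'
    r'{"log":"{\"level\":\"warn\",\"ts\":\"2023-01-13T05:22:18.860Z\",'
    r'\"caller\":\"etcdserver/server.go:1096\",'
    r'\"msg\":\"data-dir used by this member must be removed\"}\n",'
    r'"stream":"stderr","time":"2023-01-13T05:22:18.860139609Z"}'
)

if __name__ == "__main__":
    print(member_was_removed(SAMPLE))
```

Matching on the `error` field rather than on a substring of the whole line keeps the check from firing on unrelated entries that merely quote the message.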

Environment

Platform9 Managed Kubernetes - v5.4 and above

Cause

When a node is detached from the cluster, qbert removes that node's etcd member from the etcd cluster.
If the etcd container is running on the node at that moment, it crashes.
If nodelet is in the middle of a status check and has not yet completed the check for the etcd phase, that check fails, and nodelet triggers a partial restart of the stack.
The start phase of etcd_run.sh then keeps failing because the etcd member has already been removed from the etcd cluster. This phase retries for a total of 900s (90 retries with a 10s sleep between them), i.e. 15 minutes.
During this time nodelet sends no status updates to sunpike and therefore receives no config updates. The last status update usually reported the host state as ok; qbert picks this up and marks the cluster state as success.
Deauthorizing the node while it is in this state causes nodelet to skip proper cleanup of the /var/opt/pf9/kube/ directory.
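The 15-minute window follows from the retry budget described above: 90 attempts with a fixed 10s sleep. The sketch below is a generic fixed-sleep retry loop in Python, not nodelet's actual implementation; `retry_fixed` and its parameters are hypothetical names chosen for illustration.

```python
import time


def retry_fixed(action, retries=90, sleep_s=10, sleep=time.sleep):
    """Call action() up to `retries` times with a fixed sleep after each failure.

    Returns (succeeded, seconds_slept). The defaults mirror the figures quoted
    in this article; the function itself is only an illustration.
    """
    waited = 0
    for _ in range(retries):
        if action():
            return True, waited
        sleep(sleep_s)
        waited += sleep_s
    return False, waited


if __name__ == "__main__":
    # An action that never succeeds, with a no-op sleep so the demo returns at once.
    ok, waited = retry_fixed(lambda: False, sleep=lambda seconds: None)
    print(ok, waited, waited / 60)
```

With an action that never succeeds, the loop exhausts all 90 attempts and accumulates 900 seconds of sleep, which is the window during which no status updates reach sunpike.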

Resolution

An internal JIRA ticket, PMK-5586, has been raised to fix this issue.
As a workaround, clean up the entire /var/opt/pf9/kube directory after the node has been deauthorized and before it is onboarded again to the same or a new cluster.
Also ensure that the value of the config parameter ETCD_ENV is empty in /etc/pf9/kube.env.
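The two workaround checks can be sketched in Python as below. This is a minimal sketch under stated assumptions: it assumes /etc/pf9/kube.env uses plain `KEY=VALUE` lines (optionally prefixed with `export` and optionally quoted), which this article does not confirm, and the helper names are hypothetical. The cleanup helper defaults to a dry run because removing the directory is destructive; run it only on a node that has already been deauthorized.

```python
import shutil
from pathlib import Path

KUBE_DIR = Path("/var/opt/pf9/kube")   # directory named in this article
KUBE_ENV = Path("/etc/pf9/kube.env")   # file named in this article


def etcd_env_is_empty(env_file=KUBE_ENV):
    """Return True if ETCD_ENV is unset or empty in the env file.

    Assumes KEY=VALUE lines; if the key appears more than once, the last wins.
    """
    path = Path(env_file)
    if not path.exists():
        return True
    value = None
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, _, val = line.partition("=")
        if key.strip() == "ETCD_ENV":
            value = val.strip().strip("\"'")
    return value is None or value == ""


def clean_kube_dir(kube_dir=KUBE_DIR, dry_run=True):
    """Remove the kube data directory; return True only if it was deleted."""
    path = Path(kube_dir)
    if not path.exists():
        return False
    if dry_run:
        print(f"would remove {path}")
        return False
    shutil.rmtree(path)
    return True
```

Calling `clean_kube_dir(dry_run=False)` followed by `etcd_env_is_empty()` before re-onboarding would cover both conditions listed above; if the second call returns False, the ETCD_ENV value still needs to be cleared by hand.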
