Metrics-server Pods Are Continuously Restarting With Probe Failures

Problem

Metrics-server pods are restarting with following errors:

shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
timeout waiting for SETTINGS frames from 10.69.69.198:22639
writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
writers.go:130] apiserver was unable to write a fallback JSON response: http: Handler timeout
writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
wrap.go:54] timeout or abort while handling: GET "/apis/metrics.k8s.io/v1beta1"
status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout

controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.<br>","stream":"stderr","time":"2022-06-23T12:37:12.03602478Z"}
available_controller.go:508] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.21.157.51:443/apis/metrics.k8s.io/v1beta1: Get \"https://10.21.157.51:443/apis/metrics.k8s.io/v1beta1\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)<br>","stream":"stderr","time":"2022-06-23T12:37:16.052335263Z"}
handler_proxy.go:102] no RequestInfo found in the context<br>","stream":"stderr","time":"2022-06-23T12:37:17.053434398Z"}
controller.go:116] loading OpenAPI spec for \"v1beta1.metrics.k8s.io\" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable<br>","stream":"stderr","time":"2022-06-23T12:37:17.053516683Z"}
{"log":", Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]<br>","stream":"stderr","time":"2022-06-23T12:37:17.053519758Z"}

Environment

Platform9 Edge Cloud - v5.3 and above
Metrics-server - v0.5.0

Cause

api-server logs shows the large context deadline exceeded which indicated the CPU resource isn't enough for the pods.

Resolution

Use the following steps to increase the requests and limits for the metrics-server container

# /opt/pf9/qbert/bin/kubectl get clusteraddons <CLUSTERUUID>-metrics-server --kubeconfig='/etc/sunpike/kubeconfig' -o yaml | grep watch
        f:watch: {}
  watch: true

On Edit action, set watch: false

spec:
  clusterID: 8dcfa1b7-366e-4a6a-aa07-6dcbb773bde9
  override:
    params:
    - name: metricsMemoryLimit
      value: 300Mi
    - name: metricsCpuLimit
      value: 100m
  type: metrics-server
  version: 0.5.0
  watch: false

Info

When the watch is disabled you won't see the field watch under spec because it will only show if the watch is set to True.

Scaled down the metrics-server deployment to 0

# kubectl scale deployment --replicas=0 metrics-server-v0.5.0 -n kube-system

To increase CPU to 200M we need to tweak extra-cpu

Command:
      /pod_nanny
      --cpu=40m
      --extra-cpu=10m   <<------
      --minClusterSize=16

Note

In the above example, we wanted to set 100m CPU for metrics server container we increased the extra-cpu to 10 so that CPU will become 200M. The calculation formula is [cpu+(extra-cpu*minClusterSize)]

Scale the metric-server pod replicas back to 1

# kubectl scale deployment --replicas=1 metrics-server-v0.5.0 -n kube-system

Verify the metrics-server pod CPU resource:

-----------------------
    Limits:
      cpu:     200m
      memory:  104Mi
    Requests:
      cpu:        200m
      memory:     104Mi
-----------------------

PreviousDisk Upgrade Failure During Management Plane Upgrade NextSunpike Services Are Failed Due To Dependency On Pf9-vault

Last updated 2 months ago

Good morning

hashtagProblem

hashtagEnvironment

hashtagCause

hashtagResolution

Problem

Environment

Cause

Resolution