Configuring Persistent Storage
Platform9 Monitoring deploys Prometheus, Alertmanager, and Grafana on any cluster in a single click; this deployment uses ephemeral storage by default. To configure Platform9 Monitoring to use persistent storage, a storage class must be added to the cluster and the monitoring deployment updated with kubectl to consume that storage class.
To enable persistent storage you must have a storage class configured on the cluster that is able to provision persistent volume claims.
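You can confirm that a suitable storage class exists, and note its name, with:

kubectl get storageclass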
Add a Storage Class to Prometheus
The first step is to set up a storage class. If your cluster is running without storage, follow the guide to set up the Portworx CSI.
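For reference, a minimal StorageClass manifest looks similar to the sketch below. The name and provisioner are illustrative; the provisioner must match the CSI driver that is actually installed on your cluster.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: portworx-csi-sc        # example name, matching the class referenced later in this guide
provisioner: pxd.portworx.com  # assumption: Portworx CSI driver; replace with your provisioner
reclaimPolicy: Delete
volumeBindingMode: Immediate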
Once you have a storage class configured, run the kubectl command below to edit the Prometheus resource:
kubectl -n pf9-monitoring edit prometheus system

Editing the running configuration uses the Linux command-line text editor Vi. For help with Vi, view this guide.
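If you prefer a different editor, kubectl honors the KUBE_EDITOR environment variable, for example:

KUBE_EDITOR="nano" kubectl -n pf9-monitoring edit prometheus system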
The default configuration is shown below. This configuration needs to be updated with a valid storage specification.
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  creationTimestamp: "2021-01-15T18:09:32Z"
  generation: 1
  managedFields:
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:additionalScrapeConfigs:
          .: {}
          f:key: {}
          f:name: {}
        f:alerting:
          .: {}
          f:alertmanagers: {}
        f:replicas: {}
        f:resources:
          .: {}
          f:requests:
            .: {}
            f:cpu: {}
            f:memory: {}
        f:retention: {}
        f:ruleSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:prometheus: {}
            f:role: {}
        f:rules:
          .: {}
          f:alert: {}
        f:scrapeInterval: {}
        f:serviceAccountName: {}
        f:serviceMonitorSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:prometheus: {}
            f:role: {}
    manager: promplus
    operation: Update
    time: "2021-01-15T18:09:32Z"
  name: system
  namespace: pf9-monitoring
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: false
    controller: false
    kind: Deployment
    name: monhelper
    uid: cbc48a82-3c1f-4a2b-9b2a-ebbc32ae2e65
  resourceVersion: "2733"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/pf9-monitoring/prometheuses/system
  uid: c1722922-4973-4973-8e29-ba0269ad9a79
spec:
  additionalScrapeConfigs:
    key: additional-scrape-config.yaml
    name: scrapeconfig
  alerting:
    alertmanagers:
    - name: sys-alertmanager
      namespace: pf9-monitoring
      port: web
  replicas: 1
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
  retention: 7d
  ruleSelector:
    matchLabels:
      prometheus: system
      role: alert-rules
  rules:
    alert: {}
  scrapeInterval: 2m
  serviceAccountName: system-prometheus
  serviceMonitorSelector:
    matchLabels:
      prometheus: system
      role: service-monitor

The deployment needs to have the following storage section added. The storage class name must be updated to match your cluster and the amount of storage should also be specified.
The storage class in this example is backed by Portworx storage; to add Portworx, see the Portworx CSI guide.
storage:
  volumeClaimTemplate:
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: portworx-csi-sc

The final configuration should match the configuration below.
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  creationTimestamp: "2021-02-23T02:06:49Z"
  generation: 2
  managedFields:
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:additionalScrapeConfigs:
          .: {}
          f:key: {}
          f:name: {}
        f:alerting:
          .: {}
          f:alertmanagers: {}
        f:replicas: {}
        f:resources:
          .: {}
          f:requests:
            .: {}
            f:cpu: {}
            f:memory: {}
        f:retention: {}
        f:ruleSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:prometheus: {}
            f:role: {}
        f:rules:
          .: {}
          f:alert: {}
        f:scrapeInterval: {}
        f:serviceAccountName: {}
        f:serviceMonitorSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:prometheus: {}
            f:role: {}
    manager: promplus
    operation: Update
    time: "2021-02-23T02:06:49Z"
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:storage:
          .: {}
          f:volumeClaimTemplate:
            .: {}
            f:spec:
              .: {}
              f:accessModes: {}
              f:resources:
                .: {}
                f:requests:
                  .: {}
                  f:storage: {}
              f:storageClassName: {}
    manager: kubectl
    operation: Update
    time: "2021-02-23T03:44:02Z"
  name: system
  namespace: pf9-monitoring
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: false
    controller: false
    kind: Deployment
    name: monhelper
    uid: 9dd25109-bef1-4510-b261-16dd7a62d4bd
  resourceVersion: "169910"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/pf9-monitoring/prometheuses/system
  uid: ce60b009-cb28-4ba9-9db3-9745d14bf267
spec:
  additionalScrapeConfigs:
    key: additional-scrape-config.yaml
    name: scrapeconfig
  alerting:
    alertmanagers:
    - name: sys-alertmanager
      namespace: pf9-monitoring
      port: web
  replicas: 1
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
  retention: 7d
  ruleSelector:
    matchLabels:
      prometheus: system
      role: alert-rules
  rules:
    alert: {}
  scrapeInterval: 2m
  serviceAccountName: system-prometheus
  serviceMonitorSelector:
    matchLabels:
      prometheus: system
      role: service-monitor
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
        storageClassName: portworx-csi-sc
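After saving the change, the Prometheus Operator recreates the prometheus-system pod and should request a persistent volume claim. You can verify that the claim was created and bound with:

kubectl -n pf9-monitoring get pvc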
Troubleshooting
To see if the deployment is healthy, run:

kubectl -n pf9-monitoring get all

The resulting output should show all services in a running state. If any pods or services are still in a creating state, rerun the command.
If there is an issue, the prometheus-system-0 pod will fail to start or enter CrashLoopBackOff.
kubectl -n pf9-monitoring get all

NAME                                      READY   STATUS              RESTARTS   AGE
pod/alertmanager-sysalert-0               2/2     Running             0          9s
pod/grafana-695dccdd85-97gwb              0/2     ContainerCreating   0          4s
pod/kube-state-metrics-68dfc664dc-4hgt8   1/1     Running             0          2m12s
pod/node-exporter-857v7                   1/1     Running             0          114s
pod/node-exporter-98zch                   1/1     Running             0          114s
pod/node-exporter-jkv77                   1/1     Running             0          114s
pod/node-exporter-qq9xf                   1/1     Running             0          114s
pod/prometheus-system-0                   3/3     Running             0          9s

NAME                            TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-operated   ClusterIP   None          <none>        9093/TCP,9094/TCP,9094/UDP   9s
service/grafana-ui              ClusterIP   10.21.1.130   <none>        80/TCP                       4s
service/kube-state-metrics      ClusterIP   None          <none>        8443/TCP,8081/TCP            2m21s
service/node-exporter           ClusterIP   None          <none>        9100/TCP                     118s
service/prometheus-operated     ClusterIP   None          <none>        9090/TCP                     9s
service/sys-alertmanager        ClusterIP   10.21.1.92    <none>        9093/TCP                     9s
service/sys-prometheus          ClusterIP   10.21.2.148   <none>        9090/TCP                     9s

NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/node-exporter   4         4         4       4            4           kubernetes.io/os=linux   114s

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/grafana              0/1     1            0           4s
deployment.apps/kube-state-metrics   1/1     1            1           2m12s

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/grafana-695dccdd85              1         1         0       4s
replicaset.apps/kube-state-metrics-68dfc664dc   1         1         1       2m12s

NAME                                     READY   AGE
statefulset.apps/alertmanager-sysalert   1/1     9s
statefulset.apps/prometheus-system       1/1     9s

Get Monitoring Pod Status
Run:

kubectl -n pf9-monitoring describe pod prometheus-system-0

and review the events output. The output will show any errors impacting the pod state. In the example below, Prometheus is failing to start because its PVC cannot be found. To solve this issue, the PVC must be manually recreated using kubectl, as shown in the sketch after the events.
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  persistentvolumeclaim "prometheus-system-db-prometheus-system-0" is being deleted
  Warning  FailedScheduling  <unknown>  default-scheduler  persistentvolumeclaim "prometheus-system-db-prometheus-system-0" not found
  Warning  FailedScheduling  <unknown>  default-scheduler  persistentvolumeclaim "prometheus-system-db-prometheus-system-0" not found
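A minimal sketch of a replacement PVC, assuming the claim name from the events above and the storage class and size used earlier in this guide, is shown below; adjust the values to match your cluster, then apply it with kubectl apply -f <file>.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-system-db-prometheus-system-0   # claim name taken from the scheduler events above
  namespace: pf9-monitoring
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi                # match the size set in the Prometheus storage spec
  storageClassName: portworx-csi-sc   # replace with your storage class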
View Prometheus Container Logs
If the pod events do not indicate that the issue is within Kubernetes itself, it can be useful to look at the Prometheus container logs. To do this from the Platform9 SaaS Management Plane, navigate to the Workloads dashboard and select the Pods tab. Filter the table to your cluster and set the namespace to pf9-monitoring. Once the table updates, click the view logs link for the prometheus-system-0 pod. This will open the container logs in a new tab in your browser.
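The same logs can also be retrieved with kubectl; the container name prometheus is an assumption based on standard Prometheus Operator naming and may differ in your deployment:

kubectl -n pf9-monitoring logs prometheus-system-0 -c prometheus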
Below is an example of a permissions error that prevents the pod from starting on any node.
level=info ts=2021-02-23T06:26:49.738Z caller=main.go:331 msg="Starting Prometheus" version="(version=2.16.0, branch=HEAD, revision=b90be6f32a33c03163d700e1452b54454ddce0ec)"
level=info ts=2021-02-23T06:26:49.738Z caller=main.go:332 build_context="(go=go1.13.8, user=root@7ea0ae865f12, date=20200213-23:50:02)"
level=info ts=2021-02-23T06:26:49.738Z caller=main.go:333 host_details="(Linux 4.15.0-135-generic #139-Ubuntu SMP Mon Jan 18 17:38:24 UTC 2021 x86_64 prometheus-system-0 (none))"
level=info ts=2021-02-23T06:26:49.738Z caller=main.go:334 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-02-23T06:26:49.738Z caller=main.go:335 vm_limits="(soft=unlimited, hard=unlimited)"
level=error ts=2021-02-23T06:26:49.739Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
panic: Unable to create mmap-ed active query log

goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker(0x7ffca7417a5a, 0xb, 0x14, 0x2c90040, 0xc0006a0510, 0x2c90040)
	/app/promql/query_logger.go:117 +0x4cd
main.main()
	/app/cmd/prometheus/main.go:362 +0x5243
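One possible remediation, assuming the failure is caused by volume ownership rather than the underlying storage, is to set a security context on the Prometheus resource so the data directory is writable by the Prometheus user. The values below are commonly used Prometheus Operator defaults, not Platform9-specific settings, and are shown only as a sketch:

# added under spec: in the Prometheus resource (illustrative values)
securityContext:
  fsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000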
Incorrect Storage Class Name
If you incorrectly specify the storage class name, you will need to first update the Prometheus configuration and then delete the persistent volume claim by running:

kubectl delete pvc <pvc-name> -n pf9-monitoring
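For example, using the claim name shown in the scheduler events earlier:

kubectl delete pvc prometheus-system-db-prometheus-system-0 -n pf9-monitoring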
Once the PVC is deleted, the pods will start up and claim a new PVC.