Configuring Persistent Storage
Platform9 Monitoring deploys Prometheus, Alertmanager, and Grafana on a cluster in a single click, and by default this deployment uses ephemeral storage. To configure Platform9 Monitoring to use persistent storage, a storage class must be added to the cluster and the monitoring deployment updated with kubectl to consume that storage class.
To enable persistent storage you must have a Storage Class configured that is able to provision persistent volume claims.
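A quick way to confirm that a usable Storage Class exists, and to see which provisioner backs it, is to list the storage classes on the cluster:
kubectl get storageclass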
Add a Storage Class to Prometheus
The first step is to set up a storage class. If your cluster is running without storage, follow the guide to set up the Portworx CSI driver.
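For reference, a minimal StorageClass backed by the Portworx CSI driver looks roughly like the sketch below. The provisioner name and parameters are assumptions that vary by Portworx installation, so follow the Portworx CSI guide for the exact values.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: portworx-csi-sc
provisioner: pxd.portworx.com    # Portworx CSI driver name (assumed)
parameters:
  repl: "2"                      # Portworx replication factor (example value)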
Once you have a storage class configured, run the kubectl command below to edit the deployment:
kubectl -n pf9-monitoring edit prometheus system
Editing the running configuration uses the Linux command-line text editor Vi. For help with Vi, view this guide.
The default configuration is shown below; it needs to be updated with a valid storage specification.
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  creationTimestamp: "2021-01-15T18:09:32Z"
  generation: 1
  managedFields:
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:additionalScrapeConfigs:
          .: {}
          f:key: {}
          f:name: {}
        f:alerting:
          .: {}
          f:alertmanagers: {}
        f:replicas: {}
        f:resources:
          .: {}
          f:requests:
            .: {}
            f:cpu: {}
            f:memory: {}
        f:retention: {}
        f:ruleSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:prometheus: {}
            f:role: {}
        f:rules:
          .: {}
          f:alert: {}
        f:scrapeInterval: {}
        f:serviceAccountName: {}
        f:serviceMonitorSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:prometheus: {}
            f:role: {}
    manager: promplus
    operation: Update
    time: "2021-01-15T18:09:32Z"
  name: system
  namespace: pf9-monitoring
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: false
    controller: false
    kind: Deployment
    name: monhelper
    uid: cbc48a82-3c1f-4a2b-9b2a-ebbc32ae2e65
  resourceVersion: "2733"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/pf9-monitoring/prometheuses/system
  uid: c1722922-4973-4973-8e29-ba0269ad9a79
spec:
  additionalScrapeConfigs:
    key: additional-scrape-config.yaml
    name: scrapeconfig
  alerting:
    alertmanagers:
    - name: sys-alertmanager
      namespace: pf9-monitoring
      port: web
  replicas: 1
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
  retention: 7d
  ruleSelector:
    matchLabels:
      prometheus: system
      role: alert-rules
  rules:
    alert: {}
  scrapeInterval: 2m
  serviceAccountName: system-prometheus
  serviceMonitorSelector:
    matchLabels:
      prometheus: system
      role: service-monitor
The deployment needs to have the following storage section added under spec. The storage class name must be updated to match your cluster, and the amount of storage should also be specified.
The storage class in this example is backed by Portworx; to add Portworx, see the Portworx CSI guide.
storage:
  volumeClaimTemplate:
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: portworx-csi-sc
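If you prefer not to edit the object interactively, the same change can be applied with a merge patch. This is an equivalent sketch, assuming the same storage class name and size as above:
kubectl -n pf9-monitoring patch prometheus system --type merge -p \
  '{"spec":{"storage":{"volumeClaimTemplate":{"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"5Gi"}},"storageClassName":"portworx-csi-sc"}}}}}'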
The final configuration should match the example below.
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  creationTimestamp: "2021-02-23T02:06:49Z"
  generation: 2
  managedFields:
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:additionalScrapeConfigs:
          .: {}
          f:key: {}
          f:name: {}
        f:alerting:
          .: {}
          f:alertmanagers: {}
        f:replicas: {}
        f:resources:
          .: {}
          f:requests:
            .: {}
            f:cpu: {}
            f:memory: {}
        f:retention: {}
        f:ruleSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:prometheus: {}
            f:role: {}
        f:rules:
          .: {}
          f:alert: {}
        f:scrapeInterval: {}
        f:serviceAccountName: {}
        f:serviceMonitorSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:prometheus: {}
            f:role: {}
    manager: promplus
    operation: Update
    time: "2021-02-23T02:06:49Z"
  - apiVersion: monitoring.coreos.com/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:storage:
          .: {}
          f:volumeClaimTemplate:
            .: {}
            f:spec:
              .: {}
              f:accessModes: {}
              f:resources:
                .: {}
                f:requests:
                  .: {}
                  f:storage: {}
              f:storageClassName: {}
    manager: kubectl
    operation: Update
    time: "2021-02-23T03:44:02Z"
  name: system
  namespace: pf9-monitoring
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: false
    controller: false
    kind: Deployment
    name: monhelper
    uid: 9dd25109-bef1-4510-b261-16dd7a62d4bd
  resourceVersion: "169910"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/pf9-monitoring/prometheuses/system
  uid: ce60b009-cb28-4ba9-9db3-9745d14bf267
spec:
  additionalScrapeConfigs:
    key: additional-scrape-config.yaml
    name: scrapeconfig
  alerting:
    alertmanagers:
    - name: sys-alertmanager
      namespace: pf9-monitoring
      port: web
  replicas: 1
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
  retention: 7d
  ruleSelector:
    matchLabels:
      prometheus: system
      role: alert-rules
  rules:
    alert: {}
  scrapeInterval: 2m
  serviceAccountName: system-prometheus
  serviceMonitorSelector:
    matchLabels:
      prometheus: system
      role: service-monitor
  storage:
    volumeClaimTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
        storageClassName: portworx-csi-sc
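After saving, the Prometheus operator recreates the prometheus-system-0 Pod with a persistent volume. To confirm the claim was created and bound, list the PVCs in the monitoring namespace:
kubectl -n pf9-monitoring get pvc
You should see a claim named prometheus-system-db-prometheus-system-0 in a Bound state.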
Troubleshooting
To see if the deployment is healthy, run kubectl -n pf9-monitoring get all
The resulting output should show all services in a Running state. If any pods or services are still in a creating state, rerun the command.
If there is an issue, the prometheus-system-0 pod will fail to start or enter CrashLoopBackOff.
kubectl -n pf9-monitoring get all
NAME READY STATUS RESTARTS AGE
pod/alertmanager-sysalert-0 2/2 Running 0 9s
pod/grafana-695dccdd85-97gwb 0/2 ContainerCreating 0 4s
pod/kube-state-metrics-68dfc664dc-4hgt8 1/1 Running 0 2m12s
pod/node-exporter-857v7 1/1 Running 0 114s
pod/node-exporter-98zch 1/1 Running 0 114s
pod/node-exporter-jkv77 1/1 Running 0 114s
pod/node-exporter-qq9xf 1/1 Running 0 114s
pod/prometheus-system-0 3/3 Running 0 9s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 9s
service/grafana-ui ClusterIP 10.21.1.130 <none> 80/TCP 4s
service/kube-state-metrics ClusterIP None <none> 8443/TCP,8081/TCP 2m21s
service/node-exporter ClusterIP None <none> 9100/TCP 118s
service/prometheus-operated ClusterIP None <none> 9090/TCP 9s
service/sys-alertmanager ClusterIP 10.21.1.92 <none> 9093/TCP 9s
service/sys-prometheus ClusterIP 10.21.2.148 <none> 9090/TCP 9s
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/node-exporter 4 4 4 4 4 kubernetes.io/os=linux 114s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/grafana 0/1 1 0 4s
deployment.apps/kube-state-metrics 1/1 1 1 2m12s
NAME DESIRED CURRENT READY AGE
replicaset.apps/grafana-695dccdd85 1 1 0 4s
replicaset.apps/kube-state-metrics-68dfc664dc 1 1 1 2m12s
NAME READY AGE
statefulset.apps/alertmanager-sysalert 1/1 9s
statefulset.apps/prometheus-system 1/1 9s
Get Monitoring Pod Status
Run kubectl -n pf9-monitoring describe pod prometheus-system-0 and review the Events output. The output will show any errors impacting the Pod state. In the example below, Prometheus is failing to start because its PVC cannot be found. To solve this issue, the PVC must be manually recreated with kubectl, as shown in the sketch after the events output.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler persistentvolumeclaim "prometheus-system-db-prometheus-system-0" is being deleted
Warning FailedScheduling <unknown> default-scheduler persistentvolumeclaim "prometheus-system-db-prometheus-system-0" not found
Warning FailedScheduling <unknown> default-scheduler persistentvolumeclaim "prometheus-system-db-prometheus-system-0" not found
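A minimal PVC manifest to recreate the missing claim is sketched below; the size, access mode, and storage class are assumptions taken from the storage section added earlier and must match your configuration. Save it to a file and apply it with kubectl apply -f <file>.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-system-db-prometheus-system-0   # claim name expected by the StatefulSet
  namespace: pf9-monitoring
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi                     # match the size in the Prometheus spec
  storageClassName: portworx-csi-sc    # match your cluster's storage class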
View Prometheus Container Logs
If the Pod events do not indicate that the issue is within Kubernetes itself, it can be useful to look at the Prometheus container logs. To do this from the Platform9 SaaS Management Plane, navigate to the Workloads dashboard and select the Pods tab. Filter the table to your cluster and set the namespace to pf9-monitoring. Once the table updates, click the View Logs link for the prometheus-system-0 container. This opens the container logs in a new browser tab.
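The same logs can also be pulled from the command line. The container name prometheus is the usual name inside the prometheus-system-0 Pod, but verify it with kubectl describe if your deployment differs:
kubectl -n pf9-monitoring logs prometheus-system-0 -c prometheus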
Below is an example of a permissions error preventing the Pod from starting on each node.
level=info ts=2021-02-23T06:26:49.738Z caller=main.go:331 msg="Starting Prometheus" version="(version=2.16.0, branch=HEAD, revision=b90be6f32a33c03163d700e1452b54454ddce0ec)"
level=info ts=2021-02-23T06:26:49.738Z caller=main.go:332 build_context="(go=go1.13.8, user=root@7ea0ae865f12, date=20200213-23:50:02)"
level=info ts=2021-02-23T06:26:49.738Z caller=main.go:333 host_details="(Linux 4.15.0-135-generic #139-Ubuntu SMP Mon Jan 18 17:38:24 UTC 2021 x86_64 prometheus-system-0 (none))"
level=info ts=2021-02-23T06:26:49.738Z caller=main.go:334 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-02-23T06:26:49.738Z caller=main.go:335 vm_limits="(soft=unlimited, hard=unlimited)"
level=error ts=2021-02-23T06:26:49.739Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker(0x7ffca7417a5a, 0xb, 0x14, 0x2c90040, 0xc0006a0510, 0x2c90040)
/app/promql/query_logger.go:117 +0x4cd
main.main()
/app/cmd/prometheus/main.go:362 +0x5243
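This error means the Prometheus process cannot write to the mounted volume. One common remedy, sketched below, is to set a Pod security context on the Prometheus resource (kubectl -n pf9-monitoring edit prometheus system) so the volume is group-writable by the Prometheus user; the exact user and fsGroup values are assumptions and depend on your environment and storage backend.
spec:
  securityContext:
    runAsUser: 1000      # non-root UID for the Prometheus container (example value)
    runAsNonRoot: true
    fsGroup: 2000        # group ownership applied to the mounted volume (example value)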
Incorrect Storage Class Name
If you specify an incorrect storage class name, you will need to first update the Prometheus configuration and then delete the persistent volume claim by running: kubectl delete pvc <pvc-name> -n pf9-monitoring
Once the PVC is deleted, the Pod will start up and claim a new PVC.
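For example, assuming the default claim name, the sequence looks like this:
kubectl -n pf9-monitoring get pvc
kubectl -n pf9-monitoring delete pvc prometheus-system-db-prometheus-system-0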