High Disk Usage on / Filesystem Due to Prometheus Data-Agent WAL Files
Problem
Disk usage on the root filesystem (/) may increase unexpectedly, with most of the space consumed by files under:
/etc/pf9/prometheus/data-agent/wal
Despite Prometheus being configured to use a PersistentVolumeClaim (PVC) for metric data, large Write-Ahead Log (WAL) files accumulate in this local path, consuming significant disk space.
Environment
- Platform9 Managed Kubernetes - v5.11 and higher.
- Component: Monitoring (Prometheus Agent)
Cause
This issue occurs when the Prometheus data-agent cannot forward metrics to its remote write endpoint.
During these failures, unsent data is temporarily written to the local WAL directory under:
/etc/pf9/prometheus/data-agent/wal
If the remote write operation remains blocked (for example, due to “context deadline exceeded” or service interruptions), these WAL files grow continuously, eventually filling the host filesystem.
Although Prometheus metric data is stored on a PVC, the agent’s WAL buffer is local and not part of the PVC.
Diagnostics
- Check disk usage on affected node
$ df -h /
$ du -sh /etc/pf9/prometheus/data-agent
$ du -sh /etc/pf9/prometheus/data-agent/wal
High or growing usage on / and significant space consumed under data-agent confirm the issue.
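The disk-usage check above can be sketched as a small shell helper that compares the WAL directory size against a limit. This is a minimal sketch, not part of the Platform9 tooling: the function names dir_size_mb and check_wal_size are hypothetical, and the 200 MB default is an assumption borrowed from the "typically <200 MB" figure in the Resolution section.

```shell
# Hypothetical helper: print the size of a directory in whole megabytes.
# Prints nothing if the directory cannot be read.
dir_size_mb() {
    du -sm "$1" 2>/dev/null | awk '{print $1}'
}

# Hypothetical helper: compare a directory's size against a limit in MB.
# Returns 0 when within the limit, 1 when above it, 2 when unreadable.
# The 200 MB default is an assumption, not a product setting.
check_wal_size() {
    wal_dir=$1
    limit_mb=${2:-200}
    size_mb=$(dir_size_mb "$wal_dir")
    if [ -z "$size_mb" ]; then
        echo "UNKNOWN: cannot read $wal_dir"
        return 2
    fi
    if [ "$size_mb" -gt "$limit_mb" ]; then
        echo "WARN: $wal_dir uses ${size_mb} MB (limit ${limit_mb} MB)"
        return 1
    fi
    echo "OK: $wal_dir uses ${size_mb} MB (limit ${limit_mb} MB)"
}
```

For example, check_wal_size /etc/pf9/prometheus/data-agent/wal 200 prints a one-line status and returns a non-zero exit code once the directory passes the limit, which makes it easy to call from a periodic job.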
- Check agent service health
Verify whether the Prometheus agent (pf9-prometheus) is running properly and able to send data to its remote write endpoint.
$ sudo systemctl status pf9-prometheus
$ sudo journalctl -u pf9-prometheus -f
$ sudo cat /var/log/pf9-prometheus.log
Look for errors such as:
Failed to send batch, retrying
err="Post \"http://localhost:9118/api/v1/write\": context deadline exceeded"
These messages mean the remote write path is blocked, often due to a temporary network interruption.
When this happens, metrics accumulate in the local write-ahead log (WAL) under /etc/pf9/prometheus/data-agent/wal, causing the directory to grow continuously.
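To gauge how often the remote write is failing, the matching log lines can be counted rather than read by eye. The following is a minimal sketch; the function name count_remote_write_errors is hypothetical, and it assumes the error text appears in the log exactly as "context deadline exceeded".

```shell
# Hypothetical helper: count lines that mention the remote write timeout.
# Reads the file given as $1, or standard input when no file is given.
count_remote_write_errors() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so the exit status is deliberately ignored.
    grep -c 'context deadline exceeded' "${1:--}" || true
}
```

It can be pointed at the log file, for example count_remote_write_errors /var/log/pf9-prometheus.log, or fed from the journal with sudo journalctl -u pf9-prometheus --since "1 hour ago" --no-pager | count_remote_write_errors. A count that keeps rising between runs suggests the remote write path is still blocked.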
- Inspect WAL directory
Check the contents of the Prometheus data-agent WAL (Write-Ahead Log) directory to understand how much data has accumulated.
$ ls -lh /etc/pf9/prometheus/data-agent/wal
$ du -sh /etc/pf9/prometheus/data-agent/wal
Sequentially numbered WAL files in the directory (e.g., 00000376, 00000377) confirm backlog accumulation.
These files may be tens or hundreds of megabytes each. If the total usage exceeds a few gigabytes and continues to grow, the Prometheus agent is unable to flush data to its remote write endpoint.
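Whether the backlog is still growing can be checked by counting the segment files and sampling the directory size twice. This is a minimal sketch under stated assumptions: the function names are hypothetical, segment files are assumed to sit directly in the WAL directory with names that start with a digit (as in the examples above), and the 60-second default interval is arbitrary.

```shell
# Hypothetical helper: count numbered segment files directly inside a directory.
# Subdirectories and their contents are not counted.
count_wal_segments() {
    find "$1" -maxdepth 1 -type f -name '[0-9]*' | wc -l | tr -d ' '
}

# Hypothetical helper: print how many KB a directory grew over an interval.
# $1 is the directory, $2 the interval in seconds (assumed default: 60).
wal_growth_kb() {
    [ -d "$1" ] || return 2
    before=$(du -sk "$1" | awk '{print $1}')
    sleep "${2:-60}"
    after=$(du -sk "$1" | awk '{print $1}')
    echo $((after - before))
}
```

For example, wal_growth_kb /etc/pf9/prometheus/data-agent/wal 300 reports the change over five minutes. A value that stays positive across several samples, together with a rising segment count, matches the backlog pattern described above.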
Resolution
Follow these steps on each affected node:
- Check service status:
$ sudo systemctl status pf9-prometheus
- Restart communication service:
$ sudo systemctl restart pf9-comms
- Stop the Prometheus agent:
$ sudo systemctl stop pf9-prometheus
- Remove old local WAL data:
$ sudo rm -rf /etc/pf9/prometheus/data-agent/wal
- Monitor for errors:
$ sudo journalctl -u pf9-prometheus -f
Confirm no repeating context deadline exceeded messages.
- Verify directory size:
$ du -sh /etc/pf9/prometheus/data-agent/wal
The directory should remain small (typically <200 MB).
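When several nodes are affected, the state-changing steps above could be wrapped in a small script that defaults to a dry run. The following is a minimal sketch, not a supported tool: the names run, clear_agent_wal, DRY_RUN, and WAL_DIR are hypothetical, and it only mirrors the restart, stop, and remove commands listed above in the same order.

```shell
# Assumed defaults; both variables are hypothetical and can be overridden.
WAL_DIR=${WAL_DIR:-/etc/pf9/prometheus/data-agent/wal}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Restart pf9-comms, stop pf9-prometheus, then remove the local WAL data.
# Each step runs only if the previous one succeeded.
clear_agent_wal() {
    run sudo systemctl restart pf9-comms &&
    run sudo systemctl stop pf9-prometheus &&
    run sudo rm -rf "$WAL_DIR"
}
```

Calling clear_agent_wal with the defaults only prints the three commands. Setting DRY_RUN=0 executes them, so it is worth reviewing the dry-run output on each node first, then following up with the journalctl and du checks above.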
Additional Information
- The data-agent WAL directory is a transient local buffer; it is separate from Prometheus PVC storage.
- The buildup of files in /etc/pf9/prometheus/data-agent/wal usually points to network or service-level issues between pf9-prometheus and its remote write endpoint (localhost:9118).
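One quick way to narrow down a service-level issue is to check whether anything is listening on the remote write port at all. This is a minimal sketch; the function name port_is_listening is hypothetical, and it assumes the output format of ss -ltn, where listening sockets appear on lines containing LISTEN with the local address written as address:port.

```shell
# Hypothetical helper: read `ss -ltn` output on standard input and
# succeed if any listening socket is bound to the given port ($1).
port_is_listening() {
    awk -v suffix=":$1" '
        /LISTEN/ {
            for (i = 1; i <= NF; i++) {
                # Compare the tail of each field with ":<port>".
                n = length($i) - length(suffix) + 1
                if (n > 0 && substr($i, n) == suffix) found = 1
            }
        }
        END { exit !found }
    '
}
```

For example, ss -ltn | port_is_listening 9118 && echo "listener present" || echo "nothing listening on 9118". A missing listener points at the local service behind the endpoint rather than at the Prometheus agent itself.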