High Disk Usage on / Filesystem Due to Prometheus Data-Agent WAL Files

Problem

Disk usage on the root filesystem (/) may increase unexpectedly, with most of the space consumed by files under:

/etc/pf9/prometheus/data-agent/wal

Despite Prometheus being configured to use a PersistentVolumeClaim (PVC) for metric data, large Write-Ahead Log (WAL) files accumulate in this local path, consuming significant disk space.

Environment

Platform9 Managed Kubernetes - v5.11 and higher.
Component: Monitoring (Prometheus Agent)

Cause

This issue occurs when the Prometheus data-agent cannot forward metrics to its remote write endpoint.

During these failures, unsent data is temporarily written to the local WAL directory under:

/etc/pf9/prometheus/data-agent/wal

If the remote write operation remains blocked (for example, due to “context deadline exceeded” or service interruptions), these WAL files grow continuously, eventually filling the host filesystem.

Although Prometheus metric data is stored on a PVC, the agent’s WAL buffer is local and not part of the PVC.
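
One way to confirm that the WAL directory lives on the host root filesystem rather than on a separately mounted volume is to ask df which mount point backs it. The following is a minimal sketch, assuming a POSIX shell; the helper name mount_point_of is hypothetical and not part of any Platform9 tooling.

```shell
#!/bin/sh
# mount_point_of is an illustrative helper name (an assumption, not a pf9 tool).
# It prints the mount point of the filesystem that holds the given path.
mount_point_of() {
    # df -P prints a header plus one POSIX-format line; the mount point is the last column.
    # Note: a mount point containing spaces would be truncated by this simple parse.
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

WAL_DIR=/etc/pf9/prometheus/data-agent/wal
if [ -d "$WAL_DIR" ]; then
    echo "WAL directory is on: $(mount_point_of "$WAL_DIR")"
fi
```

If this prints /, the WAL shares space with the root filesystem, which matches the behavior described above.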

Diagnostics

Check disk usage on affected node

Command

$ df -h /
$ du -sh /etc/pf9/prometheus/data-agent
$ du -sh /etc/pf9/prometheus/data-agent/wal

High or growing usage on / and significant space consumed under data-agent confirm the issue.
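
For repeated checks, the df output can be reduced to a single number and compared against a threshold. This is only a sketch: the helper name usage_percent and the 80% threshold are assumptions chosen for illustration.

```shell
#!/bin/sh
# usage_percent is an illustrative helper name (an assumption, not a pf9 tool).
# It prints the used-space percentage (digits only) of the filesystem holding a path.
usage_percent() {
    # With -P, column 5 of the data line is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

THRESHOLD=80   # assumed alert level; adjust to local policy
pct=$(usage_percent /)

case "$pct" in
    ''|*[!0-9]*)
        echo "could not determine usage of /"
        ;;
    *)
        if [ "$pct" -ge "$THRESHOLD" ]; then
            echo "WARNING: / is ${pct}% full"
        else
            echo "OK: / is ${pct}% full"
        fi
        ;;
esac
```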

Check agent service health

Verify whether the Prometheus agent (pf9-prometheus) is running properly and able to send data to its remote write endpoint.

Command

$ sudo systemctl status pf9-prometheus
$ sudo journalctl -u pf9-prometheus -f
$ sudo cat /var/log/pf9-prometheus.log

Look for errors such as:

Log

Failed to send batch, retrying err="Post \"http://localhost:9118/api/v1/write\": context deadline exceeded"

These messages mean the remote write path is blocked, often due to a temporary network interruption.

When this happens, metrics accumulate in the local write-ahead log (WAL) under /etc/pf9/prometheus/data-agent/wal, causing the directory to grow continuously.
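
To judge whether the errors are still occurring, it can help to count them rather than scan the log by eye. The sketch below assumes the log file path used in the command above; the helper name count_deadline_errors is hypothetical.

```shell
#!/bin/sh
# count_deadline_errors is an illustrative helper name (an assumption, not a pf9 tool).
# It prints how many lines of a log file contain the remote-write timeout message.
count_deadline_errors() {
    # -c prints the number of matching lines; -F treats the pattern as a fixed string.
    grep -c -F 'context deadline exceeded' "$1"
}

LOG_FILE=/var/log/pf9-prometheus.log
if [ -r "$LOG_FILE" ]; then
    echo "deadline errors so far: $(count_deadline_errors "$LOG_FILE")"
fi
```

Running this twice a few minutes apart gives a rough signal: a rising count suggests the remote write path is still blocked.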

Inspect WAL directory

Check the contents of the Prometheus data-agent WAL (Write-Ahead Log) directory to understand how much data has accumulated.

Command

$ ls -lh /etc/pf9/prometheus/data-agent/wal
$ du -sh /etc/pf9/prometheus/data-agent/wal

The directory contains sequentially numbered WAL files (e.g., 00000376, 00000377). A long, growing run of these files confirms backlog accumulation.

These files may be tens or hundreds of megabytes each. If the total usage exceeds a few gigabytes and continues to grow, the Prometheus agent is unable to flush data to its remote write endpoint.
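
The inspection above can be summarized as a segment count plus a total size. This is a minimal sketch, assuming that segment files are the regular files whose names consist only of digits (as in the examples above); the helper names count_segments and dir_size_kb are hypothetical.

```shell
#!/bin/sh
# count_segments and dir_size_kb are illustrative helper names (assumptions).

# Count regular files in a directory whose names consist only of digits.
count_segments() {
    n=0
    for f in "$1"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and unmatched globs
        case "${f##*/}" in
            *[!0-9]*) ;;                 # name contains a non-digit: not counted
            *) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Print the total disk usage of a directory in kilobytes.
dir_size_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

WAL_DIR=/etc/pf9/prometheus/data-agent/wal
if [ -d "$WAL_DIR" ]; then
    echo "segments: $(count_segments "$WAL_DIR"), size: $(dir_size_kb "$WAL_DIR") KB"
fi
```

Comparing two readings taken some minutes apart shows whether the backlog is still growing.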

Resolution

Follow these steps on each affected node:

Check service status:

Command
    
 
$ sudo systemctl status pf9-prometheus

Restart communication service:

Command
    
 
$ sudo systemctl restart pf9-comms

Stop the Prometheus agent:

Command
    
 
$ sudo systemctl stop pf9-prometheus

Remove old local WAL data:

Command
    
 
$ sudo rm -rf /etc/pf9/prometheus/data-agent/wal

Start the Prometheus agent again:

Command

$ sudo systemctl start pf9-prometheus

Monitor for errors:

Command
    
 
$ sudo journalctl -u pf9-prometheus -f

Confirm that no “context deadline exceeded” messages keep repeating.

Verify directory size:

Command
    
 
$ du -sh /etc/pf9/prometheus/data-agent/wal

The directory should remain small (typically <200 MB).
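
The size check can also be expressed as a pass/fail test against the 200 MB figure mentioned above. This is a sketch only; the helper name wal_within_limit is hypothetical, and the size is rounded down to whole megabytes.

```shell
#!/bin/sh
# wal_within_limit is an illustrative helper name (an assumption, not a pf9 tool).
# It succeeds (exit 0) when the directory's disk usage, in whole MB,
# is at or below the given limit.
wal_within_limit() {
    dir=$1
    limit_mb=$2
    # du -sk reports kilobytes; integer division converts to megabytes.
    size_kb=$(du -sk "$dir" | awk '{ print $1 }')
    size_mb=$((size_kb / 1024))
    echo "${dir}: ${size_mb} MB (limit ${limit_mb} MB)"
    [ "$size_mb" -le "$limit_mb" ]
}

WAL_DIR=/etc/pf9/prometheus/data-agent/wal
if [ -d "$WAL_DIR" ]; then
    wal_within_limit "$WAL_DIR" 200 || echo "WAL is larger than expected; recheck the agent logs"
fi
```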

Additional Information

The data-agent WAL directory is a transient local buffer; it is separate from Prometheus PVC storage.
The buildup of files in /etc/pf9/prometheus/data-agent/wal usually points to network or service-level issues between pf9-prometheus and its remote write endpoint (localhost:9118).
