High Disk Usage on / Filesystem Due to Prometheus Data-Agent WAL Files

Prevent unexpected disk usage on your root filesystem in Platform9 Managed Kubernetes by managing the Prometheus agent's Write-Ahead Log (WAL). Learn diagnostics, causes, and resolution steps to keep

Problem

Disk usage on the root filesystem (/) may increase unexpectedly, with most of the space consumed by files under:

/etc/pf9/prometheus/data-agent/wal

Despite Prometheus being configured to use a PersistentVolumeClaim (PVC) for metric data, large Write-Ahead Log (WAL) files accumulate in this local path, consuming significant disk space.

Environment

  • Platform9 Managed Kubernetes - v5.11 and higher.

  • Component: Monitoring (Prometheus Agent)

Cause

This issue occurs when the Prometheus data-agent cannot forward metrics to its remote write endpoint.

During these failures, unsent data is temporarily written to the local WAL directory under:

/etc/pf9/prometheus/data-agent/wal

If the remote write operation remains blocked (for example, due to “context deadline exceeded” or service interruptions), these WAL files grow continuously, eventually filling the host filesystem.

Although Prometheus metric data is stored on a PVC, the agent’s WAL buffer is local and not part of the PVC.

Diagnostics

  1. Check disk usage on the affected node

High or growing usage on / and significant space consumed under data-agent confirm the issue.
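A minimal sketch of this check, assuming a POSIX shell with df, du, and awk available on the node. The dir_size_kb helper is a hypothetical name introduced here for illustration, and WAL_DIR defaults to the path given in this article.

```shell
#!/bin/sh
# Path taken from this article; override WAL_DIR to inspect another location.
WAL_DIR="${WAL_DIR:-/etc/pf9/prometheus/data-agent/wal}"

# Hypothetical helper: print the size of a directory in kilobytes,
# or 0 if the directory does not exist.
dir_size_kb() {
    if [ -d "$1" ]; then
        du -sk "$1" | awk '{print $1}'
    else
        echo 0
    fi
}

# Overall usage of the root filesystem.
df -h /

# Space consumed by the data-agent WAL directory.
echo "WAL directory size: $(dir_size_kb "$WAL_DIR") KB"
```

Running the script a few minutes apart shows whether the WAL directory is still growing.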

  2. Check agent service health

Verify whether the Prometheus agent (pf9-prometheus) is running properly and able to send data to its remote write endpoint.

Look for errors such as “context deadline exceeded” in the agent's output.

These messages mean the remote write path is blocked, often due to a temporary network interruption.

When this happens, metrics accumulate in the local write-ahead log (WAL) under /etc/pf9/prometheus/data-agent/wal, causing the directory to grow continuously.
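One way to run this check is sketched below. It assumes pf9-prometheus is a systemd unit whose output goes to the journal; if the agent logs to a file on your nodes, pipe that file into the counter instead. The function names count_remote_write_errors and check_agent are hypothetical.

```shell
#!/bin/sh
# Service name taken from this article.
AGENT_SERVICE="${AGENT_SERVICE:-pf9-prometheus}"

# Hypothetical helper: count log lines on stdin that contain the
# remote-write timeout error named in this article.
count_remote_write_errors() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c "context deadline exceeded" || true
}

# ASSUMPTION: the agent is managed by systemd and logs to the journal.
check_agent() {
    systemctl status "$AGENT_SERVICE" --no-pager
    journalctl -u "$AGENT_SERVICE" --since "1 hour ago" --no-pager \
        | count_remote_write_errors
}
```

Call check_agent on the affected node. A count that keeps rising between runs suggests the remote write path is still blocked.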

  3. Inspect WAL directory

Check the contents of the Prometheus data-agent WAL (Write-Ahead Log) directory to understand how much data has accumulated.

Sequentially numbered WAL files in the directory (e.g., 00000376, 00000377) confirm backlog accumulation.

These files may be tens or hundreds of megabytes each. If the total usage exceeds a few gigabytes and continues to grow, the Prometheus agent is unable to flush data to its remote write endpoint.
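The inspection can be sketched as follows, assuming WAL segment files sit directly in the WAL directory and have purely numeric names, as in the examples above. count_wal_segments is a hypothetical helper name.

```shell
#!/bin/sh
# Path taken from this article; override WAL_DIR to inspect another location.
WAL_DIR="${WAL_DIR:-/etc/pf9/prometheus/data-agent/wal}"

# Hypothetical helper: count regular files directly inside a directory
# whose names consist only of digits (for example 00000376).
count_wal_segments() {
    find "$1" -maxdepth 1 -type f -name '[0-9]*' 2>/dev/null \
        | grep -c -E '/[0-9]+$' || true
}

# List the files with sizes and print the total, if the directory exists.
if [ -d "$WAL_DIR" ]; then
    ls -lh "$WAL_DIR"
    du -sh "$WAL_DIR"
fi

echo "WAL segments: $(count_wal_segments "$WAL_DIR")"
```

A segment count and total size that both increase between runs indicate that the backlog is not being flushed.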

Resolution

Follow these steps on each affected node:

  1. Check service status

  2. Restart communication service

  3. Stop the Prometheus agent

  4. Remove old local WAL data

  5. Monitor for errors

Confirm that no “context deadline exceeded” messages repeat.

  6. Verify directory size

The directory should remain small (typically <200 MB).
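The sequence above can be sketched as a script. Several details are assumptions rather than facts from this article: that both services are systemd units, that the communication service is named pf9-comms, that only the contents of the WAL directory (not the directory itself) should be removed, and that the agent has to be started again after the cleanup. The script defaults to a dry run that only prints each command; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
AGENT_SERVICE="${AGENT_SERVICE:-pf9-prometheus}"   # named in this article
COMMS_SERVICE="${COMMS_SERVICE:-pf9-comms}"        # ASSUMPTION: unit name not given in the article
WAL_DIR="${WAL_DIR:-/etc/pf9/prometheus/data-agent/wal}"
DRY_RUN="${DRY_RUN:-1}"                            # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "DRY-RUN: $*"
    else
        "$@"
    fi
}

# Delete everything inside the WAL directory but keep the directory itself.
# ASSUMPTION: the agent recreates its WAL files on the next start.
clear_wal() {
    [ -d "$1" ] || return 0
    run find "$1" -mindepth 1 -delete
}

resolve_wal_backlog() {
    run systemctl status "$AGENT_SERVICE" --no-pager   # 1. check service status
    run systemctl restart "$COMMS_SERVICE"             # 2. restart communication service
    run systemctl stop "$AGENT_SERVICE"                # 3. stop the agent
    clear_wal "$WAL_DIR"                               # 4. remove old local WAL data
    run systemctl start "$AGENT_SERVICE"               # ASSUMPTION: start the agent again
    run du -sh "$WAL_DIR"                              # 6. verify directory size
}

resolve_wal_backlog
```

Review the dry-run output before running with DRY_RUN=0, then watch the agent's log for repeating context deadline exceeded messages (step 5) and re-check the directory size after a few minutes.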

Additional Information

  • The data-agent WAL directory is a transient local buffer; it is separate from Prometheus PVC storage.

  • The buildup of files in /etc/pf9/prometheus/data-agent/wal usually points to network or service-level issues between pf9-prometheus and its remote write endpoint (localhost:9118).
