High Disk Usage on / Filesystem Due to Prometheus Data-Agent WAL Files

Problem

Disk usage on the root filesystem (/) may increase unexpectedly, with most of the space consumed by files under the Prometheus data-agent WAL directory, /etc/pf9/prometheus/data-agent/wal.

Despite Prometheus being configured to use a PersistentVolumeClaim (PVC) for metric data, large Write-Ahead Log (WAL) files accumulate in this local path, consuming significant disk space.

Environment

  • Platform9 Managed Kubernetes - v5.11 and higher.
  • Component: Monitoring (Prometheus Agent)

Cause

This issue occurs when the Prometheus data-agent cannot forward metrics to its remote write endpoint.

During these failures, unsent data is temporarily written to the local WAL directory, /etc/pf9/prometheus/data-agent/wal.

If the remote write operation remains blocked (for example, due to “context deadline exceeded” or service interruptions), these WAL files grow continuously, eventually filling the host filesystem.

Although Prometheus metric data is stored on a PVC, the agent’s WAL buffer is local and not part of the PVC.

Diagnostics

  1. Check disk usage on the affected node

High or growing usage on /, together with significant space consumed under the data-agent directory, confirms the issue.

  2. Check agent service health

Verify whether the Prometheus agent (pf9-prometheus) is running properly and able to send data to its remote write endpoint.

Look for errors such as “context deadline exceeded”.

These messages mean the remote write path is blocked, often due to a temporary network interruption.

When this happens, metrics accumulate in the local write-ahead log (WAL) under /etc/pf9/prometheus/data-agent/wal, causing the directory to grow continuously.

  3. Inspect the WAL directory

Check the contents of the Prometheus data-agent WAL (Write-Ahead Log) directory to understand how much data has accumulated.

Sequentially numbered WAL files in the directory (e.g., 00000376, 00000377) confirm backlog accumulation.

These files may be tens or hundreds of megabytes each. If the total usage exceeds a few gigabytes and continues to grow, the Prometheus agent is unable to flush data to its remote write endpoint.
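
The three checks above can be approximated with standard tools. The following is a minimal sketch and not the set of commands originally published with this article. The function names are hypothetical, the default path is the WAL directory named above, and the log file passed to count_deadline_errors is an assumption, because the article does not state where pf9-prometheus writes its logs.

```shell
#!/bin/sh
# Default to the WAL directory named in this article; override via WAL_DIR.
WAL_DIR="${WAL_DIR:-/etc/pf9/prometheus/data-agent/wal}"

# Percentage (digits only) of the filesystem holding "$1" that is in use.
fs_used_percent() {
    df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Total size of a directory in kilobytes (prints 0 if it does not exist).
dir_size_kb() {
    if [ -d "$1" ]; then
        du -sk "$1" | awk '{ print $1 }'
    else
        echo 0
    fi
}

# Number of files whose names consist only of digits, e.g. 00000376.
count_wal_segments() {
    if [ -d "$1" ]; then
        find "$1" -type f | grep -c '/[0-9][0-9]*$' || true
    else
        echo 0
    fi
}

# Number of lines in a log file that contain the remote write timeout error.
count_deadline_errors() {
    if [ -f "$1" ]; then
        grep -c 'context deadline exceeded' "$1" || true
    else
        echo 0
    fi
}

echo "root filesystem used: $(fs_used_percent /)%"
echo "WAL directory size:   $(dir_size_kb "$WAL_DIR") KB"
echo "WAL segment files:    $(count_wal_segments "$WAL_DIR")"
```

Running the script twice a few minutes apart shows whether the WAL directory is still growing. To count timeout errors, call count_deadline_errors with the path of a log file exported from the agent.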

Resolution

Follow these steps on each affected node:

  1. Check service status.
  2. Restart the communication service.
  3. Stop the Prometheus agent.
  4. Remove old local WAL data.
  5. Monitor for errors.

Confirm that no “context deadline exceeded” messages repeat.

  6. Verify directory size.

The directory should remain small (typically <200 MB).
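
The cleanup and size check from the steps above can be sketched as follows. This is an illustration under stated assumptions, not the original commands: clear_wal and wal_within_limit are hypothetical helper names, the use of systemctl is assumed, and starting the agent again after the cleanup is also an assumption that the step list does not spell out. By default clear_wal only lists what it would remove.

```shell
#!/bin/sh
# Values taken from this article; override through the environment if needed.
WAL_DIR="${WAL_DIR:-/etc/pf9/prometheus/data-agent/wal}"
AGENT_SERVICE="${AGENT_SERVICE:-pf9-prometheus}"
MAX_WAL_MB="${MAX_WAL_MB:-200}"

# Remove everything inside a WAL directory but keep the directory itself.
# Only lists the entries unless the second argument is "--apply".
clear_wal() {
    dir="$1"
    mode="${2:-}"
    # Refuse empty paths, the root directory, and missing directories.
    if [ -z "$dir" ] || [ "$dir" = "/" ] || [ ! -d "$dir" ]; then
        echo "refusing to clean: '$dir'" >&2
        return 1
    fi
    if [ "$mode" = "--apply" ]; then
        find "$dir" -mindepth 1 -delete
    else
        find "$dir" -mindepth 1 -print
    fi
}

# Succeeds when the directory size is at or below the given limit in MB.
wal_within_limit() {
    size_kb=$(du -sk "$1" | awk '{ print $1 }')
    [ "$size_kb" -le $(( $2 * 1024 )) ]
}

# Assumed sequence on an affected node (service commands are an assumption):
#   systemctl stop "$AGENT_SERVICE"
#   clear_wal "$WAL_DIR"            # dry run: review the list first
#   clear_wal "$WAL_DIR" --apply
#   systemctl start "$AGENT_SERVICE"
#   wal_within_limit "$WAL_DIR" "$MAX_WAL_MB" && echo "WAL size OK"
```

Keeping the dry run as the default makes it harder to delete data from the wrong path. Stop the agent before clearing the directory so that it does not write to files that are being removed.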

Additional Information

  • The data-agent WAL directory is a transient local buffer; it is separate from Prometheus PVC storage.
  • The buildup of files in /etc/pf9/prometheus/data-agent/wal usually points to network or service-level issues between pf9-prometheus and its remote write endpoint (localhost:9118).
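
A quick way to see whether anything is listening on the remote write endpoint is a plain TCP connection test. The sketch below is an assumption-laden illustration: endpoint_reachable is a hypothetical helper, and it relies on bash's /dev/tcp redirection and the coreutils timeout command being available on the node. A successful connection only shows that the port accepts connections; it does not prove that remote write requests succeed.

```shell
#!/bin/sh
# Returns 0 when a TCP connection to host "$1" on port "$2" opens within 3 seconds.
endpoint_reachable() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# localhost:9118 is the remote write endpoint named in this article.
if endpoint_reachable localhost 9118; then
    echo "localhost:9118 accepts connections"
else
    echo "cannot connect to localhost:9118"
fi
```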