DRR Not Triggering VM Migration Under Host Memory Pressure, Leading to OOM-Induced VM Crashes

Problem

  • Multiple VMs unexpectedly transition to the SHUTOFF state after crashing at the hypervisor level because of host memory pressure (OOM events).

Kernel Logs
Out of memory: Killed process (qemu-system-x86)
CPU x/KVM invoked oom-killer
  • Despite available memory capacity on other hosts in the cluster, Dynamic Resource Rebalancing (DRR) is not triggering live migrations. As a result, memory pressure continues to build on overloaded hosts, and the Linux kernel OOM killer terminates QEMU processes, leading to VM crashes.

Watcher Logs
Node <uuid> overloaded, attempting to reduce load
No destination hosts suggested by nova scheduler
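
To confirm that a host shows both symptoms, the kernel and watcher logs can be scanned for the messages quoted above. The following is a minimal sketch, not a supported tool: the function names are hypothetical, and it assumes the log lines have already been collected as plain text and that real log lines contain the quoted messages as substrings.

```python
import re

# Signatures taken from the log excerpts in this article.
# Assumption: real log lines contain these messages as substrings.
OOM_KILL = re.compile(r"Out of memory: Killed process.*qemu-system-x86")
OOM_INVOKED = re.compile(r"invoked oom-killer")
DRR_OVERLOADED = re.compile(r"Node \S+ overloaded, attempting to reduce load")
DRR_NO_DEST = re.compile(r"No destination hosts suggested by nova scheduler")


def classify(lines):
    """Count how often each symptom signature appears in the given log lines."""
    counts = {"oom_kill": 0, "oom_invoked": 0, "overloaded": 0, "no_destination": 0}
    for line in lines:
        if OOM_KILL.search(line):
            counts["oom_kill"] += 1
        if OOM_INVOKED.search(line):
            counts["oom_invoked"] += 1
        if DRR_OVERLOADED.search(line):
            counts["overloaded"] += 1
        if DRR_NO_DEST.search(line):
            counts["no_destination"] += 1
    return counts


def matches_issue(kernel_lines, watcher_lines):
    """Return True when QEMU was OOM-killed and DRR found no destination host."""
    kernel = classify(kernel_lines)
    watcher = classify(watcher_lines)
    return kernel["oom_kill"] > 0 and watcher["no_destination"] > 0
```

A match only shows that both symptoms are present in the logs supplied; it does not by itself prove they share a cause.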

Environment

  • Private Cloud Director Virtualization - v2025.10 and higher

  • Self-Hosted Private Cloud Director Virtualization - v2025.10 and higher

Cause

The DRR strategy detects overloaded hosts and attempts to rebalance workloads. During migration planning, however, the Nova scheduler does not return any valid destination hosts, even when sufficient capacity is available within the cluster.

This results in:

  • No live migrations being triggered

  • Continued memory pressure on affected hosts

  • Kernel OOM killer terminating qemu-system-x86 processes

  • VMs crashing and transitioning to SHUTOFF state

This indicates a scheduler/placement decision issue rather than actual resource exhaustion.

Workaround

  • Kill the watcher decision engine pod so that it restarts. After the restart, the pod re-establishes its connection to RabbitMQ, DRR starts functioning again, and migrations may resume.
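
As an illustration of the workaround, the sketch below builds a `kubectl delete pod` command for the watcher decision engine pod. It assumes the pod runs in a Kubernetes cluster reachable with `kubectl` and is recreated automatically after deletion. The namespace and pod name shown are placeholders, not values taken from this article, and the helper function is hypothetical.

```python
import shlex


def restart_command(namespace, pod_name):
    """Build the kubectl command that deletes the pod so that it is recreated."""
    return ["kubectl", "delete", "pod", pod_name, "--namespace", namespace]


# Placeholder values: substitute the namespace and pod name from your deployment.
command = restart_command("<namespace>", "<watcher-decision-engine-pod>")

# Print the command for review instead of executing it.
print(shlex.join(command))
```

The command is printed rather than executed so that it can be reviewed before it is run against a live cluster.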

Additional Information

The engineering team is tracking this issue under PCD-1854, and a fix is planned for the April 2026 release.
