Instance Reboot Failure as Libvirt Failed to Terminate qemu-kvm Zombie Process

Problem

  • An instance that was rebooted enters a shutdown error state because Libvirt failed to terminate the qemu-kvm process, reporting the error "Device or resource busy".
  • The qemu-kvm process with ID 131611 is in a defunct state, and its parent process is systemd.
  • The libvirtd service logs show that Libvirt failed to force kill the qemu-kvm process with a SIGKILL signal. The termination failed because the process is in a zombie state: it is already dead, and all that remains of it is its entry in the process table.
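The defunct state described above can be confirmed by listing zombie processes together with their parent PIDs. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names parse_stat and find_zombies are illustrative and are not part of any Platform9, Nova, or Libvirt tooling.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    fields = text[right + 1:].split()
    return pid, comm, fields[0], int(fields[1])


def find_zombies(proc_root="/proc"):
    """Return a list of (pid, comm, ppid) for every process in state 'Z'."""
    zombies = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return zombies  # no /proc available on this system
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return zombies


if __name__ == "__main__":
    for pid, comm, ppid in find_zombies():
        print(f"zombie pid={pid} comm={comm} ppid={ppid}")
```

On an affected host, an entry for qemu-kvm with a parent PID of 1 would match the symptom in this article.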

Environment

  • Platform9 Managed OpenStack - All Versions
  • Nova
  • Libvirt/QEMU

Cause

There are two likely causes. First, the host may be overloaded, so that the kernel cannot clean up the process within the time that Libvirt is prepared to wait. In that case, the process should go away on its own after a short while and everything should return to normal.

Second, the process may be stuck in an uninterruptible wait state because of a configuration issue in the storage stack that causes an I/O read or write operation to hang in kernel space. The process then stays in the zombie state indefinitely, or until the storage problem is resolved.
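One rough way to tell the two cases apart is to check whether any thread of the defunct process is in uninterruptible sleep (state "D"), which is how Linux reports a task blocked in the kernel, typically on I/O. The sketch below assumes a Linux host with /proc mounted; thread_states and blocked_threads are hypothetical helper names, and the result is a hint for further investigation rather than a definitive diagnosis.

```python
import os


def state_from_stat(text):
    """Return the one-letter state from the contents of a /proc stat file."""
    # The state is the first field after the closing ')' of the command name.
    return text[text.rindex(")") + 1:].split()[0]


def thread_states(pid, proc_root="/proc"):
    """Map each thread ID of `pid` to its state letter (R, S, D, Z, ...)."""
    states = {}
    task_dir = os.path.join(proc_root, str(pid), "task")
    try:
        tids = os.listdir(task_dir)
    except OSError:
        return states  # process is gone or /proc is unavailable
    for tid in tids:
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                states[int(tid)] = state_from_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # thread exited while we were scanning
    return states


def blocked_threads(states):
    """Return the sorted thread IDs that are in uninterruptible sleep ('D')."""
    return sorted(tid for tid, state in states.items() if state == "D")


if __name__ == "__main__":
    # Hypothetical usage: inspect the current process as a harmless example.
    print(blocked_threads(thread_states(os.getpid())))
```

If threads remain in state "D" for a long time, the storage-related cause is the more plausible one; if none are blocked, the process may simply be waiting to be cleaned up.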

Resolution

  1. Send a SIGCHLD signal to the parent process, which in this case is PID 1 (systemd), to prompt it to clean up its dead child process: # kill -s SIGCHLD 1. For more options, refer to How To Clean Zombie Processes.
  2. Reboot the host.
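The first step above can be sketched as a small script that looks up the parent of a zombie process, sends it SIGCHLD, and then checks whether the zombie has been reaped. This is an illustration only, assuming a Linux host with /proc mounted and root privileges when the parent is PID 1; nudge_parent, its dry_run default, and the returned status strings are hypothetical and not part of any existing tool.

```python
import os
import signal
import time


def read_state_and_ppid(pid, proc_root="/proc"):
    """Return (state, ppid) for `pid`, or None if the process does not exist."""
    try:
        with open(os.path.join(proc_root, str(pid), "stat")) as f:
            # Fields after the last ')' start with the state, then the ppid.
            fields = f.read().rsplit(")", 1)[1].split()
        return fields[0], int(fields[1])
    except (OSError, IndexError, ValueError):
        return None


def nudge_parent(zombie_pid, wait_seconds=2.0, dry_run=True):
    """Send SIGCHLD to the parent of a zombie and report what happened."""
    info = read_state_and_ppid(zombie_pid)
    if info is None:
        return "gone"
    state, ppid = info
    if state != "Z":
        return "not-a-zombie"  # never signal on behalf of a live process
    if dry_run:
        print(f"would run: kill -s SIGCHLD {ppid}")
        return "dry-run"
    os.kill(ppid, signal.SIGCHLD)  # equivalent to: kill -s SIGCHLD <ppid>
    time.sleep(wait_seconds)       # give the parent a moment to reap the child
    if read_state_and_ppid(zombie_pid) is None:
        return "gone"
    return "still-present"
```

Calling nudge_parent(131611, dry_run=False) as root would correspond to the kill command in step 1. A result of "still-present" suggests the signal did not help, in which case the host reboot in step 2 remains the fallback.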

Additional Information

Red Hat Bug
