Instance Reboot Failure as Libvirt Failed to Terminate qemu-kvm Zombie Process

Problem

  • A rebooted instance goes into a shutdown error state because Libvirt failed to terminate the qemu-kvm process, reporting the error "Device or resource busy".

265 WARNING nova.virt.libvirt.driver [req-95e44efb-e57a-4242-8af5-0fb8744f1292 root@org.com Production] [instance: f828064b-91df-4ab1-93f4-2e4d57bb38b2] Error from libvirt during destroy. Code=38 Error=Failed to terminate process 131611 with SIGKILL: Device or resource busy; attempt 3 of 3
268 ERROR nova.compute.manager [req-95e44efb-e57a-4242-8af5-0fb8744f1292 root@org.com Production] [instance: f828064b-91df-4ab1-93f4-2e4d57bb38b2] Cannot reboot instance: Failed to terminate process 131611 with SIGKILL: Device or resource busy
335 INFO nova.compute.manager [req-95e44efb-e57a-4242-8af5-0fb8744f1292 root@org.com Production] [instance: f828064b-91df-4ab1-93f4-2e4d57bb38b2] Successfully reverted task state from reboot_started_hard on failure for instance.

  • The qemu-kvm process with PID 131611 is in a defunct (zombie) state, and its parent process is systemd (PID 1).

$ ps -ef | egrep 'PID | 131611'
UID         PID   PPID  C STIME TTY          TIME CMD
qemu     131611      1 99  2019 ?        274-21:21:49 [qemu-kvm] [defunct]
$ ps -ef | head -2
UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  0  2018 ?        15:58:57 /usr/lib/systemd/systemd --switched-root --system --deserialize 22

  • The libvirtd service logs show that force-killing the qemu-kvm process with SIGKILL failed. The termination fails because the process is a zombie: it has already exited or been killed, and all that remains of it is its entry in the process table.

$ sudo journalctl -u libvirtd
Jan 19 06:55:57 org.com libvirtd[98498]: 2020-01-19 11:55:57.336+0000: 98502: error : qemuAgentSend:930 : Guest agent is not responding: Guest agent not available for now
Jan 19 06:58:12 org.com libvirtd[98498]: 2020-01-19 11:58:12.634+0000: 98499: error : virProcessKillPainfully:401 : Failed to terminate process 131611 with SIGKILL: Device or resource busy
Jan 19 06:58:27 org.com libvirtd[98498]: 2020-01-19 11:58:27.653+0000: 98501: error : virProcessKillPainfully:401 : Failed to terminate process 131611 with SIGKILL: Device or resource busy
Jan 19 06:58:42 org.com libvirtd[98498]: 2020-01-19 11:58:42.666+0000: 98504: error : virProcessKillPainfully:401 : Failed to terminate process 131611 with SIGKILL: Device or resource busy
Jan 21 08:17:31 org.com libvirtd[98498]: 2020-01-21 13:17:31.987+0000: 98501: error : virProcessKillPainfully:401 : Failed to terminate process 131611 with SIGKILL: Device or resource busy
Jan 21 08:17:31 org.com libvirtd[98498]: 2020-01-21 13:17:31.989+0000: 98500: error : qemuDomainObjBeginJobInternal:4721 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMemoryStats)
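
The checks above can be sketched as two small shell filters. This is an illustrative sketch, not part of any Platform9 or Libvirt tooling: the function names stuck_pids and list_zombies are hypothetical, and the filters assume the log wording and ps column order shown in this article.

```shell
#!/bin/sh
# Hypothetical helper: read libvirtd journal text on stdin and print the
# unique PIDs that libvirtd reported it could not terminate with SIGKILL.
stuck_pids() {
  sed -n 's/.*Failed to terminate process \([0-9][0-9]*\) with SIGKILL.*/\1/p' | sort -u
}

# Hypothetical helper: read rows in the column order
#   PID PPID STAT COMMAND
# on stdin and print "PID PPID COMMAND" for every row whose state starts
# with Z (zombie / defunct).
list_zombies() {
  awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# On a live host the filters would be fed like this (not executed here):
#   sudo journalctl -u libvirtd | stuck_pids
#   ps -eo pid=,ppid=,stat=,comm= | list_zombies
```

A PID that appears in both outputs matches the situation described in this article: Libvirt keeps retrying SIGKILL against a process that is already defunct.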

Environment

  • Platform9 Managed OpenStack - All Versions

  • Nova

  • Libvirt/QEMU

Cause

There are two possible causes. First, the host may be overloaded, so the kernel is unable to clean up the process within the time that Libvirt is prepared to wait. In that case, the process should go away on its own after a short while and everything should return to normal. Second, the process may be stuck in an uninterruptible wait state because of some configuration in the storage stack that causes an I/O read or write operation to hang in kernel space. The process then stays in the zombie state indefinitely, or until the storage problem is resolved.
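
To help tell the two cases apart, the process state letter reported by ps can be inspected. The following is a minimal sketch; describe_state is a hypothetical helper, and it only interprets the standard Linux state letters Z (zombie) and D (uninterruptible sleep). Checking individual threads, as in the trailing comment, is a suggestion and not a step taken from this article.

```shell
#!/bin/sh
# Hypothetical helper: map a ps STAT value (for example "Zl" or "D") to a
# short description. Only the first letter of the value is significant here.
describe_state() {
  case "$1" in
    Z*) echo "zombie: already exited, waiting to be reaped" ;;
    D*) echo "uninterruptible sleep: usually blocked on I/O in the kernel" ;;
    "") echo "no such process" ;;
    *)  echo "other state: $1" ;;
  esac
}

# On a live host (not executed here), using the PID from this article:
#   describe_state "$(ps -o stat= -p 131611)"
# Per-thread view, useful if some threads are still blocked in the kernel:
#   ps -L -o tid,stat,wchan:32 -p 131611
```

If the empty-string case is reported on a later check, the process has gone away on its own, which is what the overloaded-host explanation predicts.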

Resolution

  1. Send a signal to the parent process, which in this case is PID 1 (systemd), telling it to clean up its dead child process: # kill -s SIGCHLD 1. For more options, refer to How To Clean Zombie Processes.

  2. Reboot the host.
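
The first step can be wrapped in a small helper that defaults to a dry run, so the command can be reviewed before it is sent. This is a sketch under assumptions: nudge_parent and the DRY_RUN variable are hypothetical names, and the signal is written as CHLD, the portable spelling of the SIGCHLD name used above.

```shell
#!/bin/sh
# Hypothetical helper: ask a parent process to reap its dead children by
# sending it SIGCHLD. With DRY_RUN=1 (the default in this sketch) the
# command is only printed, not executed.
nudge_parent() {
  parent_pid="$1"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "kill -s CHLD $parent_pid"
  else
    kill -s CHLD "$parent_pid"
  fi
}

# Dry run for the case in this article, where the parent is PID 1:
#   nudge_parent 1
# To actually send the signal (as root):
#   DRY_RUN=0 nudge_parent 1
```

If the defunct process is still listed by ps afterwards, the signal did not help and the second option, rebooting the host, remains.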

Additional Information

Red Hat Bug
