Slow or Failed Live Migrations for Large Volume Backed Instances
Problem
Live migrations of large volume-backed instances are slow or fail, and the Nova Compute logs indicate memory issues.
Environment
- Platform9 Managed OpenStack - All Versions
- Nova
Cause
With extremely memory-write-intensive workloads, a normal live migration never completes because the guest writes to memory faster than QEMU can transfer the memory changes to the destination system. In this case the migration continues indefinitely, never making enough progress to stop the guest and proceed to the non-live "finishing up" phase of migration.
This scenario, in which the ostackhost logs show the instance's memory being transferred continuously while the migration never completes, is recognized in this upstream commit:
If the guest is dirtying memory quicker than the network can transfer it, it is entirely possible that a migration will never complete. In such a case it is highly undesirable to leave it running forever since it wastes valuable CPU and network resources.
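The non-convergence described above can be illustrated with a deliberately simplified model of iterative pre-copy. This is a sketch, not QEMU's actual algorithm: the function name, the RAM size, and the rates are hypothetical values chosen for illustration. Each round sends the current backlog, and whatever the guest dirties during that time becomes the next backlog.

```python
def precopy_remaining(ram_mb, dirty_mb_s, bandwidth_mb_s, rounds):
    """Return the MB left to copy after each pre-copy round (simplified model)."""
    remaining = float(ram_mb)  # the first round has to send all guest RAM
    history = []
    for _ in range(rounds):
        seconds = remaining / bandwidth_mb_s  # time needed to send the backlog
        # Memory dirtied while sending becomes the next round's backlog,
        # capped at total RAM because a page is only counted as dirty once.
        remaining = min(float(ram_mb), dirty_mb_s * seconds)
        history.append(remaining)
    return history


# Hypothetical 64 GiB guest on a link that moves about 1000 MB/s.
slow_writer = precopy_remaining(ram_mb=65536, dirty_mb_s=200,
                                bandwidth_mb_s=1000, rounds=5)
fast_writer = precopy_remaining(ram_mb=65536, dirty_mb_s=1200,
                                bandwidth_mb_s=1000, rounds=5)

print("dirty rate below bandwidth:", [round(x) for x in slow_writer])
print("dirty rate above bandwidth:", [round(x) for x in fast_writer])
```

In this model the backlog shrinks every round only when the dirty rate is below the transfer rate. When the guest dirties memory faster than the link can carry it, the backlog stays at the full RAM size no matter how many rounds run.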
As part of that commit, the live_migration_progress_timeout option was introduced to abort a migration that was not making progress. However, the option was later deprecated (re: 1644248) because the data_remaining parameter was found to be unreliable. The OpenStack commit explicitly states that this option should not be changed and that it may be re-enabled in future releases.
Resolution
One option to work around the issue is to cold-migrate the instance in a shutoff state. Another option is to enable auto-convergence for instances in the Nova configuration files, as described in the Nova Documentation. When pre-enabled on the hypervisors, this option slows down the instance's CPU if the migration is unlikely to complete, until the memory copy process is faster than the instance's memory writes.
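The effect of auto-convergence can be sketched with a toy model. Everything here is an assumption made for illustration: the function name, the rule that throttling increases whenever a round fails to halve the backlog, and the throttle steps (20% initially, then +10% per step, capped at 99%), which are chosen to resemble QEMU's CPU throttle parameters rather than taken from this article.

```python
def rounds_to_converge(ram_mb, dirty_mb_s, bandwidth_mb_s, auto_converge,
                       threshold_mb=100.0, max_rounds=50,
                       initial_throttle=0.20, throttle_step=0.10,
                       max_throttle=0.99):
    """Return the round in which the backlog drops below threshold_mb, or None."""
    remaining = float(ram_mb)
    throttle = 0.0  # fraction of guest CPU time taken away
    for round_no in range(1, max_rounds + 1):
        seconds = remaining / bandwidth_mb_s
        # Assumption: a throttled guest dirties memory proportionally slower.
        effective_dirty = dirty_mb_s * (1.0 - throttle)
        new_remaining = min(float(ram_mb), effective_dirty * seconds)
        if new_remaining <= threshold_mb:
            return round_no  # small enough to pause the guest and finish
        # Assumption: throttle harder whenever the backlog was not halved.
        if auto_converge and new_remaining > remaining / 2:
            if throttle == 0.0:
                throttle = initial_throttle
            else:
                throttle = min(max_throttle, throttle + throttle_step)
        remaining = new_remaining
    return None  # never converged within max_rounds


# Hypothetical guest dirtying 1200 MB/s over a 1000 MB/s link.
without_ac = rounds_to_converge(65536, 1200, 1000, auto_converge=False)
with_ac = rounds_to_converge(65536, 1200, 1000, auto_converge=True)
print("without auto-converge:", without_ac)
print("with auto-converge:", with_ac)
```

In this model the unthrottled migration never finishes, while the throttled one does, at the cost of the guest running with reduced CPU time for the duration of the migration.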
To enable auto-convergence on the hypervisor:
- Set live_migration_permit_auto_converge=true in the nova_override.conf file.
- Restart pf9-ostackhost service for the changes to take effect.
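As a sketch of the first step, the override file would contain something like the fragment below. It assumes the option belongs under the [libvirt] section, which is where upstream Nova defines live_migration_permit_auto_converge. The location of nova_override.conf depends on the deployment and is not specified here.

```ini
# nova_override.conf (location depends on the deployment)
[libvirt]
# Let libvirt/QEMU throttle the guest CPU when a live migration
# is not converging on its own.
live_migration_permit_auto_converge = true
```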
Note: This change must be made on all hypervisors. Keep in mind that a possible downside of auto-convergence is that it slows down the instance, which affects the performance of the application running inside it during the live migration.