# Slow or Failed Live Migrations for Large Volume Backed Instances

## Problem

Live migrations of large volume-backed instances are slow or fail outright. The Nova Compute logs show the memory copy running for hours without completing.

{% tabs %}
{% tab title="None" %}

```none
INFO nova.compute.manager [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Took 5.12 seconds for pre_live_migration on destination host xyzxyzxy-abd5-48d3-990b-399cae5ae911.
INFO nova.virt.libvirt.migration [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Increasing downtime to 50 ms after 0 sec elapsed time
...
INFO nova.virt.libvirt.driver [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Migration running for 60 secs, memory 52% remaining; (bytes processed=7190652899, remaining=9001529344, total=17184923648)
...
INFO nova.virt.libvirt.driver [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Migration running for 12180 secs, memory 5% remaining; (bytes processed=1463919506985, remaining=831111168, total=17184923648)
INFO nova.virt.libvirt.driver [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Data remaining 831111168 bytes, low watermark 2117632 bytes 12622 seconds ago
WARNING nova.virt.libvirt.migration [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] Live migration not completed after 12800 sec
```

{% endtab %}
{% endtabs %}
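The progress messages above can be checked mechanically. The following is a minimal sketch, not part of Nova or Platform9 tooling: it parses the `Migration running for ...` lines and compares the bytes processed against the guest's total memory. The function names and the `max_passes` threshold are hypothetical and chosen only for illustration.

```python
import re

# Matches the nova.virt.libvirt.driver progress message shown in the logs above.
PROGRESS_RE = re.compile(
    r"Migration running for (?P<secs>\d+) secs, memory (?P<pct>\d+)% remaining; "
    r"\(bytes processed=(?P<processed>\d+), remaining=(?P<remaining>\d+), "
    r"total=(?P<total>\d+)\)"
)

# Sample line copied from the log excerpt in this article.
LOG_LINE = (
    "INFO nova.virt.libvirt.driver [-] [instance: abcdefgh-1ba4-4403-bf72-a86907034a14] "
    "Migration running for 12180 secs, memory 5% remaining; "
    "(bytes processed=1463919506985, remaining=831111168, total=17184923648)"
)


def parse_progress(line):
    """Return the progress fields of a log line as ints, or None if it does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def copy_passes(sample):
    """Approximate how many times the guest's memory size has been sent so far."""
    return sample["processed"] / sample["total"]


def looks_stuck(sample, max_passes=10.0):
    """Hypothetical heuristic: data is still remaining although the guest's
    memory size has already been transferred more than max_passes times."""
    return sample["remaining"] > 0 and copy_passes(sample) > max_passes


if __name__ == "__main__":
    sample = parse_progress(LOG_LINE)
    print(f"passes={copy_passes(sample):.1f} stuck={looks_stuck(sample)}")
```

In the excerpt, the bytes processed after 12180 seconds are many times the instance's total memory, which is the signature of a guest dirtying pages faster than they can be copied.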

## Environment

* Platform9 Managed OpenStack - All Versions
* Nova

## Cause

With extremely memory-write-intensive workloads, a normal live migration never completes because the guest writes to memory faster than QEMU can transfer the changes to the destination system. In this case the migration continues indefinitely, never making enough progress to stop the guest and proceed to the non-live "finishing up" phase of migration.

The ostackhost logs show this pattern: the instance's memory is transferred continuously, yet the migration never completes. The scenario is recognized in this [upstream commit](https://review.opendev.org/c/openstack/nova/+/162254).

{% hint style="info" %}
**Info**

If the guest is dirtying memory quicker than the network can transfer it, it is entirely possible that a migration will never complete. In such a case it is highly undesirable to leave it running forever since it wastes valuable CPU and network resources.
{% endhint %}
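The convergence condition can be illustrated with a toy model. This is a simplified sketch, not QEMU's actual algorithm: it assumes constant transfer and dirty rates, and all parameter names and values are hypothetical. Each round sends the data that is currently outstanding, and whatever the guest dirties during that round has to be sent in the next one.

```python
def simulate_precopy(total_bytes, transfer_rate, dirty_rate,
                     downtime_threshold, max_rounds=30):
    """Toy model of iterative pre-copy migration.

    Rates are in bytes per second. Returns (history, converged), where history
    holds the bytes remaining after each round and converged says whether the
    remaining data dropped to downtime_threshold within max_rounds.
    """
    remaining = float(total_bytes)
    history = []
    for _ in range(max_rounds):
        # Time needed to send everything that is currently outstanding.
        round_secs = remaining / transfer_rate
        # Memory dirtied during the round, capped at the guest's total memory.
        remaining = min(dirty_rate * round_secs, float(total_bytes))
        history.append(remaining)
        if remaining <= downtime_threshold:
            return history, True
    return history, False


if __name__ == "__main__":
    sixteen_gib = 16 * 1024 ** 3
    # Guest dirties memory at half the transfer rate: remaining data halves each round.
    _, ok = simulate_precopy(sixteen_gib, 1e9, 5e8, 50e6)
    print("dirty rate below transfer rate, converged:", ok)
    # Guest dirties memory faster than it can be sent: remaining data never shrinks.
    _, ok = simulate_precopy(sixteen_gib, 1e9, 1.2e9, 50e6)
    print("dirty rate above transfer rate, converged:", ok)
```

In this model the remaining data shrinks by the factor `dirty_rate / transfer_rate` per round, so it only converges when that ratio is below 1. Auto-convergence works on the same ratio by throttling the guest's CPU to lower the dirty rate.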

As part of that commit, the `live_migration_progress_timeout` option was introduced to abort a migration that stops making progress. However, the option was later deprecated (see bug [1644248](https://bugs.launchpad.net/nova/+bug/1644248)) because the `data_remaining` parameter was found to be *unreliable*. This [OpenStack commit](https://opendev.org/openstack/nova/commit/f34877f936aa6486dd454495178c3ba49123785d) states explicitly that the option should not be changed and that it *may* be re-enabled in future releases.

## Resolution

One workaround is to cold-migrate the instance while it is in a shutoff state. Another is to enable auto-convergence for instances in the Nova configuration files, as described in the [Nova Documentation](https://docs.openstack.org/nova/pike/admin/live-migration-usage.html). When this option is pre-enabled on the hypervisors and a migration is unlikely to complete, it slows down the instance's CPU until the memory copy process is faster than the instance's memory writes.

To enable auto-convergence on the hypervisor:

1. Set `live_migration_permit_auto_converge = True` in the `[libvirt]` section of the *nova\_override.conf* file.

{% tabs %}
{% tab title="None" %}

```none
$ cat /opt/pf9/etc/nova/conf.d/nova_override.conf | grep auto_converge
live_migration_permit_auto_converge = True
```

{% endtab %}
{% endtabs %}

2. Restart the *pf9-ostackhost* service for the change to take effect.

{% tabs %}
{% tab title="None" %}

```none
$ systemctl restart pf9-ostackhost
```

{% endtab %}
{% endtabs %}

**Note:** This change must be made on all hypervisors. Keep in mind that auto-convergence works by slowing down the instance, so the application running inside it will see reduced performance for the duration of the live migration.
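
Because the setting has to be present on every hypervisor, a small check can help confirm it after editing the file. The sketch below is illustrative only. It assumes the option sits in the `[libvirt]` section, where upstream Nova defines it, and that the file is standard INI that Python's `configparser` can read. The function name is hypothetical, and the path is the one used in step 1.

```python
import configparser

# Path used in step 1 of this article.
NOVA_OVERRIDE = "/opt/pf9/etc/nova/conf.d/nova_override.conf"


def auto_converge_enabled(conf_path=NOVA_OVERRIDE):
    """Return True if live_migration_permit_auto_converge is true under [libvirt].

    A missing file, section, or option counts as disabled (Nova's default).
    """
    # interpolation=None keeps '%' characters in other options from raising errors;
    # strict=False tolerates repeated sections or options in the file.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read(conf_path)  # silently skips files that do not exist
    return parser.getboolean(
        "libvirt", "live_migration_permit_auto_converge", fallback=False
    )


if __name__ == "__main__":
    state = "enabled" if auto_converge_enabled() else "disabled"
    print(f"auto-convergence is {state} in {NOVA_OVERRIDE}")
```

Run it on each hypervisor after step 2. The check only reads the file, so it does not confirm that the running service has picked up the change.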
