Recovering OpenStack Instances After KVM Hypervisor Downtime
This tutorial describes how to recover OpenStack instances from a Linux/KVM hypervisor that is down or must be taken offline for maintenance.
Requirements
- Two or more hypervisors running the same KVM version, all connected to the Platform9 private cloud.
- Shared storage, such as NFS or Ceph, configured identically on all hypervisors. With this method, instance disks are not migrated from one storage location to another.
- Working OpenStack CLI installation as detailed here.
Preconditions for NFS Shared Storage
- Make sure all hypervisors in your environment are configured to use the same shared storage.
- Make sure the instance storage path is identical on all hypervisors; e.g. if you have two hypervisors “A” and “B” that mount the NFS share, the local mount point must be the same on both.
- Make sure that the user ID and group ID of the pf9 user are consistent across all hypervisors.
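The last precondition can be checked with a short script. The following is a minimal sketch, not a Platform9 tool: the function names are made up for illustration, the hostnames in the usage note are placeholders, and it assumes you can reach each hypervisor over SSH.

```bash
#!/usr/bin/env bash
# Sketch: confirm that the pf9 user has the same UID and GID on every hypervisor.
# Function names are hypothetical; SSH access to each host is an assumption.

# Prints "<host> <uid> <gid>" for each hostname passed as an argument.
collect_ids() {
    local host
    for host in "$@"; do
        # -n stops ssh from reading the caller's stdin
        printf '%s %s %s\n' "$host" \
            "$(ssh -n "$host" id -u pf9)" "$(ssh -n "$host" id -g pf9)"
    done
}

# Reads "<host> <uid> <gid>" lines on stdin and compares each line with the first.
# Prints every host that differs and returns non-zero if any mismatch was found.
check_ids_consistent() {
    awk '
        BEGIN { bad = 0 }
        NR == 1 { uid = $2; gid = $3; next }
        $2 != uid || $3 != gid {
            print "mismatch on " $1 ": uid=" $2 " gid=" $3
            bad = 1
        }
        END { exit bad }
    '
}
```

For example, `collect_ids hv-a hv-b | check_ids_consistent && echo "pf9 IDs match"` compares two hypervisors, where `hv-a` and `hv-b` stand in for your own hostnames.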
Steps
- Find the instances on the source hypervisor that need to be migrated.
[bash]openstack server list --all-projects --host <hypervisor_uuid>[/bash]
Note the instance IDs from above.
- Select a target hypervisor to migrate the instances to.
[bash]openstack compute service list --service nova-compute[/bash]
Look for hypervisors in the “up” state. Note the UUID of the target hypervisor – this will be the value in the Host column.
- Perform the migration of a single instance from source to target hypervisor.
[bash]nova evacuate <instance_uuid> <destination_hypervisor> --on-shared-storage[/bash]
Repeat for each such instance.
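When many instances are affected, the steps above can be wrapped in a loop. The following is a rough sketch under a few assumptions: the function names are invented for illustration, the OpenStack credentials are already loaded in the shell, and the installed `nova` client still accepts `--on-shared-storage` as used above. It relies on the standard `-f value -c ID` output options of the `openstack` client to print bare instance IDs.

```bash
#!/usr/bin/env bash
# Sketch: evacuate every instance from one hypervisor to a chosen target.
# Function names are hypothetical; the commands are the ones from the steps above.

# Prints the ID of each instance on the given host, one per line.
list_instances_on_host() {
    openstack server list --all-projects --host "$1" -f value -c ID
}

# Evacuates every instance found on $1 (source) to $2 (target).
# Keeps going after a failure and returns non-zero if any evacuation failed.
evacuate_host() {
    local src="$1" target="$2" id failed=0
    # Instance IDs are UUIDs without whitespace, so word splitting is safe here.
    for id in $(list_instances_on_host "$src"); do
        echo "Evacuating $id to $target"
        nova evacuate "$id" "$target" --on-shared-storage </dev/null || failed=1
    done
    return "$failed"
}
```

Running `evacuate_host <hypervisor_uuid> <destination_hypervisor>` then issues one `nova evacuate` call per instance. Reviewing the output of `list_instances_on_host` first is a reasonable precaution, since the loop acts on every instance the host reports.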
Caveats
- An instance can be recovered only if it was created through Platform9. For example, instances that Platform9 discovered rather than created cannot be recovered using this method.
OpenStack Cinder and Neutron
The above method works for instances that use Neutron-based networks and Cinder volumes. The KVM tap device created by OpenStack Neutron and the Cinder initiator are both migrated as part of the evacuate command. Note, however, that the migration only works if the neutron-networking and cinder-volume services are not impacted by the host downtime.
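Before evacuating, it can help to confirm that the volume service is still healthy. The following is a minimal sketch with a hypothetical function name; it assumes the `openstack volume service list` output includes `Host` and `State` columns, with `up` marking a healthy service. Neutron agents can be reviewed in a similar way with `openstack network agent list`.

```bash
#!/usr/bin/env bash
# Sketch: report cinder-volume services that are not in the "up" state.
# The function name is hypothetical; the column names are an assumption.

# Prints the host of each cinder-volume service whose State is not "up".
# Returns non-zero if at least one such service is found.
cinder_volume_down_hosts() {
    openstack volume service list --service cinder-volume \
        -f value -c Host -c State |
        awk 'BEGIN { bad = 0 } $2 != "up" { print $1; bad = 1 } END { exit bad }'
}
```

If `cinder_volume_down_hosts` prints nothing and returns zero, every cinder-volume service reported itself as up at the time of the check.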