Troubleshooting Heat Stack Issues
Problem
This guide provides step-by-step instructions for troubleshooting and resolving stack issues in Private Cloud Director.
Environment
- Private Cloud Director Virtualization – v2025.4 and Higher
- Self-Hosted Private Cloud Director Virtualization – v2025.4 and Higher
Procedure
When troubleshooting stack issues, follow these steps:
1. Identify the Stack Status
$ openstack stack list
Look for statuses like CREATE_IN_PROGRESS, CREATE_FAILED, or ROLLBACK_IN_PROGRESS.
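When a project has many stacks, the list can be narrowed to the ones that need attention. The following is a minimal sketch rather than a supported tool: `filter_problem_stacks` is a hypothetical helper name, and it assumes the `Stack Name` and `Stack Status` columns are requested in that order using the `value` formatter.

```shell
# Hypothetical helper: reads "<stack name> <stack status>" pairs on stdin and
# prints stacks that failed, are still in progress, or went through a rollback.
filter_problem_stacks() {
  awk '$2 ~ /(_FAILED|_IN_PROGRESS)$/ || $2 ~ /^ROLLBACK_/ { print $1, $2 }'
}

# Example usage (assumes the column names "Stack Name" and "Stack Status"):
#   openstack stack list -f value -c "Stack Name" -c "Stack Status" | filter_problem_stacks
```

Stacks in ROLLBACK_COMPLETE are included on purpose, because that status means an earlier create or update failed and was rolled back.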
2. Get Stack Information
$ openstack stack show <stack_name or id>
Review stack parameters, outputs, and overall status.
Parameters: Ensure required inputs like image name, flavor, or network ID are correct and exist.
Outputs: Confirm expected outputs (like IP addresses or resource IDs) are present. Missing outputs may indicate failed resource creation.
3. Check Stack Events for Failures
$ openstack stack event list <stack_name or id>
4. Inspect Individual Resource Status
Identify which resource caused the failure. Find out if any resource is stuck in CREATE_IN_PROGRESS or CREATE_FAILED.
$ openstack stack resource list <stack_name>
$ openstack stack resource show <stack_name> <failed_resource_name>
Sample output:
$ openstack stack resource list <stack_name>
+----------------+------------------+------------------+---------------------+
| resource_name | resource_type | resource_status | updated_time |
+----------------+------------------+------------------+---------------------+
| my_instance | OS::Nova::Server | CREATE_FAILED | [TIMESTAMP] |
| my_network | OS::Neutron::Net | CREATE_COMPLETE | [TIMESTAMP] |
+----------------+------------------+------------------+---------------------+
$ openstack stack resource show <stack_name> my_instance
attributes: null
creation_time: '[TIMESTAMP]'
logical_resource_id: my_instance
physical_resource_id: [RESOURCE_ID]
resource_action: CREATE
resource_name: my_instance
resource_status: CREATE_FAILED
resource_status_reason: >
Resource creation failed: Quota exceeded for cores: Requested 4, but available 2.
resource_type: OS::Nova::Server
required_by:
- my_instance_floating_ip
updated_time: '[TIMESTAMP]'
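For stacks with many resources, the two commands above can be combined so that every failed resource is printed with its failure reason. This is a sketch under stated assumptions: `show_failed_resource_reasons` is a hypothetical helper name, and the column names `resource_name`, `resource_status`, and `resource_status_reason` are taken from the sample output above.

```shell
# Hypothetical helper: show_failed_resource_reasons STACK
# Lists the resources of STACK, keeps those whose status ends in _FAILED,
# and prints "<resource name>: <status reason>" for each of them.
show_failed_resource_reasons() {
  openstack stack resource list "$1" -f value -c resource_name -c resource_status |
    awk '$2 ~ /_FAILED$/ { print $1 }' |
    while read -r name; do
      # </dev/null keeps the inner command from consuming the loop's input.
      reason=$(openstack stack resource show "$1" "$name" \
                 -f value -c resource_status_reason </dev/null)
      printf '%s: %s\n' "$name" "$reason"
    done
}

# Example usage:
#   show_failed_resource_reasons <stack_name>
```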
5. Check Where the Stack Failed
Identify the exact resource(s) and reason(s) for failure during stack creation.
$ openstack stack failures list <stack_id>
Sample output:
Resource: my_instance
Status: CREATE_FAILED
Reason: Image <ubuntu-20.04> could not be found.
6. Check Heat Component Pod Status and Logs
This step is applicable only for self-hosted PCD environments.
Check if the Heat components are running:
$ kubectl get pods -n <workload-region> | grep heat
heat-api-xxxxxxxxxx-xxxxx 1/1 Running 0 <age>
heat-cfn-xxxxxxxxxx-xxxxx 1/1 Running 0 <age>
heat-engine-xxxxxxxxxx-xxxxx 1/1 Running 0 <age>
To check for errors related to the stack or resource ID in the logs, run:
$ kubectl logs -n <workload-region> <heat-engine-pod-name> | grep -i <stack-id or resource-id>
$ kubectl logs -n <workload-region> <heat-api-pod-name> | grep -i <stack-id or resource-id>
Look for stack tracebacks or API-related errors.
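The pod check can also be reduced to a filter that prints only the Heat pods needing attention. This is a minimal sketch: `unhealthy_heat_pods` is a hypothetical helper name, it assumes the pod names start with `heat-` as in the sample output above, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` lines on stdin and
# prints Heat pods that are not Running or whose ready count is below the
# container count. Pods in the Completed state (finished jobs) are skipped.
unhealthy_heat_pods() {
  awk '$1 ~ /^heat-/ && $3 != "Completed" {
         split($2, ready, "/")
         if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
       }'
}

# Example usage:
#   kubectl get pods -n <workload-region> --no-headers | unhealthy_heat_pods
```

No output means every Heat pod that was listed is Running and fully ready.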
7. Validate Stack Template
$ openstack orchestration template validate -t <template_file.yaml>
Ensure the template syntax is correct before deployment.
8. Check Quotas
$ openstack quota show <project_id>
Verify if quotas are causing resource creation failures.
Check both compute and network quotas for the project, and compare them against the requested values in the stack.
Key quotas to check:
Compute:
- vCPUs: Requested vCPUs ≤ Available vCPUs
- RAM: Requested RAM ≤ Available RAM
- Instances: Total number of VMs within allowed limit
Network:
- Ports: Requested number of ports ≤ Available quota
- Security Groups: Total security groups ≤ quota
- Floating IPs: Requested number ≤ quota
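Each comparison in the list above is the same arithmetic: the amount requested plus the amount already in use must not exceed the quota limit. The sketch below illustrates that check. `fits_quota` is a hypothetical helper name, and the in-use and limit values are assumed to be read manually from the quota and usage views available in your deployment.

```shell
# Hypothetical helper: fits_quota REQUESTED IN_USE LIMIT
# Succeeds (exit status 0) when the request fits in the remaining quota.
fits_quota() {
  # OpenStack reports an unlimited quota as -1.
  if [ "$3" -lt 0 ]; then
    return 0
  fi
  [ $(( $1 + $2 )) -le "$3" ]
}

# Example matching the sample failure in step 4: 4 cores requested while only
# 2 remain (the values 6 in use and a limit of 8 are illustrative).
#   fits_quota 4 6 8 || echo "cores quota would be exceeded"
```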
9. Confirm Resource Availability
$ openstack image list
$ openstack flavor list
Ensure the referenced images and flavors exist.
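Rather than scanning the full lists by eye, each name referenced in the template can be looked up directly. This is a sketch under one assumption: `openstack <type> show` exits with a non-zero status when no matching resource is found. The helper names `resource_exists` and `report_missing` are hypothetical.

```shell
# Hypothetical helper: resource_exists TYPE NAME_OR_ID
# Relies on `openstack <type> show` exiting non-zero when nothing matches.
resource_exists() {
  openstack "$1" show "$2" >/dev/null 2>&1
}

# Hypothetical helper: report_missing TYPE NAME...
# Prints one line for every name that cannot be found.
report_missing() {
  kind="$1"
  shift
  for name in "$@"; do
    resource_exists "$kind" "$name" || printf 'missing %s: %s\n' "$kind" "$name"
  done
}

# Example usage:
#   report_missing image <image_name>
#   report_missing flavor <flavor_name>
```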
10. Check Network Connectivity
$ openstack network agent list
$ openstack network show <network_id>
Ensure network components are operational and properly configured.
11. Mark Failed Resource as Unhealthy and Attempt Stack Update
✅ Update the stack only if the issue is small, such as a typo or a missing value, and everything else in the stack is working fine.
❌ Avoid updating if the stack has critical resource failures or dependencies that may cause cascading issues.
If applicable, update the stack with corrected parameters. If the issue persists, consider deleting and redeploying the stack.
$ openstack stack resource mark unhealthy <stack_name> <resource_name>
$ openstack stack update --existing --template <template_file.yaml> <stack_name>
If these steps do not resolve the issue, please contact the Platform9 Support Team for further assistance.
Most common causes:
- Template syntax errors (YAML/JSON issues, missing parameters)
- Resource conflicts (duplicate names, unavailable images)
- Quota limits exceeded (compute, network, or storage)
- Networking issues (missing subnets, no floating IPs)
- Heat engine service failures
- API rate limits exceeded
- Delays in dependent resource creation
- Authentication failures (expired tokens, invalid credentials)