Troubleshooting Heat Stack Issues
Problem
This guide provides step-by-step instructions for troubleshooting and resolving stack issues in Private Cloud Director.
Environment
- Private Cloud Director Virtualization – v2025.4 and Higher
- Self-Hosted Private Cloud Director Virtualization – v2025.4 and Higher
Procedure
When troubleshooting stack issues, follow these steps:
1. Identify the Stack Status
$ openstack stack list
Look for statuses such as CREATE_IN_PROGRESS, CREATE_FAILED, or ROLLBACK_IN_PROGRESS.
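On a project with many stacks, it can help to narrow the list to the entries that need attention. The following is a minimal sketch, assuming the list is requested with `-f value -c "Stack Name" -c "Stack Status"` so that each line holds a stack name followed by its status. `filter_problem_stacks` is a hypothetical helper name, not part of the CLI.

```shell
# Hypothetical helper: keep only stacks whose status ends in _FAILED or
# _IN_PROGRESS. Reads "name status" pairs on stdin, one stack per line.
filter_problem_stacks() {
  awk '$2 ~ /(_FAILED|_IN_PROGRESS)$/ { print $1 ": " $2 }'
}

# Usage (needs the OpenStack CLI and credentials):
#   openstack stack list -f value -c "Stack Name" -c "Stack Status" \
#     | filter_problem_stacks
```

A stack that stays in an IN_PROGRESS state across repeated runs is a candidate for the event and resource checks in the later steps.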
2. Get Stack Information
$ openstack stack show <stack_name or id>
Review stack parameters, outputs, and overall status.
Parameters: Ensure required inputs like image name, flavor, or network ID are correct and exist.
Outputs: Confirm expected outputs (like IP addresses or resource IDs) are present. Missing outputs may indicate failed resource creation.
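The output check can be scripted when a template is expected to produce a known set of outputs. This is a sketch under assumptions: it assumes `openstack stack output list <stack_name or id> -f value -c output_key` prints one output key per line, and `missing_outputs` and the key `server_ip` are hypothetical names used only for illustration.

```shell
# Hypothetical helper: print each expected output key that is absent.
# Arguments: the output keys the template is expected to produce.
# stdin:     the keys the stack actually reports, one per line.
missing_outputs() {
  actual="$(cat)"
  for key in "$@"; do
    # -x matches the whole line, -F treats the key as a literal string
    printf '%s\n' "$actual" | grep -qxF -- "$key" || printf '%s\n' "$key"
  done
}

# Usage (needs the OpenStack CLI and credentials; server_ip is an example key):
#   openstack stack output list <stack_name or id> -f value -c output_key \
#     | missing_outputs server_ip
```

Any key the helper prints is an output the stack did not report, which points at the resource that should have produced it.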
3. Check Stack Events for Failures
$ openstack stack event list <stack_name or id>
4. Inspect Individual Resource Status
Identify which resource caused the failure. Find out if any resource is stuck in CREATE_IN_PROGRESS or CREATE_FAILED.
$ openstack stack resource list <stack_name>
$ openstack stack resource show <stack_name> <failed_resource_name>
Sample output:
$ openstack stack resource list <stack_name>
+----------------+------------------+------------------+---------------------+
| resource_name  | resource_type    | resource_status  | updated_time        |
+----------------+------------------+------------------+---------------------+
| my_instance    | OS::Nova::Server | CREATE_FAILED    | [TIMESTAMP]         |
| my_network     | OS::Neutron::Net | CREATE_COMPLETE  | [TIMESTAMP]         |
+----------------+------------------+------------------+---------------------+
$ openstack stack resource show <stack_name> my_instance
attributes: null
creation_time: '[TIMESTAMP]'
logical_resource_id: my_instance
physical_resource_id: [RESOURCE_ID]
resource_action: CREATE
resource_name: my_instance
resource_status: CREATE_FAILED
resource_status_reason: >
  Resource creation failed: Quota exceeded for cores: Requested 4, but available 2.
resource_type: OS::Nova::Server
required_by:
- my_instance_floating_ip
updated_time: '[TIMESTAMP]'
5. Check Where the Stack Failed
Identify the exact resource(s) and reason(s) for failure during stack creation.
$ openstack stack failures list <stack_id>
Sample output:
Resource: my_instance Status: CREATE_FAILED Reason: Image <ubuntu-20.04> could not be found.
6. Check Heat Component Pod Status and Logs
This step is applicable only for self-hosted PCD environments.
Check if the Heat components are running:
$ kubectl get pods -n <workload-region> | grep heat
heat-api-xxxxxxxxxx-xxxxx      1/1   Running   0   <age>
heat-cfn-xxxxxxxxxx-xxxxx      1/1   Running   0   <age>
heat-engine-xxxxxxxxxx-xxxxx   1/1   Running   0   <age>
To check for errors related to the stack or resource ID in the logs, run:
$ kubectl logs -n <workload-region> <heat-engine-pod-name> | grep -i <stack-id or resource-id>
$ kubectl logs -n <workload-region> <heat-api-pod-name> | grep -i <stack-id or resource-id>
Look for stack tracebacks or API-related errors.
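A plain `grep -i` on the ID also returns routine informational lines. The sketch below narrows the result to lines that mention the ID and also contain `ERROR` or `Traceback`. It assumes those markers appear in the Heat logs of your deployment; `stack_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: keep log lines that mention the given ID and also
# contain ERROR or Traceback.
# $1:    stack ID or resource ID (matched case-insensitively, as a literal)
# stdin: log text, for example from `kubectl logs`
stack_errors() {
  grep -iF -- "$1" | grep -E 'ERROR|Traceback'
}

# Usage (needs kubectl access to the self-hosted cluster):
#   kubectl logs -n <workload-region> <heat-engine-pod-name> \
#     | stack_errors <stack-id or resource-id>
```

Multi-line tracebacks usually carry the ID only on their first line, so once a match is found it may be worth viewing the surrounding lines, for example with `grep -A`.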
7. Validate Stack Template
$ openstack orchestration template validate -t <template_file.yaml>
Ensure the template syntax is correct before deployment.
8. Check Quotas
$ openstack quota show <project_id>
Verify whether quotas are causing resource creation failures.
Check both compute and network quotas for the project, and compare them against the requested values in the stack.
Key quotas to check:
Compute:
- vCPUs: Requested vCPUs ≤ Available vCPUs
- RAM: Requested RAM ≤ Available RAM
- Instances: Total number of VMs within allowed limit
Network:
- Ports: Requested number of ports ≤ Available quota
- Security Groups: Total security groups ≤ quota
- Floating IPs: Requested number ≤ quota
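Each comparison in this list reduces to the same arithmetic: the amount already in use plus the amount the stack requests must not exceed the quota limit, where a limit of -1 means unlimited. The sketch below illustrates that check with made-up numbers chosen to match the earlier sample error (4 cores requested, 2 available). `fits_quota` is a hypothetical helper, and the values are assumed to be read manually from the quota and usage output for the project.

```shell
# Hypothetical helper: succeed when a request fits within the remaining quota.
# $1: amount requested by the stack
# $2: amount already in use in the project
# $3: quota limit (-1 means unlimited in OpenStack)
fits_quota() {
  requested=$1
  used=$2
  limit=$3
  if [ "$limit" -lt 0 ]; then
    return 0
  fi
  [ $((used + requested)) -le "$limit" ]
}

# Illustrative numbers: limit of 8 cores, 6 in use, stack requests 4.
if fits_quota 4 6 8; then
  echo "cores: request fits"
else
  echo "cores: request exceeds quota"   # this branch runs: 6 + 4 > 8
fi
```

The same call works for RAM, instances, ports, security groups, and floating IPs once the three numbers for that quota are known.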
9. Confirm Resource Availability
$ openstack image list
$ openstack flavor list
Ensure the referenced images and flavors exist.
10. Check Network Connectivity
$ openstack network agent list
$ openstack network show <network_id>
Ensure network components are operational and properly configured.
11. Mark Failed Resource as Unhealthy and Attempt Stack Update
✅ Update the stack only if the issue is minor, such as a typo or a missing value, and the rest of the stack is working correctly.
❌ Avoid updating if the stack has critical resource failures or dependencies that may cause cascading issues.
If applicable, update the stack with corrected parameters. If the issue persists, consider deleting and redeploying the stack.
$ openstack stack resource mark unhealthy <stack_name> <resource_name>
$ openstack stack update --existing --template <template_file.yaml> <stack_name>
If these steps do not resolve the issue, please contact the Platform9 Support Team for further assistance.
Most common causes:
- Template syntax errors (YAML/JSON issues, missing parameters)
- Resource conflicts (duplicate names, unavailable images)
- Quota limits exceeded (compute, network, or storage)
- Networking issues (missing subnets, no floating IPs)
- Heat engine service failures
- API rate limits exceeded
- Delays in dependent resource creation
- Authentication failures (expired tokens, invalid credentials)