Image Creation Fails for Volumes Larger Than 100GB

Problem

When converting large volumes (100GB–400GB or more) from the block device service into images in the image service, the operation may fail or result in an image being created with zero size.

Environment

  • Private Cloud Director Virtualization - v2025.10

  • Self-Hosted Private Cloud Director Virtualization - v2025.10

  • Component - Image Service

Cause

The issue arises from a combination of buffering behavior and timeout limitations during the volume-to-image conversion process. When a large upload is initiated, the web server first buffers the incoming data temporarily before forwarding it to the image service. The image service, in turn, writes this data to a staging area before finally committing it to backend storage. Although moving the temporary buffering path to a network file system helps reduce local disk exhaustion, it does not eliminate the dependency on internal timeout limits for handling large uploads.

One of the primary causes is that the default timeout between the block device service and the image service, typically around 600 seconds, is too short for larger data transfers. The client connection can therefore be interrupted mid-upload, producing errors such as a disconnection while sending data. The upload is left incomplete, and the image may be marked as successfully created yet have zero size because only part of the data was transferred.

The following error repeats in /var/log/pf9/cindervolume-base.log:
INFO cinder.image.pf9_glance [req-[REQ_UUID] None [TENANT]] Exception https://[IMAGE_HOST_IP]:9494 calling 'upload' with args ('[IMAGE_UUID]', <_io.BufferedReader name='/opt/pf9/pf9-cindervolume-base/state/mnt/[UUID]/volume-[VOLUME_UUID]'>), {}: Error communicating with https://[IMAGE_HOST_IP]:9494/v2/images/[IMAGE_UUID]/file: HTTPSConnectionPool(host='[IMAGE_HOST_IP]', port=9494): Read timed out. (read timeout=600.0)
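
To confirm that a host is hitting this failure, the log can be scanned for the message shown above. The following is a minimal sketch, not part of the product: the function names are hypothetical, and the pattern assumes the message format matches the excerpt exactly.

```python
import re

# Matches the read-timeout message shown in the log excerpt above.
TIMEOUT_RE = re.compile(r"Read timed out\. \(read timeout=(?P<seconds>[0-9.]+)\)")


def find_upload_timeouts(lines):
    """Return the read-timeout value (seconds) of each failed image upload."""
    hits = []
    for line in lines:
        # Only consider failures raised while calling the image 'upload' API.
        if "calling 'upload'" not in line:
            continue
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append(float(match.group("seconds")))
    return hits


def scan_log(path="/var/log/pf9/cindervolume-base.log"):
    """Scan the block device service log and return all upload timeouts found."""
    with open(path, errors="replace") as handle:
        return find_upload_timeouts(handle)
```

A non-empty result from scan_log() suggests that uploads are being cut off at the configured timeout, and the returned values show which timeout was in effect.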

Workaround

Update the web server proxy timeout settings to align with the block device service and image service timeout values, ensuring that the entire request flow can support longer upload durations without interruption.

| Volume Size | Expected Upload Time* | Recommended Timeout | Safety Buffer |
|-------------|-----------------------|---------------------|---------------|
| 10GB        | 5-10 minutes          | 1800s (30 min)      | 3x buffer     |
| 50GB        | 15-25 minutes         | 3600s (1 hour)      | 2.5x buffer   |
| 100GB       | 30-45 minutes         | 7200s (2 hours)     | 2.5x buffer   |
| 200GB       | 60-90 minutes         | 14400s (4 hours)    | 2.5x buffer   |
| 250GB       | 75-110 minutes        | 14400s (4 hours)    | 2x buffer     |
| 500GB       | 150-220 minutes       | 21600s (6 hours)    | 2x buffer     |

*Based on 100 MB/s network speed with 15% overhead
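
When choosing a value for a volume whose size falls between two rows, one conservative option is to take the next larger row. The sketch below encodes the recommended timeouts from the table and applies that rule. The function name and the round-up rule are assumptions made for illustration, not documented product behavior.

```python
import bisect

# (volume size in GB, recommended timeout in seconds), copied from the table above.
RECOMMENDED_TIMEOUTS = [
    (10, 1800),
    (50, 3600),
    (100, 7200),
    (200, 14400),
    (250, 14400),
    (500, 21600),
]


def recommended_timeout(volume_gb):
    """Return the timeout of the smallest table row that covers volume_gb."""
    sizes = [size for size, _ in RECOMMENDED_TIMEOUTS]
    index = bisect.bisect_left(sizes, volume_gb)
    if index == len(sizes):
        # The table does not cover volumes above 500GB; do not extrapolate.
        raise ValueError(f"no recommendation for volumes larger than {sizes[-1]}GB")
    return RECOMMENDED_TIMEOUTS[index][1]
```

For example, recommended_timeout(120) selects the 200GB row and returns 14400.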

1. Back up the existing configuration

Back up the existing configuration on the Image Library host.

Back up the existing configuration on the Persistent Storage host.
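
A timestamped copy next to the original file is one simple way to take these backups. The helper below is a hypothetical sketch: backup_file is not a product tool, and the files to back up (for example cinder_override.conf and pf9_glance.py) must be passed in with their actual paths on each host.

```python
import shutil
import time
from pathlib import Path


def backup_file(path, suffix=None):
    """Copy path to '<name>.bak-<suffix>' in the same directory and return the copy."""
    source = Path(path)
    if suffix is None:
        # Default suffix is a local timestamp such as 20250101-120000.
        suffix = time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"{source.name}.bak-{suffix}")
    if target.exists():
        # Refuse to overwrite an earlier backup.
        raise FileExistsError(target)
    # copy2 preserves timestamps and permission bits along with the content.
    shutil.copy2(source, target)
    return target
```

Keeping the backup in the same directory makes it easy to restore the file if a change needs to be rolled back.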

2. Increase the image upload timeout

On the Persistent Storage host, check whether a cinder_override.conf file is present. If it is, update the file to add the timeout. If not, create a new file and add the timeout values.
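
The "update if present, otherwise create" logic can be sketched as follows. This is an illustration only: set_override is a hypothetical helper, the option name and section must be supplied by the operator because this article does not state them, and the sketch assumes the file uses plain INI syntax. Note that configparser drops comments and collapses repeated options when it rewrites a file, so back up the file first and review the result.

```python
import configparser
from pathlib import Path


def set_override(path, option, value, section="DEFAULT"):
    """Create or update an INI override file so that option = value in section."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep option names exactly as written
    target = Path(path)
    if target.exists():
        # Preserve the settings that are already in the override file.
        parser.read(target)
    if section != "DEFAULT" and not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, option, str(value))
    with target.open("w") as handle:
        parser.write(handle)
    return target
```

A call such as set_override(path_to_override, option_name, 7200) would then cover both cases described in this step, with path_to_override and option_name standing in for the real path and timeout option.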

3. Redirect the Nginx temporary buffer to shared storage

On the host with the Image Library role, update

4. Add the timeout parameter in pf9_glance.py

Update the pf9_glance.py file on the Persistent Storage host and add the code that is highlighted in the code block below:

After applying the above changes, restart the service.

Additional Information

Nginx temporary buffering requires sufficient disk space, and it is recommended to use an NFS-backed path to prevent exhaustion of the local filesystem during large uploads. The image service staging area will continue to generate temporary files throughout the upload process, which is expected behavior, particularly for large volume-to-image operations.
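
Before starting a large upload, it can help to check that the buffering path has enough free space. The sketch below is an assumption-laden illustration: has_room_for_upload is a hypothetical helper, the buffer path must be supplied by the operator, and the 20% headroom factor is an arbitrary choice rather than a documented requirement.

```python
import shutil


def has_room_for_upload(buffer_path, volume_gb, headroom=1.2):
    """Return True if buffer_path has free space for volume_gb plus headroom."""
    # Treat 1GB as 1024**3 bytes and round the requirement down to whole bytes.
    required_bytes = int(volume_gb * 1024**3 * headroom)
    free_bytes = shutil.disk_usage(buffer_path).free
    return free_bytes >= required_bytes
```

Running the same check against the image service staging area gives an early warning when either location is too small for the volume being converted.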

Currently, manual code changes are used as a workaround. These changes will NOT persist across upgrades, because they are overwritten during package updates, and must be reapplied after each upgrade if the issue persists. The Engineering team is actively working on a permanent fix under PCD-5923, which is planned for a future release.
