Platform9

Automating Private Cloud Director: GitOps for Tenant Infrastructure with Terraform

In the previous post, I introduced the mental model for CI/CD against Private Cloud Director (PCD): a provisioning phase driven by Terraform, a configuration phase driven by Ansible, and a handoff between them through dynamic inventory. In this post, I’d like to zoom in on the provisioning phase and walk through what a GitOps pipeline for PCD infrastructure actually looks like.

The goal of this post is to get a working pipeline in front of you, end to end: the repo layout, the Terraform code, the pipeline workflow, and the reasoning behind a few choices that aren’t obvious at first.

A scoping note before we start: the pipeline in this post manages infrastructure inside a single PCD tenant. That tenant is assumed to already exist, provisioned by your cloud admin team as a separate concern. The pipeline’s credential is scoped to that tenant with the admin role. Tenant provisioning itself, with its own authorization requirements and lifecycle, is a different workflow that I’d like to cover in a future post.

This split matches how most enterprises actually divide the work. Cloud admins provision tenants as a governance boundary. Platform teams manage everything inside. It’s the same pattern VMware shops already use, with vCenter admins creating folders and projects and application teams operating within them.

Why GitOps for PCD Infrastructure

The pitch for GitOps is that Git is the source of truth, and the pipeline is the only thing that changes the target system. For the network and security baseline of a tenant, that means every network, subnet, router, security group, and flavor reference is defined in a Git repo. When somebody wants to change one of those things, they open a pull request. The pipeline plans the change, a reviewer approves, the pipeline applies it. That’s the loop.

PCD already ships audit logs that tell you what happened in the system and when. GitOps adds a second layer: it tells you how the change got requested in the first place. Who opened the PR, who approved it, what commit SHA the pipeline applied, and what run ID produced the change. Those two layers together give you a complete story, from “somebody wanted to change something” through “the system changed.”

The bigger argument, though, is just repeatability. Humans running cloud admin portals by hand get interrupted, forget steps, and make small mistakes. A pipeline doesn’t. It runs the same commands in the same order every time, which is exactly what you want for infrastructure that other people depend on.

Pro tip: All of the code in this blog post is hosted in our Platform9 – Community GitHub organization. Feel free to clone, fork, and submit pull requests to your heart’s content!

What Lives in the Repo

A minimal repo layout for PCD GitOps looks like this:

Plaintext
pcd-infrastructure/
├── .github/
│   └── workflows/
│       └── terraform.yml
├── environments/
│   └── dev/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── modules/
│   └── tenant-network-baseline/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
└── README.md

The modules/ directory holds reusable Terraform modules. tenant-network-baseline is a module that creates the standard set of network and security resources an application team needs inside their tenant: a network, a subnet, a router with an external gateway, and a default security group for web traffic. The environments/ directory holds environment-specific configurations that call those modules with different inputs. The example shows a single dev environment; production use would add a prod directory alongside it with its own terraform.tfvars.

This is a common Terraform pattern. The point of calling it out is that Git gives you environment separation for free, as long as you structure the repo with that separation in mind from the start.
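As a rough sketch of what might sit at the top of environments/dev/main.tf before any module calls: the provider source and version constraint mirror the ones used in the module later in this post, and the empty provider block assumes credentials come from the clouds.yaml entry named by the OS_CLOUD environment variable, which the workflow later in this post sets. The file contents are illustrative, not copied from the repo.

```hcl
# environments/dev/main.tf (illustrative skeleton, not the repo's actual file)

terraform {
  required_providers {
    openstack = {
      source  = "terraform-provider-openstack/openstack"
      version = "~> 3.4"
    }
  }
}

# Left empty on purpose: with no arguments set, the provider falls back to
# environment variables such as OS_CLOUD to locate its credentials.
provider "openstack" {}

# Module calls for this environment go below this line.
```

Declaring the provider source in the root configuration as well as in the module keeps Terraform from assuming a default hashicorp/ namespace for the provider block.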

The Terraform: A Tenant Network Baseline Module

Here’s what a minimal tenant-network-baseline module looks like. It takes a tenant ID and a name prefix, then creates a network and subnet, a router attached to an external network, and a default web security group that allows inbound HTTP and HTTPS.

HCL
# modules/tenant-network-baseline/main.tf

terraform {
  required_providers {
    openstack = {
      source  = "terraform-provider-openstack/openstack"
      version = "~> 3.4"
    }
  }
}

resource "openstack_networking_network_v2" "internal" {
  name           = "${var.name_prefix}-internal"
  tenant_id      = var.tenant_id
  admin_state_up = true
}

resource "openstack_networking_subnet_v2" "internal" {
  name            = "${var.name_prefix}-internal-subnet"
  network_id      = openstack_networking_network_v2.internal.id
  tenant_id       = var.tenant_id
  cidr            = var.internal_cidr
  ip_version      = 4
  dns_nameservers = var.dns_nameservers
}

resource "openstack_networking_router_v2" "router" {
  name                = "${var.name_prefix}-router"
  tenant_id           = var.tenant_id
  admin_state_up      = true
  external_network_id = var.external_network_id
}

resource "openstack_networking_router_interface_v2" "internal" {
  router_id = openstack_networking_router_v2.router.id
  subnet_id = openstack_networking_subnet_v2.internal.id
}

resource "openstack_networking_secgroup_v2" "web" {
  name        = "${var.name_prefix}-web"
  description = "Allow inbound HTTP and HTTPS"
  tenant_id   = var.tenant_id
}

resource "openstack_networking_secgroup_rule_v2" "web_http" {
  direction         = "ingress"
  ethertype         = "IPv4"
  protocol          = "tcp"
  port_range_min    = 80
  port_range_max    = 80
  remote_ip_prefix  = "0.0.0.0/0"
  security_group_id = openstack_networking_secgroup_v2.web.id
}

resource "openstack_networking_secgroup_rule_v2" "web_https" {
  direction         = "ingress"
  ethertype         = "IPv4"
  protocol          = "tcp"
  port_range_min    = 443
  port_range_max    = 443
  remote_ip_prefix  = "0.0.0.0/0"
  security_group_id = openstack_networking_secgroup_v2.web.id
}

Some notes on this module:

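The openstack provider is the one to use against PCD. PCD’s APIs are compatible with the standard OpenStack API surface, which is why this provider works. Constrain the provider version in the required_providers block so a provider update doesn’t silently change behavior between runs. Note that ~> 3.4 is a range rather than an exact pin: it accepts 3.4 and any later 3.x release. To hold every run to the same release, commit the .terraform.lock.hcl file that terraform init generates. For resource and argument details, the PCD API reference at docs.platform9.com/api-docs/ is authoritative.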

The module takes the tenant ID as an input rather than creating the tenant itself. This is the architectural split I mentioned earlier: tenant provisioning is a separate workflow, run by cloud admins with different authorization, and the platform-team pipeline operates inside an existing tenant. The credential the pipeline uses is scoped to that tenant with the admin role, which is exactly what the OpenStack provider needs to create networks, subnets, routers, and security groups.

The module also takes an external network ID rather than looking it up by name. This keeps the module portable across environments where the external network might be named differently.
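The module’s variables.tf isn’t shown in this post, so here is a minimal sketch of what it could contain. The variable names come from the main.tf above; the descriptions, the CIDR validation, and the empty default for dns_nameservers are my assumptions rather than the repo’s actual definitions.

```hcl
# modules/tenant-network-baseline/variables.tf
# Sketch only: names match main.tf, everything else is an assumption.

variable "name_prefix" {
  description = "Prefix applied to every resource name the module creates."
  type        = string
}

variable "tenant_id" {
  description = "ID of the existing PCD tenant that owns the resources."
  type        = string
}

variable "external_network_id" {
  description = "ID of the external network the router uses as its gateway."
  type        = string
}

variable "internal_cidr" {
  description = "CIDR block for the internal subnet."
  type        = string

  # Reject malformed input at plan time instead of at the API call.
  validation {
    condition     = can(cidrhost(var.internal_cidr, 0))
    error_message = "The internal_cidr value must be a valid CIDR block, for example 10.10.1.0/24."
  }
}

variable "dns_nameservers" {
  description = "DNS servers handed to instances on the internal subnet."
  type        = list(string)
  default     = [] # assumed default; callers can override it
}
```

Typing both IDs as plain strings keeps the module free of lookups, which is the portability point made above.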

The security group in this example is deliberately minimal. Real environments will want more, and you’ll probably want a base security group that every VM gets by default. Extend the module from there.
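One way a base security group could look, as a sketch that extends the module’s main.tf. The resource names base, base_ssh_internal, and base_icmp_internal are hypothetical, and the choice of SSH and ICMP from the internal subnet only is an assumption about what a reasonable baseline is, not something taken from the repo.

```hcl
# Hypothetical addition to modules/tenant-network-baseline/main.tf

resource "openstack_networking_secgroup_v2" "base" {
  name        = "${var.name_prefix}-base"
  description = "Baseline rules intended for every VM in the tenant"
  tenant_id   = var.tenant_id
}

# Allow SSH only from addresses inside the internal subnet.
resource "openstack_networking_secgroup_rule_v2" "base_ssh_internal" {
  direction         = "ingress"
  ethertype         = "IPv4"
  protocol          = "tcp"
  port_range_min    = 22
  port_range_max    = 22
  remote_ip_prefix  = var.internal_cidr
  security_group_id = openstack_networking_secgroup_v2.base.id
}

# Allow ICMP from the same range for basic reachability checks.
resource "openstack_networking_secgroup_rule_v2" "base_icmp_internal" {
  direction         = "ingress"
  ethertype         = "IPv4"
  protocol          = "icmp"
  remote_ip_prefix  = var.internal_cidr
  security_group_id = openstack_networking_secgroup_v2.base.id
}
```

Creating the group doesn’t attach it to anything; each VM still has to reference it when it’s created.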

The Pipeline: Plan on PR, Apply on Merge

Here’s the GitHub Actions workflow that runs the pipeline.

YAML
# .github/workflows/terraform.yml

name: Terraform

on:
  pull_request:
    paths:
      - 'environments/**'
      - 'modules/**'
  push:
    branches:
      - main
    paths:
      - 'environments/**'
      - 'modules/**'

jobs:
  plan:
    name: Plan
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-24.04
    steps:
      - name: Checkout
        uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@5e8dbf3c6d9deaf4193ca7a8fb23f2ac83bb6c85 # v4.0.0
        with:
          terraform_version: 1.15.0

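      - name: Write clouds.yaml
        env:
          CLOUDS_YAML: ${{ secrets.CLOUDS_YAML }}
        run: |
          # Read the secret from the environment so the shell never re-parses its contents
          mkdir -p ~/.config/openstack
          printf '%s\n' "$CLOUDS_YAML" > ~/.config/openstack/clouds.yaml
        shell: bash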

      - name: Terraform Init
        working-directory: environments/dev
        run: terraform init

      - name: Terraform Plan
        working-directory: environments/dev
        env:
          OS_CLOUD: pcd
        run: terraform plan -out=tfplan

  apply-dev:
    name: Apply (dev)
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-24.04
    steps:
      - name: Checkout
        uses: actions/checkout@1af3b93b6815bc44a9784bd300feb67ff0d1eeb3 # v6.0.0

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@5e8dbf3c6d9deaf4193ca7a8fb23f2ac83bb6c85 # v4.0.0
        with:
          terraform_version: 1.15.0

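      - name: Write clouds.yaml
        env:
          CLOUDS_YAML: ${{ secrets.CLOUDS_YAML }}
        run: |
          # Read the secret from the environment so the shell never re-parses its contents
          mkdir -p ~/.config/openstack
          printf '%s\n' "$CLOUDS_YAML" > ~/.config/openstack/clouds.yaml
        shell: bash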

      - name: Terraform Init
        working-directory: environments/dev
        run: terraform init

      - name: Terraform Apply
        working-directory: environments/dev
        env:
          OS_CLOUD: pcd
        run: terraform apply -auto-approve

Two things in this workflow are worth calling out.

Pinned runners and pinned actions

The runner is pinned to ubuntu-24.04 rather than ubuntu-latest. ubuntu-latest is a moving target, and a build that works today can break tomorrow when the label advances to a new OS version.

Every third-party action is pinned by full commit SHA, with the version it corresponds to in a trailing comment. The trailing comment is human-only metadata; GitHub Actions resolves the SHA, not the comment.

The reason this matters is that tags are mutable. If you pin @v3 or @v4, the action author can move that tag to a new commit at any time, and your pipeline picks it up on the next run. There have been real supply chain compromises in the GitHub Actions ecosystem where attackers did exactly that, repointing the tags of a popular action at malicious code. A full commit SHA is immutable. The SHA you pinned is the code that will run, until you explicitly update the pin.

To get the SHA for any action, run git ls-remote https://github.com/<org>/<action> refs/tags/<version>. If the tag is annotated, that prints the SHA of the tag object rather than the commit; ask for refs/tags/<version>^{} to get the commit SHA it points to. Renovate and Dependabot can both manage SHA pins and keep the trailing version comments in sync when you update. You don’t lose the convenience of automated updates; you just make the updates explicit and reviewable.

PR plan and main-branch apply

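The workflow has two jobs. The plan job runs on every pull request and produces a Terraform plan for review. The apply-dev job runs only on push to main. It doesn’t reuse the tfplan file from the pull request, which is gone along with that job’s runner; terraform apply -auto-approve computes a fresh plan against the merged code and applies it. This is the core GitOps pattern: nothing changes in PCD until a PR is merged, and the merge is the trigger.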

This example shows a single environment to keep the focus on the GitOps loop itself. Promoting changes through additional environments (dev → staging → prod, with manual approvals on the higher-risk apply jobs) is a worthwhile follow-on topic that I’d like to cover separately.

Worth noting: this workflow doesn’t include the Terraform plan as a comment on the PR. GitHub Actions can do that with a bit of extra glue, and it’s a common refinement. I’m keeping this example minimal.

What a Change Looks Like in Practice

The pipeline above is abstract until you watch a real change move through it. Here’s what a typical day looks like.

Say a platform engineer needs to deploy network baseline infrastructure into a tenant called analytics-team for the dev environment. The tenant itself has already been provisioned by the cloud admin team, and the engineer has been given the tenant ID and the credentials scoped to it. The tenant-network-baseline module already exists, so the change is just a new call to that module. They edit one file, environments/dev/main.tf, and add a module block:

HCL
module "analytics_team" {
  source = "../../modules/tenant-network-baseline"

  name_prefix         = "analytics-team"
  tenant_id           = var.analytics_team_tenant_id
  external_network_id = var.external_network_id
  internal_cidr       = "10.10.1.0/24"
  dns_nameservers     = ["1.1.1.1", "8.8.8.8"]
}

That’s the entire change. They push the branch and open a PR. GitHub shows the diff: a handful of lines added in one file, nothing else touched. The reviewer can see exactly what’s being requested.
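The module block reads two root-level variables that this post doesn’t show. As a sketch, assuming they’re declared in environments/dev/variables.tf and given values in environments/dev/terraform.tfvars, the declarations might look like the following. The descriptions are mine; if a variable isn’t declared yet, the PR would carry its declaration as well.

```hcl
# environments/dev/variables.tf (sketch; descriptions are assumptions)

variable "analytics_team_tenant_id" {
  description = "ID of the analytics-team tenant, supplied by the cloud admin team."
  type        = string
}

variable "external_network_id" {
  description = "ID of the external network used by routers in the dev environment."
  type        = string
}
```

terraform.tfvars would then assign each one on its own line, for example external_network_id = "<external-network-uuid>", where the value is a placeholder for your environment’s ID.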

The pipeline picks up the PR and runs the plan job against dev. The dev plan shows the new resources Terraform intends to create for this change and looks something like:

Plaintext
Terraform will perform the following actions:

  # module.analytics_team.openstack_networking_network_v2.internal will be created
  + resource "openstack_networking_network_v2" "internal" {
      + name           = "analytics-team-internal"
      + admin_state_up = true
      + tenant_id      = "0988b8fbbe394a16b86e03640104e742"
    }

  # module.analytics_team.openstack_networking_subnet_v2.internal will be created
  + resource "openstack_networking_subnet_v2" "internal" {
      + name            = "analytics-team-internal-subnet"
      + cidr            = "10.10.1.0/24"
      + ip_version      = 4
      + dns_nameservers = ["1.1.1.1", "8.8.8.8"]
    }

  # ... router, router interface, security group, security group rules

Plan: 7 to add, 0 to change, 0 to destroy.

The plan output sits in the GitHub Actions run log. The reviewer can click into the run, read the plan, and confirm it matches the PR description.

Once the reviewer approves and the PR is merged, the push to main triggers the apply-dev job. The job runs terraform apply against dev, which creates the seven new resources inside the analytics-team tenant. The job finishes green.

At this point, the analytics-team tenant has its baseline networking. The audit log on the PCD side shows the resources being created by the pipeline’s service account. The history on the Git side shows the PR, the approver, the merge commit, and the workflow run that produced the change. Together those two records tell the complete story: who asked, who approved, what the system did.

The thing worth internalizing is that the merge is the change. Before the merge, the new infrastructure is a proposal. After the merge, it’s a fact, and the pipeline’s job is to make the live system match what’s in main. Rollback works the same way: open a PR that removes the module block, get it reviewed, merge. The pipeline plans the deletion, applies it, and the resources are gone. The audit trail shows the rollback the same way it showed the original creation.

Secrets Handling

The pipeline needs PCD credentials to run. Specifically, it needs a clouds.yaml file with valid credentials scoped to the tenant the Terraform code is managing. In the workflow above, clouds.yaml is stored as a GitHub Actions secret and written to disk at the start of each job.

The pattern is: secrets live in the pipeline’s secret store, they’re materialized at runtime, and they’re never committed to the repo. How you manage the secret store itself varies. GitHub Actions secrets work for most teams. Vault, AWS Secrets Manager, and other external stores are options if you need more structure.

A few principles that hold regardless of the store:

  • The account in clouds.yaml should be scoped to the target tenant with the least privilege needed. The pipeline’s service account doesn’t need cloud admin rights, just an admin role on the tenant it’s managing.
  • Credentials should be rotated on a schedule, and the rotation should be automated where possible.
  • Separate environments should use separate credentials. Dev and prod should not share an account.
  • Application credentials are a better fit than user passwords for this use case. They’re tied to a specific user and tenant at issue time, can be revoked independently, and don’t require storing the user’s actual password in the pipeline secret store.

State Management: The Second-Apply Problem

The examples above use Terraform’s default local state backend, which means the state file is created on the GitHub Actions runner during a pipeline run and discarded when the runner exits. The first apply works because Terraform creates resources from scratch, writes its idea of the world to a state file, then the runner goes away. The state file goes with it.

The next pipeline run starts on a fresh runner with no state file. Terraform reads the configuration, sees module calls for resources that should exist, has no state to check them against, and concludes “I should create all of these.” But they already exist in PCD from the previous run. The apply hits “name already in use” or similar 409 errors and bails partway through.

This is the most common gotcha when adopting GitOps for Terraform. The pipeline pattern looks complete on paper but falls apart on the second iteration because there’s no durable record of what was created.

Two ways to handle this:

The lab-grade workaround is to recreate from scratch each time you iterate. Delete the resources in PCD between pipeline runs (pcdctl network delete, pcdctl router delete, and so on), let the pipeline apply against a clean tenant, repeat. Acceptable while you’re learning the pattern; not acceptable for anything you care about.

The right answer is a remote state backend. Terraform writes its state file to a shared, durable location instead of the runner’s local disk. Every pipeline run reads the same state file at the start and writes back to it at the end. The pipeline now has continuity across runs.

Production use of this pattern requires a remote state backend with three properties:

  • Remote: not on the runner’s local disk.
  • Locked: concurrent runs can’t corrupt the state by writing simultaneously.
  • Encrypted: the state file contains resource IDs and sometimes secrets, and shouldn’t be readable by anyone with bucket access.

Common backend choices are HCP Terraform (formerly Terraform Cloud), an S3-compatible object store with locking (DynamoDB-style, or the S3 backend’s own lock file in recent Terraform releases), GitLab-managed state if you’re on GitLab, or a Postgres backend if you have a managed Postgres available.

I’m not going to walk through configuring a specific backend here because the right choice depends on your environment. What matters is that you choose one before you take this pattern past the lab. The Terraform documentation on backends is the right starting point: https://developer.hashicorp.com/terraform/language/backend
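For orientation only, here is roughly what a backend block with those three properties can look like, using the S3 backend as one example. The file name, bucket, key, and region are placeholders, and use_lockfile assumes a Terraform release recent enough to support S3-native locking. Treat it as a sketch of the shape, not a recommendation for your environment.

```hcl
# environments/dev/backend.tf
# Hypothetical file: bucket, key, and region are placeholders.

terraform {
  backend "s3" {
    bucket       = "example-terraform-state"
    key          = "pcd/dev/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true # server-side encryption for the state object
    use_lockfile = true # lock the state so concurrent runs can't overwrite each other
  }
}
```

The runner also needs credentials for the bucket, handled the same way as the PCD credentials. A non-AWS, S3-compatible store needs additional endpoint settings that I’ve left out here.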

Out of Scope: Drift Detection, Adopting Existing Environments, and Tenant Provisioning

Three more topics worth mentioning briefly, because they’ll come up once you start running this pipeline for real, but all three are out of scope for this post.

Drift detection. Drift detection is a scheduled job that runs terraform plan and fails if the diff is non-zero. Passing -detailed-exitcode makes that easy to script: the command exits with 2 when the plan contains changes and 0 when it doesn’t. It catches changes that happened outside the pipeline, like somebody logging into the admin portal and clicking something. It’s a governance concern, and a useful one, but it’s not part of the core CI/CD loop. A pipeline is valuable on day one without it. Worth adding once the pipeline is in production and you want to close the “nobody should be clicking in the portal” loop.

Adopting an existing PCD environment. Most readers won’t be starting from a greenfield tenant. They’ll have existing networks, subnets, and security groups created through the admin portal or earlier scripts, and they need to bring those under Terraform management without destroying and recreating them. Terraform supports this natively through import blocks combined with terraform plan -generate-config-out, which together let you generate resource configuration from live state and pull existing resources into the state file. The mechanics are involved enough to warrant their own post, which I’d like to cover later in this series.
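To give a feel for the shape of that workflow, here is a minimal import block for a single existing network. The file name, the resource address legacy_internal, and the ID are all hypothetical placeholders; a real adoption would need one block per existing resource.

```hcl
# environments/dev/imports.tf
# Hypothetical file: the address and ID below are placeholders.

import {
  # Address the resource should have in Terraform once it is adopted.
  to = openstack_networking_network_v2.legacy_internal

  # ID of the network that already exists in PCD.
  id = "<existing-network-uuid>"
}
```

Running terraform plan -generate-config-out=generated.tf against this writes a matching resource block into generated.tf, which must not already exist. After that block is reviewed, the next apply records the network in state without recreating it.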

Tenant provisioning. This post assumes the tenant already exists. Creating tenants is a separate workflow, run by cloud admins with broader credentials, and it has its own authorization concerns that are different from the platform-team pipeline shown here. I’d like to cover the cloud-admin side as a separate post.

All three of these are good candidates for follow-on posts.

Wrapping Up

That’s the GitOps pattern for tenant infrastructure on PCD in its minimal form: a repo with environment-separated Terraform code, a pipeline that plans on PR and applies on merge, pinned actions and runners for supply chain hygiene, and secrets handled through the pipeline’s secret store. The pipeline runs scoped to a single tenant and manages the network and security baseline that an application team builds on top of. Take it past the lab and the next thing you’ll need is a remote state backend.

In the next post, I’d like to cover the handoff from Terraform to Ansible: how the OpenStack dynamic inventory plugin works, what Terraform needs to emit for Ansible to find the right VMs, and the SSH timing race that shows up when a VM is reachable on the network before cloud-init has finished.

Author

  • Damian Karlson

    Damian leads technical product marketing and community engagement for Private Cloud Director & vJailbreak. Prior to joining Platform9, he had many years at VMware, EMC, and Dell focused on delivering powerful cloud solutions & services.
