Platform9 Edge Cloud Release Notes

What's new 2023-12-01 Platform9 Edge Cloud 5.3 LTS Patch #15

Platform9 Kube Version: 1.20.15-pmk.2218

Airctl Release Build Version: v-5.3.0-3085281

Security Fixes

Fixed The Python runtime for the host agent was upgraded from v3.9.10 to v3.9.18 (the latest available patch release for Python 3.9).

This version of Python contains a fix for CVE-2023-24329, backported to Python 3.9 from Python 3.11.

What's new 2023-06-09 Platform9 Edge Cloud 5.3 LTS Patch #14

Platform9 Kube Version: 1.20.15-pmk.2218

Airctl Release Build Version: v-5.3.0-2710638

Security Fixes

Fixed Resolved the SWEET32 vulnerability in the kube-rbac-proxy container running in the hostplumber pod by specifying a set of secure SSL ciphers and setting the TLS minimum version to 1.2. (This had already been fixed in the luigi-image-controller pod with patch #13.)
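For context, kube-rbac-proxy exposes this hardening through its --tls-min-version and --tls-cipher-suites flags. The exact cipher list shipped in this patch is not reproduced in these notes; the following is only a sketch using common Go cipher-suite names.

Bash
# Illustrative kube-rbac-proxy arguments (sketch only; the cipher set applied
# by the patch may differ):
kube-rbac-proxy \
  --tls-min-version=VersionTLS12 \
  --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384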

The full list of vulnerabilities reported with the latest Trivy scan is here. Compare this to the previous 13.0 patch and its list of vulnerabilities here.

Bug Fixes

Fixed Host-agent certificates will be renewed on upgrading to this patch (or higher) from a previous release if they are going to expire in less than 365 days (1 year). To force regeneration of host-agent certificates regardless of their remaining validity, and to extend their validity by 3 years when upgrading from patch #12 or lower, follow the steps in the knowledge-base article How to re-generate host certificates before the upgrade.

Fixed The MongoDB container running in the management plane is now stopped at the end of a DU upgrade operation.

Known Issues

Known Issue The cluster CA cannot be rotated with the current version of vault running on the DU. Instead, please fix the TTL on the certs as described in the knowledge base article here.

What's new 2023-05-15 Platform9 Edge Cloud 5.3 LTS Patch #13

This patch is now treated as dead-on-arrival (DOA), and is no longer supported. Please upgrade directly to patch #14 instead.

Platform9 Kube Version: 1.20.15-pmk.2214

Airctl Release Build Version: v-5.3.0-2674451

Bug Fixes

Fixed Security fixes for all Platform9 components. The full list of vulnerabilities reported with the latest Trivy scan is here. Compare this to the previous 12.1 patch and its list of vulnerabilities here.

Fixed (#1452425, 1453440, 1453539, 1453548, 1453605, 1453661) Resolved an issue during upgrade where new hostagent certs were not being generated even if they were expiring.

Fixed (#1452736) Added secure ciphers for ports 10257, 10259 and 8443.

Fixed (#1393494) Kubelet now has the --protect-kernel-defaults argument set to true.
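As a quick spot check on a node (a sketch, not a Platform9-specific procedure), the running kubelet command line can be inspected for the flag:

Bash
# Confirm the kubelet was started with kernel-defaults protection enabled
ps -ef | grep '[k]ubelet' | tr ' ' '\n' | grep -- '--protect-kernel-defaults'
# Expected output: --protect-kernel-defaults=true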

Fixed (#1393494, 1393479) The Kubernetes API server now has the audit-log arguments set appropriately.
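The specific values applied by the patch are not listed in these notes. A CIS-style configuration of the relevant kube-apiserver flags looks roughly like the sketch below; the flag values are the commonly recommended ones (not necessarily those shipped), and the log path shown is the one referenced under #1393493 in the patch #12 notes further down.

Bash
# Sketch of the standard kube-apiserver audit-log flags checked by kube-bench
kube-apiserver \
  --audit-log-path=/var/opt/pf9/kube/apiserver-config/audit.log \
  --audit-log-maxage=30 \
  --audit-log-maxbackup=10 \
  --audit-log-maxsize=100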

Fixed Expiring CAs and certs will be rotated on upgrade to patch 13. To rotate certs on existing setups, please refer here.

Known Issues

Known Issue The cluster CA cannot be rotated with the current version of vault running on the DU. Instead, please fix the TTL on the certs as described in the knowledge base article above.

Known Issue Hostagent certificates do not get automatically renewed on upgrading to this patch, since the hostagent version is the same as in patch #12.1.

What's new in Supplemental Patches #12.1 & #6.1

2022-07-05 Platform9 Edge Cloud 5.3 LTS Patch #12.1

Platform9 Kube Version: 1.20.15-pmk.2124

Airctl Release Build Version: v-5.3.0-2075501

2022-06-27 Platform9 Edge Cloud 5.3 LTS Patch #6.1

Platform9 Kube Version: 1.20.11-pmk.2119

Airctl Release Build Version: v-5.3.0-2043507

Bug Fixes

Fixed (#1403090) To troubleshoot an issue where Pods were rebooted after an etcd encryption MOP was applied, two updates were added to this patch: a more robust API health check, and the addition of logging for the master and API containers.

  1. Robust API Health Check - Moved the health check from the deprecated healthz endpoint to livez to improve the data gathered on the health of the Kube API server.

    1. Validation: Check that /opt/pf9/pf9-kube/vrrp_check_apiserver.sh is using the https://127.0.0.1:443/livez endpoint (see the quick-check sketch after this list). This should improve health check reliability and will be used by both keepalived and nodelet phases for status checks.
    2. Reference: https://kubernetes.io/docs/reference/using-api/health-checks/
  2. Logging for Master & API Containers - This patch starts writing logs from kube-apiserver, kube-controller-manager and kube-scheduler on master nodes to the /var/log/pf9/{kube-apiserver,kube-controller-manager,kube-scheduler} locations. Each of these directories will have a .INFO, .ERROR and .WARNING file logging the corresponding log levels. For example, the file names for kube-controller-manager are kube-controller-manager.ERROR, kube-controller-manager.INFO and kube-controller-manager.WARNING.
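A quick way to spot-check both changes on a master node (a sketch based on the paths mentioned above):

Bash
# Confirm the keepalived health-check script now targets the livez endpoint
grep livez /opt/pf9/pf9-kube/vrrp_check_apiserver.sh

# Query the endpoint directly (-k because the API server uses an internal CA)
curl -k "https://127.0.0.1:443/livez?verbose"

# Check that the new control-plane log directories are being populated
ls /var/log/pf9/kube-apiserver /var/log/pf9/kube-controller-manager /var/log/pf9/kube-scheduler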

Patch #6.1 is intended to be applied directly to patch #6 installs. However, once it is applied to the environment, the upgrade path must be to patch #12.1, a cumulative update that includes all fixes from patches #7 - #12.

What's new 2022-04-13 Platform9 Edge Cloud 5.3 LTS Patch #12

Platform9 Kube Version: 1.20.15-pmk.2100

Airctl Release Build Version: v-5.3.0-1911578

This build resolves an issue in the previous build when upgrading from a prior patch version.

Bug Fixes

Fixed (#1389145) Resolved a Whereabouts issue where Pods received duplicate IP addresses. The issue was caused by the IP reconciler incorrectly removing a newly created pod reference from the Whereabouts IPPool when a pod was rescheduled after a node reboot, which could lead to duplicate IP assignments.

Fixed (#1390411 & #1397032) Resolved an issue where the ip-reconciler job was failing because it could not retrieve IP pools, and failed to update the reservation list. An upstream bug has also been filed with Whereabouts concerning this issue: ip-reconciler fails with "context deadline exceeded" while listing IPPools · Issue #172 · k8snetworkplumbingwg/whereabouts

Fixed (#1393457) Resolved an issue where Whereabouts v0.4.7 was not clearing IPPools.

Fixed (#1390330) Resolved an issue where cluster pods were seen entering an ImagePullBackOff state during a cluster upgrade from K8s v1.19 to K8s v1.20, after a DU upgrade from v5.1 to v5.3 LTS with a private registry enabled.

Fixed (#1386405) Fixed an issue where cluster nodes were sending ICMP requests to 8.8.8.8 (Google DNS) at regular intervals in an on-premises deployment. This was due to the network status (ping-report) check that is part of the pf9-muster service on every node. The issue was resolved by removing the line "task pingmaker" from muster.conf.

Fixed (#1384966) Resolved an issue where the exported kubeconfig was not working due to an authentication error when Keystone is unavailable, which is essentially a dependency on the DU being up. This applies to the following scenarios:

  1. When Keystone is enabled, the kubeconfig cannot be used if the DU is powered off.
  2. When Keystone is disabled during cluster creation (with the "keystoneEnabled" flag), the downloaded kubeconfig does work for cluster operations even when the DU is powered off. The kubeconfig will use the username/password for Keystone if Keystone is enabled; only when it is disabled do you get a kubeconfig with just the certificates.

The fix is provided for the following use cases:

  1. Token
  2. User/Password
  3. Certificate based (w/RBAC)

https://platform9.com/docs/kubernetes/kubeconfig-through-api

"If the force_cert_auth query param is set to true, the kubeconfig will contain certificate-based authentication; otherwise it will be token based."

NOTE: See #1397831 to view the steps to manually set the expiry of the client certificate in the Kubeconfig.
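For illustration, a kubeconfig download through the API looks roughly like the sketch below; the endpoint path and the use of the force_cert_auth parameter are assumptions here and should be confirmed against the linked kubeconfig-through-api documentation.

Bash
# Hypothetical sketch -- verify the path against the kubeconfig-through-api docs
curl -s -H "X-Auth-Token: $TOKEN" \
  "https://$DU_FQDN/qbert/v4/$PROJECT_ID/kubeconfig/$CLUSTER_UUID?force_cert_auth=true" \
  -o kubeconfig.yaml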

Fixed (#1393364) Fixed a bug where, after a cluster upgrade from K8s v1.19 to K8s v1.20, the etcd-backup cronjob resource was not recreated and thus continued to create jobs/pods with the older image, referring to etcd version 3.3.22 from K8s v1.19.

Fixed (#1390347) This fix resolves several issues. First, it makes the CPU & memory limits of the Calico-related pods configurable via API on existing cluster resources without requiring an upgrade (e.g. to address CPUThrottling in Calico pods). It also resolves an issue where the addon operator's memory usage kept increasing until it reached the upper limit and was OOM-killed within a few hours.

How to edit limits for Calico Pods

  1. Create a cluster with the POST API. (The Calico-related fields in the body are included as an example.)
YAML
  2. Update an existing cluster with the PUT API.
YAML
  3. Restart the stack on the master node (see the sketch after this list).
Bash
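The original Bash snippet for step 3 is not reproduced here. Conceptually it restarts the PMK stack via nodelet phases, the same mechanism referenced under patch #11 (#1385115); a hedged sketch follows, in which the binary path is an assumption.

Bash
# Sketch only: restart the PMK stack on the master node using nodelet phases
sudo /opt/pf9/nodelet/nodeletd phases restart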

For API reference: https://platform9.com/docs/v5.3/qbert/ref#putupdate-the-properties-of-a-cluster-specified-by-the-cluster-u

  • In addition to the above, this patch includes the fix for a bug where the metrics-server does not respect Kubernetes CPU limits. To fix this issue, edit the metrics-server deployment and increase the --cpu parameter for the metrics-server-nanny container. A good value for most clusters is 100m. See the example below.
Bash
Current resources set for metrics-server-nanny container
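A hedged sketch of that manual edit (the deployment name and namespace are assumptions and may differ in a given cluster):

Bash
# Open the metrics-server deployment and raise the nanny container's CPU setting
kubectl -n kube-system edit deployment metrics-server
# In the metrics-server-nanny container's args, set:
#   --cpu=100m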

How to edit limits for Metrics-Server pod

Steps to manually change the override parameters for the metrics-server pod:

  1. Use the API for patching the ClusterAddon object of metrics-server.
Export Arguments
Export Arguments Example
  2. Create a metric.json file. An example metric.json with the override parameters that can be updated is shown below.
Example JSON File
PATCH API Call

Using Kubectl

  1. Log in as root to the DU VM.
ClusterAddon Edit Example
  2. Update the metricsMemoryLimit & metricsCpuLimit accordingly.
YAML Example
  3. On the master node, edit the pf9-addon-operator deployment to use the 3.2.3 addon image. After a few seconds/minutes, the metrics-server pod will be running with the desired values.

Note: Simply updating the deployment to use the 3.2.3 addon image is also sufficient; steps 1 and 2 are not required in that case, since a default value of 100m is applied to the metrics-server-nanny container's --cpu parameter.
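A hedged sketch of that deployment edit (the namespace, container name, and image reference are assumptions; adjust them to your environment):

Bash
# Illustrative only: point the addon operator at the 3.2.3 addon image
kubectl -n pf9-addons set image deployment/pf9-addon-operator \
  pf9-addon-operator=<your-registry>/pf9-addon-operator:3.2.3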

Fixed (#1393493) Resolved an issue with critical kube-bench scan failures related to API server flags. Previously the fix could not be applied because the apiserver-config volume mount was read-only, resulting in "audit plugin can't open new logfile: open /var/opt/pf9/kube/apiserver-config/audit.log: read-only file system".

Verified with the following API flags and YAML.

Bash
YAML

Fixed (#1385068) Resolved an issue where pf9-monitoring namespace pods were not deployed even though monitoring was enabled, because the forwarder used the insecure k8sApiPort when Keystone is disabled during cluster creation.

Fixed (#1385072) Resolved an issue where Grafana was not available in an IPv4 cluster even though monitoring was enabled, because the forwarder used the insecure k8sApiPort when Keystone is disabled during cluster creation.

Fixed (#1385073) Resolved an issue where Grafana creates a security issue by sending statistics to unknown internet domains.

Fixed (#1393228) Resolved an issue where services in the DU failed and the DU did not show clusters, because the bouncer container resolved localhost to an IP from the DNS server instead of from the /etc/hosts file.

Enhancements & Updates

Added (#1393389) Added support for Python 3.9.0 to address Python 3.6 reaching End-of-Life.

Added (#1384940) Added an API-based procedure to create a user with a different role and generate a kubeconfig for that user with specific RBAC roles. This solves the problem of a non-admin user's kubeconfig containing the admin user's key and certificate when Keystone is disabled in the cluster.

Enhanced (#1397831) Added the capability to set the validity of the client certificate in the kubeconfig. Patch #12 provides a way to manually set the expiry of the client certificate in the kubeconfig with the Qbert parameter certExpiryHrs, which can be passed during cluster creation. The default validity of the client certificate remains 24 hours if the certExpiryHrs parameter is not passed during cluster creation.

Procedure

  1. During cluster creation, pass an integer value for certExpiryHrs within the cluster creation payload, for example "certExpiryHrs": 36.
Example Payload
API to fetch nodePoolUuid for type: local
  2. Create a cluster using the above payload:
Create API Call
Create API Call Example
  3. Once the cluster is ready, generate a Kubeconfig for the user with a client certificate:
API
API Example
  4. Verify the validity of the client certificate with the following command:
Bash
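The verification command itself is not reproduced above; a sketch that performs the equivalent check by decoding the embedded client certificate (the kubeconfig file name is illustrative):

Bash
# Print the validity window of the client certificate embedded in the kubeconfig
grep client-certificate-data kubeconfig.yaml | head -1 | awk '{print $2}' \
  | base64 -d | openssl x509 -noout -startdate -enddate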

Validation

  1. Certificate expiry obtained from the decoded PEM certificate:
Example Output
  2. Check the certExpiryHrs parameter value set for the cluster:
GET API Call to check set "certExpiryHrs" value
GET API Call Example to check set "certExpiryHrs" value

Added (AIR-304) Upstream contribution to Whereabouts to resolve the issue "ip-reconciler: StatefulSet recreation while node down may result in the wrong IP cleaned up".

Added (AIR-322) Proactively identified and addressed high and critical security vulnerabilities, and updated older software components to resolve security and bug concerns.

What's new 2022-01-14 Platform9 Edge Cloud 5.3 LTS Patch #11

Platform9 Kube Version: 1.20.11-pmk.2038

Airctl Release Build Version: v-5.3.0-1806225

Bug Fixes

Fixed (#1390330) Resolved an issue where a 5.1 to 5.3 LTS and 1.19 to 1.20 upgrade with Private Registry enabled resulted in cluster pods going into an ImagePullBackOff state due to a missing comms registry.json file.

Fixed (#1385115) Added the ability to skip the gen_certs phase when the DU VM is offline and the PMK stack is restarted using nodelet phases. This resolves an issue where, when the DU VM is offline and the PMK stack is restarted explicitly using nodeletd phases restart, the stop action calls the gen_certs phase, which removes the certificates, and the start action then fails to fetch new ones because the DU is unavailable. Note: after a node reboot, the pf9-nodeletd service will skip running the stop and start functions of the gen_certs phase to avoid reaching out to the DU VM.

Fixed (#1346071) Fixed an issue where pf9-kube failed to start when a node was rebooted while the DU was absent. With this fix implemented, on-premises DUs can be offline for extended durations and pf9-kube will restart correctly if the node is rebooted.

What's new 2021-12-22 Platform9 Edge Cloud 5.3 LTS Patch #10

Platform9 Kube Version: 1.20.11-pmk.2032

Bug Fixes

Fixed (#1385819) Fixed an issue when the CoreDNS pod is hard shut down: there is an approximately 6-minute window (time for the kubelet to detect the node is not ready plus the 5-minute pod eviction timeout) with no working CoreDNS, so cluster DNS resolution fails, resulting in downtime. Implemented the DNS autoscaler in 5.3/1.20 to increase the minimum number of CoreDNS pods to 2 on different nodes to enable high availability. For those who want to scale CoreDNS replicas, this add-on takes care of scaling CoreDNS replicas depending on the number of cores/nodes in the cluster.

With patch #10, the autoscaler is deployed by default with a replica size of 2. This also means that we no longer deploy the CoreDNS pods directly; they are entirely managed by the DNS autoscaler.

The DNS autoscaler is deployed by default along with CoreDNS.

The default autoscaling params are as follows:

MinReplicas = 1
MaxReplicas = 10
PreventSinglePointFailure = true (on clusters with 2 or more nodes, ensures at least 2 CoreDNS replicas)
NodesPerReplica = 16
CoresPerReplica = 256

Note that the values of both coresPerReplica and nodesPerReplica are floats. The idea is that when a cluster uses nodes that have many cores, coresPerReplica dominates; when a cluster uses nodes that have fewer cores, nodesPerReplica dominates. The fields schedule 1 replica per X nodes (16 in the example shown above). This is all bounded by the MinReplicas and MaxReplicas fields. So if you want 2 replicas even though you have 32 or fewer nodes (assuming NodesPerReplica is 16), set MinReplicas to 2. MaxReplicas prevents scheduling too many replicas, even on a setup with very many nodes (or cores). The default polling period is set to 5 minutes.
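For reference, these parameters match the upstream cluster-proportional-autoscaler's linear mode, where the replica count is the larger of the two ratios, clamped between MinReplicas and MaxReplicas. A small sketch of that calculation (an illustration of the bounding behavior described above, not the shipped code):

Bash
# Linear-mode replica calculation (illustrative values)
cores=64; nodes=4
coresPerReplica=256; nodesPerReplica=16; minReplicas=1; maxReplicas=10
byCores=$(( (cores + coresPerReplica - 1) / coresPerReplica ))   # ceil(64/256) = 1
byNodes=$(( (nodes + nodesPerReplica - 1) / nodesPerReplica ))   # ceil(4/16)   = 1
replicas=$(( byCores > byNodes ? byCores : byNodes ))
(( replicas < minReplicas )) && replicas=$minReplicas
(( replicas > maxReplicas )) && replicas=$maxReplicas
# PreventSinglePointFailure additionally forces at least 2 replicas on multi-node clusters
echo "replicas=$replicas"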

If you wish to change the CoreDNS or autoscaling params from the defaults, you can modify the CoreDNS ClusterAddon resource via the Qbert Sunpike API:

$DU_FQDN/qbert/v4/$PROJECT_ID/sunpike/apis/sunpike.platform9.com/v1alpha2/namespaces/default/clusteraddons/$CLUSTER_UUID-coredns

JSON

Save the above JSON example to a file named "coredns.json". Note that the example changes the defaults of CoresPerReplica, NodesPerReplica, MinReplicas, and MaxReplicas; do not change these fields if you wish to use the defaults.

An example PATCH call:

Bash
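The example call itself is not reproduced above; a hedged sketch of such a PATCH against the URL shown earlier (the auth header and content type are assumptions and should be verified for your DU):

Bash
# Sketch of a merge-patch against the coredns ClusterAddon using coredns.json
curl -s -X PATCH \
  -H "X-Auth-Token: $TOKEN" \
  -H "Content-Type: application/merge-patch+json" \
  --data-binary @coredns.json \
  "https://$DU_FQDN/qbert/v4/$PROJECT_ID/sunpike/apis/sunpike.platform9.com/v1alpha2/namespaces/default/clusteraddons/$CLUSTER_UUID-coredns"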

You can verify the ClusterAddon was modified successfully by making a GET call to:

https://$DU_FQDN/qbert/v4/$PROJECT_ID/sunpike/apis/sunpike.platform9.com/v1alpha2/namespaces/default/clusteraddons

Verify that the status phase for the Addon shows "Installed". You should also see a kube-dns-autoscaler Deployment in your cluster, along with the CoreDNS pods scaled accordingly.
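For example, the scaled state can be confirmed from the cluster itself (the kube-system namespace is an assumption for where the autoscaler and CoreDNS run):

Bash
# Confirm the autoscaler Deployment exists and the CoreDNS pods have been scaled
kubectl -n kube-system get deployment kube-dns-autoscaler
kubectl -n kube-system get pods | grep coredns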

Fixed (#1385846) Resolved an issue where the PMK stack upgrade was not sequential due to a race condition between Nodelet & Sunpike/Qbert. In this scenario, Nodelet reported a premature positive status for the cluster upgrade of the 1st worker node before it ran into Docker mirror fetching issues. Qbert/Sunpike then started the 2nd worker node cluster upgrade based on the incorrect positive status of the 1st worker node, resulting in both worker nodes ending up in a not-ready state.

What's new 2021-12-14 Platform9 Edge Cloud 5.3 LTS Patch #9

Platform9 Kube Version: 1.20.11-pmk.2013

Enhancements & Updates

Added (#1385242) Added the ability to schedule the IP-reconciler job on a specific node using a node selector with the whereabouts plugin. The new field is called ipReconcilerNodeSelector, and is located within the whereabouts plugin section.

As an example, the YAML below deploys Whereabouts, along with the ip-reconciler CronJob on a 3-minute schedule, only on nodes with the label "foo=bar".

YAML

Added (#1385819) Implemented the DNS autoscaler in 5.3/1.20 to increase the minimum number of CoreDNS pods to 2 on different nodes to enable high availability. For those who want to scale CoreDNS replicas, this add-on takes care of scaling CoreDNS replicas depending on the number of cores/nodes in the cluster. Note that it does not scale replicas based on load, but based on the cluster size. In this iteration of the fix, the admin was required to explicitly enable the DNS autoscaler add-on on the DU; the default was still to not deploy it.

Enhanced (#1385115) Added the ability to skip the gen_certs phase when the DU VM is offline (in airgapped mode) and the PMK stack is restarted using nodelet phases. This resolves an issue where, when the DU VM is offline and the PMK stack is restarted explicitly using nodeletd phases restart, the stop action calls the gen_certs phase, which removes the certificates, and the start action then fails to fetch new ones because the DU is unavailable. Note: after a node reboot, the pf9-nodeletd service will skip running the stop and start functions of the gen_certs phase to avoid reaching out to the DU VM.

Enhanced Finalizers were added to the Luigi controller for proper cleanup. Now, when the NetworkPlugins CRD is deleted and the operator removed, all managed network plugins, and the resources they created, are also deleted.

Bug Fixes

Fixed (#1357918) Fixed an issue where some Docker images included the tag "latest" instead of a proper version number. This could result in the installation of an undesired image (wrong version), with a negative impact on the platform.

  1. quay.io/operator-framework/configmap-operator-registry latest c22c74d5b16d
  2. nfvpe/sriov-device-plugin latest d5ce5066357b
  3. nginx latest 7ce4f91ef623
  4. nfvpe/sriov-cni latest 6ac3016f3d1b
  5. platform9/whereabouts latest a7b49560761b
  6. xagent003/whereabouts latest a7b49560761b

Fixed (#1385079) Fixed an issue where the etcd database size on master nodes had reached 2.1 GB, causing etcd to stop serving new read/write requests due to database size saturation. This fix changes the auto-compaction mode to periodic and sets the retention value to 10 days (240h):

ETCD_SNAPSHOT_COUNT=10000
ETCD_QUOTA_BACKEND_BYTES=6442450944
ETCD_AUTO_COMPACTION_MODE="periodic"
ETCD_AUTO_COMPACTION_RETENTION="240h"
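To inspect the database size on a master node after applying the patch, etcdctl's endpoint status can be used; the endpoint and certificate paths below are placeholders and will vary by deployment.

Bash
# Report the etcd DB size (see the DB SIZE column); cert paths are illustrative
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4001 \
  --cacert=/path/to/etcd/ca.crt \
  --cert=/path/to/etcd/client.crt \
  --key=/path/to/etcd/client.key \
  endpoint status --write-out=table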

Fixed (#1385108) Fixed an issue where, when the password is changed from the UI, the airctl update-admin-password command has to be invoked with the password used in the UI to sync the new password on the DU; airctl get-creds will then return the required username and password. There is no way to sync a password changed in the UI back to airctl. However, if the airctl command is used to change the admin password first, the change is also reflected in the Keystone database, so the user does not have to update it through the UI.

  1. The password needs to be passed in single quotes through the CLI.
  2. There is no validation of the password in airctl when passed through the CLI, so set it according to the instructions/validations shown in the UI.
  3. When airctl update-admin-password is run without a password passed in, a new random password is generated.

Fixed (#1385883) Resolved an issue where the docker.repo file was not overwritten as part of a cluster upgrade starting with v1.20. This allows the managed Docker flag to be removed from the kube_override.env file so that PMK will install and start Docker.

Known Issues

Known Issue The PMK stack upgrade is not sequential due to a race condition between Nodelet & Sunpike/Qbert. In this scenario, Nodelet reports a premature positive status for the cluster upgrade of the 1st worker node before it runs into Docker mirror fetching issues. Qbert/Sunpike then starts the 2nd worker node cluster upgrade based on the incorrect positive status of the 1st worker node, resulting in both worker nodes ending up in a not-ready state. (This was fixed in patch #10 under #1385846.)
