Troubleshoot VM-to-VM (Same Network) connectivity failures

Problem

This guide provides instructions for troubleshooting network connectivity failures between two Virtual Machines (VMs) residing on the same logical network (subnet). In an OVN-backed environment, troubleshooting differs significantly depending on whether the two VMs are running on the same physical compute node or different compute nodes.

Environment

  • Private Cloud Director Virtualization - v2025.4 and Higher

  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher

  • Component - Networking Service

Deep Dive: Architecture & Packet Flow

To troubleshoot OVN effectively, you must understand the distinction between the "Brain," the "Muscle," and the "Wire," as well as exactly how packets traverse them.

  • OVN (The Brain): Runs in the Management Plane. It translates your intent (Logical Switches, Security Groups) into raw instructions.

    • Northbound DB (ovn-ovsdb-nb-0): Stores intent (Ports, ACLs).

    • Southbound DB (ovn-ovsdb-sb-0): Stores reality (Chassis bindings, MAC locations).

  • OVS (The Muscle): Runs on the compute node (ovs-vswitchd). It executes the actual forwarding of packets based on "Flow Rules" across the integration bridge (br-int).

  • Geneve (The Wire): The UDP tunnel (Port 6081) that encapsulates VM packets for cross-host transport.

How the Packet Flows (Same Network)

Depending on where the VMs live, the packet takes one of two distinct paths:

  • Path A: Same Host (Intra-Host)

    Source VM → Source Tap → br-int (Source Node) → OVS Connection Tracking (Security Groups) → br-int (Source Node) → Destination Tap → Destination VM

    (Note: Traffic never touches the physical network.)

  • Path B: Different Hosts (Inter-Host)

    Source VM → Source Tap → br-int (Source Node) → Geneve Encapsulation → Physical NIC → Physical Network (UDP 6081) → Physical NIC → Geneve Decapsulation → br-int (Destination Node) → Destination Tap → Destination VM
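
Which path applies follows directly from the chassis (compute node) each VM port is bound to. A minimal sketch of that decision, assuming the two chassis names have already been looked up in the Southbound DB; `classify_path` is a hypothetical helper name, not a product command.

```shell
# classify_path CHASSIS_A CHASSIS_B
# Hypothetical helper: prints which packet path applies for two VM ports,
# given the chassis each port is bound to (an empty string means unbound).
classify_path() (
    chassis_a=$1
    chassis_b=$2
    if [ -z "$chassis_a" ] || [ -z "$chassis_b" ]; then
        echo "unbound"        # at least one port has no chassis binding
    elif [ "$chassis_a" = "$chassis_b" ]; then
        echo "intra-host"     # Path A: traffic stays on br-int
    else
        echo "inter-host"     # Path B: traffic is Geneve-encapsulated
    fi
)
```

For example, `classify_path node1 node2` prints `inter-host`, which means the tunnel checks later in the procedure are relevant.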

Prerequisites: Executing OVN Commands

Refer to the existing guide for SaaS Alias setup or Self-Hosted kubectl exec instructions to run ovn-sbctl and ovn-nbctl commands.

Procedure

1. Variable Discovery & Pre-Flight Checks

You cannot troubleshoot flows without exact IDs. Gather these from the Management Plane before tracing.

  • Analysis: If the chassis lookup returns an empty chassis for either VM, the port is unbound and cannot send/receive traffic.

  • Logs to Check: On the expected hypervisor, check /var/log/pf9/ovn/ovn-controller.log. Look for claim failed or unrecognized port indicating the ovn-controller is refusing to bind the VM to the host.
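
A minimal sketch of the chassis lookup, assuming `ovn-sbctl` is reachable as described in the prerequisites and that the Neutron port ID of each VM is already known (for example from `openstack port list --server <VM_NAME>`). The function names are illustrative only.

```shell
# port_chassis PORT_ID
# Prints the UUID of the chassis the logical port is bound to.
# Output is empty when the port is unbound.
port_chassis() {
    ovn-sbctl --bare --columns=chassis find Port_Binding logical_port="$1"
}

# chassis_hostname CHASSIS_UUID
# Resolves a chassis UUID to the compute node hostname.
chassis_hostname() {
    ovn-sbctl get Chassis "$1" hostname
}

# is_bound CHASSIS_VALUE
# Succeeds only when the chassis lookup returned something.
is_bound() {
    [ -n "$1" ]
}
```

Typical use is `c=$(port_chassis <PORT_ID>); is_bound "$c" && chassis_hostname "$c"`, repeated for the source and destination ports.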

2. Identify the Tap Interface (Data Plane)

Log into the Compute Node hosting the Source VM to find the exact interface name dynamically.

Command (Executed on Source Compute Node):

Command (Executed on Destination Compute Node):
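
A minimal sketch for locating the interface on either node, assuming the Neutron port ID is known. OVS records the port ID in `external_ids:iface-id`, so the lookup does not depend on a naming scheme. The `tap` plus first-11-characters naming used by `tap_name_guess` is the usual OpenStack convention and is an assumption here; treat the `ovs-vsctl` result as authoritative.

```shell
# find_vif PORT_ID
# Prints the OVS interface name and OpenFlow port number for the port.
find_vif() {
    ovs-vsctl --columns=name,ofport find Interface external_ids:iface-id="$1"
}

# tap_name_guess PORT_ID
# Assumed convention: "tap" followed by the first 11 characters of the port ID.
tap_name_guess() {
    printf 'tap%.11s\n' "$1"
}
```

Record both the `name` (used for packet captures) and the `ofport` (used as `in_port` when tracing).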

3. Capture at the Source (The Tap)

Goal: Prove the packet is actually leaving the VM.

Command (Executed on Source Compute Node):

  • Success: Packets seen (e.g., IP <SOURCE_IP> > <DEST_IP>: ICMP echo request). The Guest OS is fine. Proceed to Step 4.

  • Failure: No packets seen. The issue is inside the VM's Guest OS (e.g., internal firewall or interface down).
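
A minimal capture sketch for this step, assuming the test traffic is a ping and that `tcpdump` is installed on the compute node. The helper names and the default packet count of 10 are illustrative choices.

```shell
# icmp_filter SRC_IP DST_IP
# Builds a pcap filter that matches ICMP between the two VMs only.
icmp_filter() {
    printf 'icmp and host %s and host %s\n' "$1" "$2"
}

# capture_icmp TAP SRC_IP DST_IP [COUNT]
# Captures on the tap with link-level headers (-e) so MAC addresses are visible.
capture_icmp() {
    tcpdump -nn -e -i "$1" -c "${4:-10}" "$(icmp_filter "$2" "$3")"
}
```

Run `capture_icmp <TAP> <SOURCE_IP> <DEST_IP>` while the ping is active inside the source VM.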

4. Physical Datapath Trace (ofproto/trace)

Goal: Ask OVS "Why are you dropping this packet?" based on its programmed flow rules.

Command (Executed on Source Compute Node):

Analysis: Look at the "Datapath actions:" line at the bottom of the output.

  • Output: Action: output:<OFPORT_B> (Success! Delivered locally to the destination tap).

  • Output: Action: set_tunnel:0x<VNI>, output:<TUNNEL_PORT> (Success! Sent to the Geneve tunnel for a remote host).

  • Drop: Action: drop. Note the cookie=0x... value in the trace output and proceed to Step 5.

  • Logs to Check: If OVS is dropping traffic that OpenStack says should pass, check /var/log/ovn/ovn-controller.log for OpenFlow programming errors (ofctrl_put errors).
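
A minimal sketch of the trace, assuming an ICMP test packet and the `ofport`, MAC, and IP values gathered earlier. `ovs-appctl ofproto/trace` is the standard OVS command; the wrapper names are hypothetical.

```shell
# trace_flow OFPORT SRC_MAC DST_MAC SRC_IP DST_IP
# Builds the flow description of a synthetic ICMP packet entering br-int.
trace_flow() {
    printf 'in_port=%s,icmp,dl_src=%s,dl_dst=%s,nw_src=%s,nw_dst=%s\n' "$@"
}

# run_trace OFPORT SRC_MAC DST_MAC SRC_IP DST_IP
# Asks OVS how it would forward that packet through br-int.
run_trace() {
    ovs-appctl ofproto/trace br-int "$(trace_flow "$@")"
}

# datapath_actions
# Reads trace output on stdin and prints only the final verdict.
datapath_actions() {
    sed -n 's/^Datapath actions: //p'
}
```

`run_trace <OFPORT_A> <MAC_A> <MAC_B> <IP_A> <IP_B> | datapath_actions` reduces the output to the single verdict line; run it without the filter to see the `cookie=0x...` values of the matched flows.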

5. Map the Drop to an OVN Logical Flow

Goal: Translate the physical OVS drop back into an OVN logical rule.

Command (Executed from Management Plane):

  • Analysis Matrix (Layer 2 Drops):

    • ls_out_acl / ls_in_acl (Security Group Drop): A Neutron Security Group is explicitly denying the traffic.

    • ls_in_port_sec_l2 / ls_in_port_sec_ip (Anti-Spoofing Drop): The Source VM is trying to transmit using a MAC or IP address that does not legitimately belong to its port (e.g., nested virtualization or unapproved static IPs).
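
A minimal sketch of the cookie lookup. It assumes the OpenFlow cookie equals the first 32 bits of the Southbound Logical_Flow UUID, which is how ovn-controller normally sets it, and that `ovn-sbctl --uuid lflow-list` prints that prefix as `uuid=0x<8 hex digits>`. The helper names are hypothetical.

```shell
# cookie_to_uuid_prefix COOKIE
# Normalizes a cookie such as 0x1a2b3c to the 8-digit, zero-padded,
# lower-case form used in the logical flow listing.
cookie_to_uuid_prefix() (
    hex=${1#0x}
    printf '%08x\n' "$((0x$hex))"
)

# find_lflow COOKIE
# Prints the logical flow(s) whose UUID prefix matches the cookie.
find_lflow() {
    ovn-sbctl --uuid lflow-list | grep "uuid=0x$(cookie_to_uuid_prefix "$1")"
}
```

The matching line names the logical pipeline stage in parentheses (for example `ls_in_acl`), which is the value to compare against the analysis matrix.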

6. Sniff the Tunnels (For Inter-Host Only)

If the VMs are on different hosts and Step 4 reported output to a tunnel port, verify that packets are physically crossing the wire.

Command (Executed on Source & Destination Compute Nodes):

  • Success: Packets leave Source Node and arrive at Destination Node.

  • Failure: Packets leave Source Node but do not arrive. A physical firewall or switch ACL is blocking UDP 6081.
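
A minimal sketch of the tunnel capture, assuming the name of the physical interface that carries tunnel traffic and the tunnel IP of the peer node are known. Both are environment-specific, and the helper names are illustrative.

```shell
# geneve_filter PEER_TUNNEL_IP
# Matches Geneve traffic (UDP 6081) exchanged with the peer compute node.
geneve_filter() {
    printf 'udp port 6081 and host %s\n' "$1"
}

# sniff_geneve PHYS_NIC PEER_TUNNEL_IP [COUNT]
# Captures encapsulated packets on the physical interface.
sniff_geneve() {
    tcpdump -nn -i "$1" -c "${3:-20}" "$(geneve_filter "$2")"
}
```

Run `sniff_geneve` on both nodes at the same time, each pointing at the other node's tunnel IP, and compare what leaves the source with what arrives at the destination.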

7. Capture at the Destination (The Tap)

Goal: Prove the OpenStack network successfully delivered the packet to the Destination VM's doorstep.

Command (Executed on Destination Compute Node):

  • Success: Packets are seen arriving at the tap. If pings still fail, OpenStack networking is delivering the traffic correctly; the Destination VM's Guest OS firewall (iptables/Windows Defender) is dropping it.

  • Failure: Traces passed and tunnels look good, but packets don't hit the destination tap. Proceed to Step 8.
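
At the destination tap it helps to watch for echo replies as well as echo requests, since the combination narrows down where the loss occurs. A minimal sketch under that assumption; the helper names and the wording of the verdicts are illustrative.

```shell
# echo_filter SRC_IP DST_IP
# Matches only ICMP echo requests and replies between the two VMs.
echo_filter() {
    printf 'host %s and host %s and (icmp[icmptype] == icmp-echo or icmp[icmptype] == icmp-echoreply)\n' "$1" "$2"
}

# diagnose_dest_tap REQUEST_COUNT REPLY_COUNT
# Interprets how many requests and replies were seen on the destination tap.
diagnose_dest_tap() (
    requests=$1
    replies=$2
    if [ "$requests" -eq 0 ]; then
        echo "not delivered: continue with Step 8"
    elif [ "$replies" -eq 0 ]; then
        echo "delivered, no reply: check the guest OS firewall"
    else
        echo "delivered and answered: inspect the return path"
    fi
)
```

The filter can be passed to `tcpdump -nn -e -i <DEST_TAP>`; count the "echo request" and "echo reply" lines in the capture and feed the two numbers to `diagnose_dest_tap`.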

8. Clear Stale FDB Entries (Ghost Traffic)

Goal: Rule out a stale Forwarding Database (FDB) entry. If either the Source or Destination VM was recently migrated, OVN may be tunneling traffic to the wrong compute node. Because traffic is bidirectional, a stale entry for either MAC will break communication (forward or return path).

Command (Executed from Management Plane):

  • Logs to Check: On both compute nodes, run tail -f /var/log/ovn/ovn-controller.log | grep pinctrl. This tracks the MAC learning process. If pinctrl is not seeing the Gratuitous ARP (GARP) from the VM, the FDB will not update.
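
A minimal sketch for inspecting and removing an FDB row, assuming the Southbound DB in this OVN version has an `FDB` table and stores MAC addresses as lower-case, colon-separated strings. The helper names are hypothetical; check the row contents before deleting anything.

```shell
# normalize_mac MAC
# Lower-cases the MAC and converts dashes to colons.
normalize_mac() {
    printf '%s\n' "$1" | tr 'ABCDEF' 'abcdef' | tr '-' ':'
}

# find_fdb MAC
# Lists FDB rows for the MAC, including the datapath and port keys they point to.
find_fdb() {
    ovn-sbctl --columns=_uuid,mac,dp_key,port_key find FDB mac="\"$(normalize_mac "$1")\""
}

# delete_fdb ROW_UUID
# Removes one FDB row so the entry can be relearned.
delete_fdb() {
    ovn-sbctl destroy FDB "$1"
}
```

Check both the source and the destination MAC, since a stale entry for either one breaks one direction of the flow.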

9. The Physical Killers (MTU & Offloads)

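If traces pass and the tunnel shows traffic, but pings/SSH still fail, the encapsulated packets are likely being dropped or corrupted on the physical path.

Command (Executed on Compute Nodes):

  • Analysis: If the physical NIC MTU is 1500 and the VM network MTU has not been lowered to leave room for the Geneve overhead (~58 bytes), full-size encapsulated packets exceed the physical MTU and are dropped silently during inter-host transit.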

  • Logs to Check: Run dmesg -T | grep -i eth or check /var/log/syslog for hardware-level drops or NIC driver errors. Check /var/log/openvswitch/ovs-vswitchd.log for warnings about oversized packets or fragmentation.
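
A minimal sketch of the MTU arithmetic and the related checks, assuming the ~58-byte Geneve overhead quoted in this guide and IPv4 ping (20-byte IP header plus 8-byte ICMP header). The helper names are illustrative; `ip` and `ethtool` are standard Linux tools.

```shell
# Approximate Geneve overhead in bytes (assumption taken from this guide).
GENEVE_OVERHEAD=58

# min_physical_mtu VM_MTU
# Smallest physical MTU that carries a full-size VM packet after encapsulation.
min_physical_mtu() {
    echo $(( $1 + GENEVE_OVERHEAD ))
}

# ping_payload_for_mtu MTU
# Largest ICMP payload that fits in one unfragmented IPv4 packet of that MTU.
ping_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# show_mtu NIC
# Prints the link line for the NIC, which includes its MTU.
show_mtu() {
    ip -o link show dev "$1"
}

# show_offloads NIC
# Prints the offload settings most relevant to UDP tunnels.
show_offloads() {
    ethtool -k "$1" | grep -E 'tx-udp_tnl|generic-receive-offload|tcp-segmentation-offload'
}
```

For a VM network MTU of 1442, `ping -M do -s "$(ping_payload_for_mtu 1442)" <DEST_IP>` from inside a Linux source VM sends the largest packet that should pass without fragmentation; if that fails while small pings succeed, the physical MTU is the likely cause.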

Most common causes

  • Missing Security Group Rules: The destination VM's Security Group does not explicitly allow the inbound protocol (Ingress).

  • Guest OS Firewall: iptables, ufw, or Windows Defender inside the destination VM is dropping the traffic even though the network delivered it.

  • Stale FDB (Ghost Traffic): OVN is sending traffic to the wrong node after a VM migration. This can cause the initial request to drop (destination VM migrated) or the return reply to drop (source VM migrated).

  • Physical MTU Mismatch: The physical network does not account for Geneve encapsulation overhead (~58 bytes), causing packets to drop silently during inter-host transit.

  • Hardware Offload Corruption: Physical NIC hardware offloading features are corrupting the Geneve tunnel headers.
