Troubleshoot VM-to-VM (Different Networks) connectivity failures

Problem

This guide provides instructions for troubleshooting network connectivity failures between two Virtual Machines (VMs) residing on different logical networks or subnets. Traffic must cross a Neutron Logical Router.

Environment

  • Private Cloud Director Virtualization - v2025.4 and Higher

  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher

  • Component - Networking Service

Deep Dive: Architecture & Packet Flow

To troubleshoot OVN effectively, you must understand the distinction between the "Brain," the "Muscle," and the "Wire," as well as exactly how packets traverse them.

  • OVN (The Brain): Runs in the Management Plane. It translates your intent (Logical Routers, Switches, Security Groups) into raw instructions.

    • Northbound DB (ovn-ovsdb-nb-0): Stores intent (Routers, Ports, ACLs).

    • Southbound DB (ovn-ovsdb-sb-0): Stores reality (Chassis bindings, MAC/ARP Learning).

  • OVS (The Muscle): Runs on the compute node (ovs-vswitchd). It executes the actual forwarding of packets based on "Flow Rules."

  • Geneve (The Wire): The UDP tunnel (Port 6081) that encapsulates VM packets for cross-host transport.

How the Packet Flows (Routed / DVR)

Unlike Layer 2 traffic, routed traffic relies on Distributed Virtual Routing (DVR) in OVN. The routing decision and MAC address rewrite happen locally on the source compute node's virtual switch.

The Packet Flow:

Source VM → Source Tap → br-int (Source Node) → OVN Logical Router (Distributed) → Geneve Encapsulation → Physical NIC → Physical Network (UDP 6081) → Physical NIC → Geneve Decapsulation → br-int (Dest Node) → Dest Tap → Dest VM

Prerequisites: Executing OVN Commands

Depending on your deployment model, your access to the OVN databases differs. Please refer to the correct execution method for your environment.

For SaaS Environment

To run ovn-* commands, you must execute them from the onboarded Compute Nodes. Create an environment file ovs-alias.rc to route commands directly to the central databases:
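The exact contents of ovs-alias.rc are deployment-specific; a minimal sketch, assuming TLS-protected Northbound (6641) and Southbound (6642) databases on the management plane — the address and certificate paths below are placeholders, not documented values:

```shell
# ovs-alias.rc — hypothetical example; replace the address and certificate
# paths with the values published for your deployment.
OVN_CENTRAL="<management-plane-address>"

# Point ovn-nbctl / ovn-sbctl at the central NB (6641) and SB (6642) databases.
alias ovn-nbctl="ovn-nbctl --db=ssl:${OVN_CENTRAL}:6641 \
  -p /etc/pf9/ovn/ovn-key.pem -c /etc/pf9/ovn/ovn-cert.pem -C /etc/pf9/ovn/ovn-cacert.pem"
alias ovn-sbctl="ovn-sbctl --db=ssl:${OVN_CENTRAL}:6642 \
  -p /etc/pf9/ovn/ovn-key.pem -c /etc/pf9/ovn/ovn-cert.pem -C /etc/pf9/ovn/ovn-cacert.pem"
```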

Source the rc file and start using the OVN commands natively on the compute node:
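For example (alias file name and location assumed from the step above):

```shell
source ~/ovs-alias.rc

ovn-nbctl show   # logical switches/routers from the central Northbound DB
ovn-sbctl show   # chassis (hypervisor) list from the central Southbound DB
```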

For Self-Hosted Environment

Self-Hosted users have direct access to the Management Cluster and can execute commands from inside the OVN Southbound Pod.
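A sketch of the pod access path, assuming the Southbound pod name from the architecture section and a placeholder namespace:

```shell
# Namespace is a placeholder; adjust to your management cluster layout.
kubectl exec -it -n <ovn-namespace> ovn-ovsdb-sb-0 -- bash

# Inside the pod, the ovn-* tools talk to the local databases directly:
ovn-sbctl show
ovn-nbctl show
```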

Procedure

1. Variable Discovery & Pre-Flight Checks

Gather the required IDs and verify logical health from the Management Plane before logging into the hypervisors.

Command (Executed from Management Plane):
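The original command list is not reproduced here; a plausible sketch using the standard OpenStack and OVN CLIs (VM, router, and port identifiers are placeholders):

```shell
# Port IDs and fixed IPs of both VMs
openstack port list --server <source-vm>
openstack port list --server <destination-vm>

# Confirm both subnets are attached to the logical router,
# and that the router has a route to the destination subnet
openstack router show <router-name>
ovn-nbctl lr-list
ovn-nbctl lrp-list <ovn-router-name>
ovn-nbctl lr-route-list <ovn-router-name>

# Verify each logical port exists and is up in the Northbound DB
ovn-nbctl find Logical_Switch_Port name=<port-uuid>

# Check which chassis (hypervisor) claims each port in the Southbound DB;
# an empty "chassis" field means the port is unbound
ovn-sbctl find Port_Binding logical_port=<port-uuid>
```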

  • Analysis: If the chassis lookup returns an empty chassis, the VM port is not bound to any hypervisor.

  • Logs to Check: Check /var/log/pf9/ovn/ovn-controller.log on the expected hypervisor. Look for messages such as claim failed or unrecognized port, which indicate the node is refusing to bind the VM port.

2. Identify the Tap Interfaces (Data Plane)

Log into the specific Compute Nodes hosting the VMs to find their exact interface names.

Command (Executed on Source Compute Node):
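A common approach, assuming the Neutron convention that the tap name is tap plus the first 11 characters of the port UUID:

```shell
SRC_PORT_ID="<source-port-uuid>"   # from the discovery step
SRC_TAP="tap${SRC_PORT_ID:0:11}"   # Neutron tap naming convention

ip link show "$SRC_TAP"
ovs-vsctl list-ports br-int | grep "${SRC_PORT_ID:0:11}"
```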

Command (Executed on Destination Compute Node):
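The same convention applies on the destination node:

```shell
DST_PORT_ID="<destination-port-uuid>"
DST_TAP="tap${DST_PORT_ID:0:11}"

ip link show "$DST_TAP"
```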

3. Logical Simulation (Intent Trace)

Goal: Verify if the OVN Brain allows the traffic and successfully routes it to the destination subnet.
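A representative ovn-trace invocation (all values are placeholders). Note that eth.dst must be the router's gateway MAC, not the destination VM MAC, because the packet must first reach the router:

```shell
# Fetch the gateway MAC of the source subnet's router port:
ovn-nbctl get Logical_Router_Port <router-port-name> mac

# Simulate a packet from the source VM toward the remote subnet:
ovn-trace --summary <source-logical-switch> '
  inport == "<source-port-uuid>" &&
  eth.src == <source-vm-mac> && eth.dst == <gateway-mac> &&
  ip4.src == <source-vm-ip> && ip4.dst == <destination-vm-ip> &&
  ip.ttl == 64'
```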

  • Success: Output shows lr_in_ip_routing (Routing succeeds), lr_in_arp_resolve (MAC rewrite logic), and a final output action.

  • Failure (Drop / Flooded): If dropped, check Static Routes on the Logical Router, or verify both subnets are actually attached to the router. If flooded to _MC_unknown, ensure you used the Gateway MAC for eth.dst, not the Destination VM MAC.

4. Verify Router ARP (MAC Binding)

Goal: If the logical trace fails during routing, ensure the router successfully resolved the MAC address of the destination VM.
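A plausible lookup against the Southbound MAC_Binding table (string values in find expressions need embedded double quotes):

```shell
# Router-learned ARP entries for the destination IP:
ovn-sbctl find MAC_Binding ip='"<destination-vm-ip>"'

# Or dump the whole table and filter:
ovn-sbctl list MAC_Binding | grep -B2 <destination-vm-ip>
```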

  • Success: Shows the MAC address of the destination VM.

  • Failure: Empty. The destination VM is down, its Guest OS is filtering ARP requests, or DHCP/Metadata agents are failing.

  • Logs to Check: Check /var/log/ovn/ovn-controller.log on the destination node and grep pinctrl. This thread handles the generation and processing of ARP requests.

5. Capture at the Source (The Tap)

Goal: Prove the packet is actually leaving the Source VM and destined for its Default Gateway.

Command (Executed on Source Compute Node):
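A capture sketch (tap name from Step 2; the -e flag prints Ethernet headers so you can confirm the frame is addressed to the gateway MAC):

```shell
tcpdump -enni "$SRC_TAP" "icmp and host <destination-vm-ip>"
```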

  • Success: Packets seen. Proceed to Step 6.

  • Failure: No packets seen. Check the Guest OS routing table (e.g., default gateway is missing or incorrect inside the VM).

6. Physical Datapath Trace (ofproto/trace)

Goal: Ask OVS why it is dropping the routed packet based on actual flow rules.

Command (Executed on Source Compute Node):
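A representative trace, reusing the addresses from Step 3 (in_port takes the OVS port name of the source tap):

```shell
ovs-appctl ofproto/trace br-int \
  "in_port=${SRC_TAP},icmp,dl_src=<source-vm-mac>,dl_dst=<gateway-mac>,nw_src=<source-vm-ip>,nw_dst=<destination-vm-ip>,nw_ttl=64"
```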

  • Success: Datapath actions show Action: set_tunnel:0x<VNI>, output:<TUNNEL_PORT> (Routed and sent to the Geneve tunnel).

  • Failure (Drop): Action shows drop. Note the cookie=0x... value in the trace output and proceed to Step 7.

  • Logs to Check: If OVS actions don't match OVN intent, check /var/log/pf9/ovn/ovn-controller.log for flow programming errors (ofctrl_put errors).

7. Translate the Drop (Cookie to Logical Flow)

Goal: Translate the physical OVS drop back into an OVN logical rule to pinpoint the exact OpenStack misconfiguration.
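One way to map the cookie back, assuming an ovn-sbctl recent enough to support --uuid (the OpenFlow cookie is the first 32 bits of the logical flow's UUID):

```shell
COOKIE=0x<value-from-trace>

# Print logical flows with their UUIDs and match on the cookie prefix:
ovn-sbctl --uuid lflow-list | grep "${COOKIE#0x}"
```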

The output will reveal the exact logical table where the packet was killed. Use this matrix to identify the root cause:

  • ls_out_acl / ls_in_acl (Security Group Drop): A Neutron Security Group is explicitly denying the traffic. Check Egress rules on the source and Ingress rules on the destination.

  • lr_in_ip_routing (Routing Drop): The Logical Router has no route to the destination IP. Verify that both subnets are properly attached to the router.

  • lr_in_arp_resolve (ARP Drop): The router knows the route, but cannot resolve the MAC address of the destination VM. Check if the destination VM is powered off, or if its Guest OS is dropping ARP requests.

  • ls_in_port_sec_l2 / ls_in_port_sec_ip (Anti-Spoofing Drop): The Source VM is trying to transmit traffic using a MAC or IP address that does not legitimately belong to its port.

8. Sniff the Tunnels (Inter-Host)

Goal: Verify routed packets are physically crossing the wire.

Command (Executed on Source & Destination Compute Nodes):
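A capture sketch for the tunnel traffic (physical interface name and remote hypervisor IP are placeholders):

```shell
# On the source node: confirm encapsulated packets leave toward the peer.
tcpdump -nni <physical-nic> "udp port 6081 and host <remote-hypervisor-ip>"

# Repeat on the destination node to confirm they arrive.
```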

  • Success: Packets leave Source Node and arrive at Destination Node.

  • Failure: Packets leave Source Node but do not arrive. A physical firewall or switch ACL is blocking UDP 6081.

9. Capture at the Destination (The Tap)

Goal: Prove the OpenStack network successfully routed and delivered the packet to the Destination VM.
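A capture sketch on the destination tap (name from Step 2):

```shell
tcpdump -enni "$DST_TAP" "icmp and host <source-vm-ip>"
```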

  • Success: Packets arriving. If pings still fail, the Destination VM's Guest OS firewall is dropping traffic from the remote subnet.

  • Failure: Traces passed and tunnels look good, but packets don't hit the dest tap. Proceed to Step 10.

10. Clear Stale FDB Entries (Ghost Traffic)

Goal: If either the Source or Destination VM was recently migrated, OVN may be tunneling traffic to the wrong compute node based on a stale Forwarding Database (FDB) entry. Because traffic is bidirectional, a stale entry for either MAC will break communication.
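A plausible cleanup, assuming the Southbound FDB and MAC_Binding table names of recent OVN releases (entry UUIDs come from the find output; OVN re-learns the correct location after deletion):

```shell
# Find stale learned entries for either VM's MAC:
ovn-sbctl find FDB mac='"<vm-mac>"'
ovn-sbctl find MAC_Binding mac='"<vm-mac>"'

# Remove a stale entry by UUID:
ovn-sbctl destroy FDB <entry-uuid>
ovn-sbctl destroy MAC_Binding <entry-uuid>
```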

  • Logs to Check: On the compute nodes, tail -f /var/log/ovn/ovn-controller.log | grep pinctrl to monitor MAC learning updates.

11. The Physical Killers (MTU & Offloads)

If traces pass, tunnels show traffic, and the FDB is correct, but packets still drop, the Geneve encapsulation is failing physically.

Command (Executed on Compute Nodes):
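A diagnostic sketch (interface names are placeholders). Geneve adds roughly 58 bytes of overhead, so the underlay MTU must exceed the tenant-network MTU by at least that much:

```shell
# Compare tenant MTU vs. physical NIC MTU:
ip link show <physical-nic> | grep -o 'mtu [0-9]*'

# Probe the underlay path MTU between hypervisors (DF bit set;
# 1472 bytes of payload + 28 bytes of ICMP/IP headers = 1500 bytes):
ping -M do -s 1472 -c 3 <remote-hypervisor-ip>

# Inspect tunnel-related offloads; buggy NIC offloads can corrupt
# encapsulated packets. Disabling them is a diagnostic step only:
ethtool -k <physical-nic> | grep -E 'tnl|gso|gro'
ethtool -K <physical-nic> tx-udp_tnl-segmentation off \
                          tx-udp_tnl-csum-segmentation off
```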

Logs to Check:

  • Run dmesg -T | grep -i eth or check /var/log/syslog for hardware-level drops or NIC driver errors.

  • Check /var/log/openvswitch/ovs-vswitchd.log for warnings about oversized or fragmented packets.

Most common causes

  • Missing Router Interface: The destination subnet was never attached to the logical router.

  • Asymmetric Security Groups: The source VM can egress, but the destination VM's Security Group drops the ingress traffic from the remote subnet.

  • Unresolved ARP: The Logical Router does not have a mac_binding for the destination VM, dropping the packet during the routing phase.

  • Stale FDB (Ghost Traffic): OVN is sending traffic to the wrong node after a VM migration. This can cause the initial request to drop (Dest migration) or the return reply to drop (Source migration).

  • Physical MTU Mismatch: Routing is successful, but the encapsulated packet is too large for the physical switch ports connecting the hypervisors.

  • Guest OS Routing/Firewall: The internal OS firewall is dropping traffic because it originates from an "untrusted" or remote subnet.
