Troubleshoot Outside to VM (North-South) Connectivity

Problem

This guide provides instructions for troubleshooting network connectivity failures between a Virtual Machine (VM) and the External Network (Internet or Datacenter). This includes outbound traffic (SNAT) and inbound access via Floating IPs (DNAT). Traffic must traverse a Neutron Logical Router and a designated Gateway Chassis.

Environment

  • Private Cloud Director Virtualization - v2025.4 and Higher

  • Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher

  • Component - Networking Service

Deep Dive: Architecture & Packet Flow

North-South traffic differs from East-West because it is not fully distributed. While internal routing happens on the compute node, the transition to the physical external network is pinned to a specific Gateway Node.

  • Chassisredirect (The Border): A special OVN port type (type=chassisredirect) that centralizes the external gateway logic on a specific physical host to manage NAT and external ARP.

  • The Gateway Node: The physical server (e.g., punhv0059) where the Logical Router's external leg is physically bound.

  • NAT (SNAT/DNAT): The process of translating the VM’s internal IP (192.xx.xx.xx) to the Floating IP (10.xx.xx.xx).

How the Packet Flows

Source VM → Source Tap → br-int (Compute Node) → Logical Router (Distributed Leg) → Geneve Tunnel (UDP 6081) → br-int (Gateway Node) → chassisredirect Port → NAT Engine (SNAT Applied) → br-phy1 (External Bridge) → Physical NIC → Datacenter Switch

Prerequisites: Executing OVN Commands

Refer to the existing guide for SaaS Alias setup or Self-Hosted kubectl exec instructions to run ovn-sbctl and ovn-nbctl commands.

Procedure

1. Variable Discovery & Gateway Identification

Gather the required IDs and locate the physical "Exit Door" for the traffic (the Gateway Node).

  • Analysis:

    • If 1.3 is empty: The Network is not attached to a router.

    • If 1.5 is empty: The Router has no external gateway. NAT is impossible.

    • If 1.7 is empty: The OVN database hasn't bound the router to a host.

  • Logs to Check (If 1.7 fails): On the expected Gateway Node, check /var/log/ovn/ovn-controller.log and grep for cr-lrp-<GW_PORT_ID>. Look for claim failed or unrecognized port errors indicating the node is refusing to host the gateway.
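The discovery substeps referenced above (1.3, 1.5, 1.7) can be approximated with the following hedged sketch; the `<ROUTER>` and `<GW_PORT_ID>` placeholders are assumptions to be filled in from your deployment:

```shell
# 1.3-equivalent: confirm the VM's network is attached to a Logical Router
# (look for an lrp-* router port on the VM's logical switch)
ovn-nbctl show

# 1.5-equivalent: confirm the router has an external gateway and NAT entries
# (should list snat and/or dnat_and_snat rows)
ovn-nbctl lr-nat-list <ROUTER>

# 1.7-equivalent: confirm the chassisredirect port is bound to a chassis
ovn-sbctl find Port_Binding logical_port=cr-lrp-<GW_PORT_ID>
ovn-sbctl show | grep -A2 cr-lrp
```

The chassis name returned by the last command identifies the Gateway Node used throughout the rest of this procedure.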

2. Identify the Data Plane Interfaces

Log into the Gateway Node to verify the physical bridge mapping.

  • Success: Output shows your provider net mapped to a bridge (e.g., pun-lab:br-phy1), and that bridge contains a physical port (e.g., bond0).

  • Failure: If mappings are empty, or br-phy1 lacks a physical interface, traffic hits a dead end inside the server and cannot reach the datacenter switch.

  • Logs to Check: tail -f /var/log/openvswitch/ovs-vswitchd.log on the Gateway Node. Look for errors related to adding physical ports or bridge initialization failures.
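A hedged sketch of the verification commands, run on the Gateway Node (the bridge and NIC names br-phy1, bond0, and the pun-lab network are examples from this guide, not universal defaults):

```shell
# Provider network -> bridge mapping on the Gateway Node
# Expected output similar to: "pun-lab:br-phy1"
ovs-vsctl get Open_vSwitch . external_ids:ovn-bridge-mappings

# Confirm the mapped bridge actually contains a physical uplink (e.g. bond0)
ovs-vsctl list-ports br-phy1
```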

3. Logical Simulation (North-South Trace)

Goal: Verify if the OVN "Brain" processes the packet. Because a router has multiple ports, we must pick the Internal one for an outbound trace.

The trace output reveals the exact logical table where the packet was dropped. Use this matrix:

  • ls_out_acl / ls_in_acl (Security Group Drop): A Neutron Security Group is explicitly denying the traffic.

  • lr_in_ip_routing (Routing Drop): The Logical Router has no route to the destination. Verify the External Gateway is set.

  • lr_in_arp_resolve (ARP Drop): The router cannot resolve the MAC of the next-hop (physical switch). Check upstream ARP.

  • ls_in_port_sec (Anti-Spoofing Drop): The VM is trying to use a MAC/IP that does not belong to its port.

  • SUCCESS: Trace shows ct_snat(FLOATING_IP) and ends with output to localnet.

  • Failure (Drop): Trace ends with drop. The logical configuration is blocking the packet (e.g., missing route, Security Group). Proceed to Step 6.
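An outbound simulation can be run with ovn-trace as sketched below; every MAC, IP, and port identifier here is a placeholder to be replaced with values discovered in Step 1, and 8.8.8.8 stands in for any external destination:

```shell
# Simulate an outbound packet from the VM's logical switch port
# toward an external IP. Placeholders: <LOGICAL_SWITCH>, <VM_PORT_UUID>,
# <VM_MAC>, <ROUTER_MAC>, <VM_IP>.
ovn-trace <LOGICAL_SWITCH> '
    inport == "<VM_PORT_UUID>" &&
    eth.src == <VM_MAC> && eth.dst == <ROUTER_MAC> &&
    ip4.src == <VM_IP> && ip4.dst == 8.8.8.8 &&
    ip.ttl == 64'
```

A successful trace should pass through the router pipeline, show ct_snat(FLOATING_IP), and end with output to the localnet port.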

4. Physical Validation (Gateway Node)

Log into the Gateway Node to prove the packet hits the wire.

Note: The Gateway Node can be identified using Step 1.

  • Success: Packets arrive via tunnel (4.1) and exit br-phy1 with the translated Floating IP (4.2). NAT is working perfectly.

  • Failure (No Arrival): 4.1 is empty. Packets are dropping on the Compute Node (Check SG egress or MTU).

  • Failure (No Exit): 4.1 shows traffic, but 4.2 is empty. OVS is dropping the packet internally on the Gateway. Proceed to Step 5.
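The two capture points can be checked with tcpdump as sketched below; the tunnel NIC and bond0 are assumed interface names for this environment:

```shell
# 4.1: confirm Geneve-encapsulated traffic arrives on the Gateway Node
# via the tunnel underlay NIC
tcpdump -ni <TUNNEL_NIC> udp port 6081

# 4.2: confirm the translated packet exits the physical uplink
# carrying the Floating IP as its source address
tcpdump -ni bond0 host <FLOATING_IP>
```

Run both captures while generating test traffic (e.g., a ping from the VM to an external address) so the two points can be compared for the same flow.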

5. Advanced Physical Trace (ofproto/trace)

If Phase 4.2 is empty, ask OVS why it is dropping the packet internally on the Gateway Node. To run a physical ofproto/trace, we need the Geneve tunnel metadata and the correct input port.

  • Success: Datapath actions show set_tunnel or output:<ID> to a patch port leading to br-phy1.

  • Failure (Drop): Action is drop. Note the cookie=0x... value in the trace output and proceed to Step 6.

  • Failure (Flooded): Action outputs to local tap interfaces or other tunnels. OVN is confused about where the physical exit is (usually a bridge mapping mismatch).

  • Logs to Check: If OVS actions don't match OVN intent, check /var/log/pf9/ovn/ovn-controller.log on the Gateway Node for flow programming errors (ofctrl_put errors).
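A hedged ofproto/trace sketch; the ofport number, tunnel key, and addresses are placeholders that must be gathered first (e.g., the Geneve port's ofport via ovs-vsctl, and the datapath tunnel key from the Southbound DB). Depending on the OVN version, additional tun_metadata fields may also be required:

```shell
# Find the OpenFlow port number of the Geneve tunnel interface
ovs-vsctl --columns=ofport list Interface <GENEVE_PORT_NAME>

# Replay the packet as it arrives from the Compute Node's tunnel
ovs-appctl ofproto/trace br-int \
    "in_port=<GENEVE_OFPORT>,tun_id=<DATAPATH_KEY>,tun_src=<COMPUTE_TUNNEL_IP>,\
ip,dl_src=<VM_MAC>,dl_dst=<ROUTER_MAC>,nw_src=<VM_IP>,nw_dst=8.8.8.8"
```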

6. Map the Drop Cookie to the Logical Rule

If the trace in Step 5 dropped the packet, map the relevant cookie back to the logical rule.

  • ls_out_acl / ls_in_acl: A Security Group is explicitly denying the traffic.

  • lr_in_ip_routing: The Router has no route to the destination. Verify the External Gateway is set.

  • lr_in_arp_resolve: The Gateway cannot resolve the MAC of the physical switch. Check upstream ARP.

  • ls_in_port_sec: The VM is spoofing its MAC/IP.
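The OpenFlow cookie is derived from the first 32 bits of the originating Logical_Flow UUID in the OVN Southbound DB, so the cookie from the trace can be grepped back to its stage as sketched here (the cookie value is a placeholder from your Step 5 output):

```shell
COOKIE=0x<VALUE_FROM_TRACE>

# Confirm which physical flows carry this cookie
ovs-ofctl dump-flows br-int | grep "cookie=${COOKIE}"

# Match the cookie against the Logical_Flow UUID prefix; the surrounding
# fields show the pipeline and stage (e.g. ls_out_acl, lr_in_ip_routing)
ovn-sbctl list Logical_Flow | grep -A6 "_uuid.*${COOKIE#0x}"
```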

7. Clear Stale FDB Entries (Ghost Traffic)

If the Gateway Node was recently migrated or the Chassis binding changed, traffic may be sent to a "ghost" location.

  • Success: Traffic immediately resumes once the stale entry is cleared.

  • Logs to Check: On the Gateway Node, tail -f /var/log/pf9/ovn/ovn-controller.log | grep pinctrl. This tracks the ARP learning and FDB updates natively from the physical wire.
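Inspecting and flushing the learned MAC table on the external bridge can be sketched as follows (br-phy1 is the example bridge name used in this guide):

```shell
# Inspect learned MAC -> port entries on the external bridge
ovs-appctl fdb/show br-phy1

# Clear stale entries after a gateway failover or chassis migration
ovs-appctl fdb/flush br-phy1
```

Flushing is non-disruptive beyond a brief relearning period; the table repopulates from live traffic.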

8. Physical Killers (MTU & Offloads)

If traces pass but traffic fails, the physical NIC may be corrupting large Geneve-encapsulated packets.

  • Analysis: If MTU is 1500 on the physical NIC, Geneve packets (which add 58 bytes of overhead) will be fragmented or dropped by the physical switch.

  • Logs to Check:

    • Run dmesg -T | grep -i eth or check /var/log/syslog for hardware-level drops, CRC errors, or driver crashes related to the physical NIC (bond0).

    • Check /var/log/openvswitch/ovs-vswitchd.log for unreasonably large packet or fragmentation warnings.
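The MTU arithmetic above can be made concrete: with the 58 bytes of Geneve overhead cited in this guide and a 1500-byte physical MTU, the VM-facing network MTU must be at most 1442. The verification commands in the comments are a hedged sketch (bond0 is an assumed NIC name):

```shell
# 58 bytes = outer Ethernet + IPv4 + UDP + Geneve header and options
PHYS_MTU=1500
GENEVE_OVERHEAD=58
echo $((PHYS_MTU - GENEVE_OVERHEAD))

# Hedged verification on a live node:
#   ip link show bond0                  # confirm the physical MTU
#   ping -M do -s 1414 <EXTERNAL_IP>    # 1442 minus 28 bytes of IP+ICMP headers
```

If the echo result exceeds your tenant network MTU setting, large packets will be silently dropped while pings and small requests succeed, which matches the classic "SSH works, downloads hang" symptom.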

Most Common Causes

  • Missing External Gateway: The Logical Router was created but never assigned an external gateway port. NAT cannot occur without an "Exit Door."

  • Asymmetric Routing (Gateway Migration): The Gateway Chassis migrated to a new node, but the physical datacenter switch is still sending return traffic to the old node's MAC address because it missed the Gratuitous ARP (GARP).

  • SNAT/DNAT Rule Mismatch: The VM has a Floating IP, but the Logical Router’s NAT table is missing the corresponding entry. OVN will route the packet but will not translate the source IP, causing the physical firewall to drop it as "spoofed."

  • Stale FDB (Ghost Traffic): OVN is still tunneling traffic to the previous Gateway Node after a failover event. This "blackholes" all external traffic until the FDB entry is cleared or times out.

  • Provider Network Bridge Mapping: The physical bridge (e.g., br-phy1) on the Gateway Node is not mapped to the correct physical_network name in OVS external_ids.

  • Physical MTU Mismatch: North-South traffic often fails for Large Packets (HTTP/Downloads) because the Geneve overhead (58 bytes) makes the packet exceed the 1500 MTU of the physical datacenter switches.

  • Upstream MAC Filtering: The physical switch port connected to the Gateway Node is configured with "Port Security" or a "MAC Limit" that prevents it from learning the virtual MAC addresses of the Floating IPs.
