# Troubleshoot Outside to VM (North-South) Connectivity

## Problem

This guide provides instructions for troubleshooting network connectivity failures between a Virtual Machine (VM) and the External Network (Internet or Datacenter). This includes outbound traffic (SNAT) and inbound access via Floating IPs (DNAT). Traffic must traverse a Neutron Logical Router and a designated Gateway Chassis.

## Environment

* Private Cloud Director Virtualization - v2025.4 and Higher
* Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
* Component - Networking Service

## Deep Dive: Architecture & Packet Flow

North-South traffic differs from East-West because it is not fully distributed. While internal routing happens on the compute node, the transition to the physical external network is pinned to a specific Gateway Node.

* Chassisredirect (The Border): A special OVN port type (`type=chassisredirect`) that centralizes the external gateway logic on a specific physical host to manage NAT and external ARP.
* The Gateway Node: The physical server (e.g., `punhv0059`) where the Logical Router's external leg is physically bound.
* NAT (SNAT/DNAT): The process of translating the VM’s internal IP (`192.xx.xx.xx`) to the Floating IP (`10.xx.xx.xx`).

### How the Packet Flows

`Source VM` → `Source Tap` → `br-int (Compute Node)` → `Logical Router (Distributed Leg)` → `Geneve Tunnel (UDP 6081)` → `br-int (Gateway Node)` → `chassisredirect Port` → `NAT Engine (SNAT Applied)` → `br-phy1 (External Bridge)` → `Physical NIC` → `Datacenter Switch`

## Prerequisites: Executing OVN Commands

Refer to the [VM-to-VM connectivity troubleshooting guide](https://platform9.com/kb/pcd-ts/networking/troubleshoot-vm-to-vm-different-networks-connectivity-failures#prerequisites-executing-ovn-commands) for the SaaS alias setup or the Self-Hosted `kubectl exec` instructions needed to run `ovn-sbctl` and `ovn-nbctl` commands.

## Procedure

### 1. Variable Discovery & Gateway Identification

Gather the required IDs and locate the physical "Exit Door" for the traffic (the Gateway Node).

```shellscript
# 1. Get the VM's Port ID 
$ openstack port list --server <VM_ID>


# 2. Get the Network ID of that Port 
$ openstack port show <VM_PORT_ID_FROM_1.1> -c network_id -f value


# 3. Get the Port ID of the Router Interface on that Network 
$ openstack port list --network <NETWORK_ID_FROM_1.2> --device-owner network:router_interface -c ID -f value
# DESIRED OUTPUT: Router Interface Port ID. If empty, Network is not attached to a router.

# 4. Find the Router ID (device_id) 
$ openstack port show <ROUTER_INTERFACE_PORT_ID_FROM_1.3> -c device_id -f value


# 5. Get the External Gateway Port ID for that Router
$ openstack port list --router <ROUTER_ID_FROM_1.4> --device-owner network:router_gateway -c ID -f value
#  If empty, Router has no external gateway.

# 6. Find the Datapath and Chassis UUID
# CORRECT SYNTAX: --columns must come before 'find'
$ ovn-sbctl --columns=datapath,chassis find port_binding logical_port="cr-lrp-<GW_PORT_ID>"
# OUTPUT: 
# datapath: 7f95b7af... (This is your <DATAPATH_UUID> for Step 3)
# chassis : 649d312b... (This is your <CHASSIS_UUID> for Step 1.7)

# 7. Identify the Physical Gateway Node Name. Look for the hostname in the output
$ ovn-sbctl list chassis <CHASSIS_UUID_FROM_1.6>
```

* Analysis:
  * If 1.3 is empty: The Network is not attached to a router.
  * If 1.5 is empty: The Router has no external gateway. NAT is impossible.
  * If the `chassis` field in 1.6 is empty (so 1.7 has nothing to look up): The OVN database hasn't bound the router to a host.
* Logs to Check (If 1.7 fails): On the expected Gateway Node, check `/var/log/ovn/ovn-controller.log` and grep for `cr-lrp-<GW_PORT_ID>`. Look for `claim failed` or `unrecognized port` errors indicating the node is refusing to host the gateway.
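
The discovery chain in steps 1.1 to 1.5 can be wrapped in a small helper. This is a minimal sketch, not part of the product: the function name `find_gw_port` is hypothetical, it assumes the `openstack` CLI is already authenticated, and it inspects only the first port of the VM and the first router interface on the network.

```shell
#!/usr/bin/env bash
# Hypothetical helper: chains steps 1.1-1.5 and prints the chassisredirect
# port name ("cr-lrp-<GW_PORT_ID>") to look up in step 1.6.
# Assumption: only the VM's first port is inspected.
find_gw_port() {
  local vm_id="$1" port net rif router gw

  # 1.1 VM port
  port="$(openstack port list --server "$vm_id" -c ID -f value | head -n1)"
  [ -n "$port" ] || { echo "no port found for VM $vm_id" >&2; return 1; }

  # 1.2 Network of that port
  net="$(openstack port show "$port" -c network_id -f value)"

  # 1.3 Router interface on that network
  rif="$(openstack port list --network "$net" \
          --device-owner network:router_interface -c ID -f value | head -n1)"
  [ -n "$rif" ] || { echo "network $net is not attached to a router" >&2; return 1; }

  # 1.4 Router ID
  router="$(openstack port show "$rif" -c device_id -f value)"

  # 1.5 External gateway port of that router
  gw="$(openstack port list --router "$router" \
         --device-owner network:router_gateway -c ID -f value | head -n1)"
  [ -n "$gw" ] || { echo "router $router has no external gateway" >&2; return 1; }

  echo "cr-lrp-$gw"
}
```

A possible use is `ovn-sbctl --columns=datapath,chassis find port_binding logical_port="$(find_gw_port <VM_ID>)"`, run from a shell where both the `openstack` and `ovn-sbctl` commands are available.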

### 2. Identify the Data Plane Interfaces

Log into the Gateway Node to verify the physical bridge mapping.

```shellscript
# 1. Verify Bridge Mappings
$ sudo ovs-vsctl get open . external_ids:ovn-bridge-mappings
# ANALYSIS: If empty or missing your provider net (e.g., pun-lab:br-phy1), OVS cannot reach the physical wire.

# 2. Confirm Physical NIC is a member of the bridge
$ sudo ovs-vsctl show
# ANALYSIS: If br-phy1 does not contain a physical port (e.g., bond0), traffic hits a dead end inside the server.
```

* Success: Output shows your provider net mapped to a bridge (e.g., `pun-lab:br-phy1`), and that bridge contains a physical port (e.g., `bond0`).
* Failure: If mappings are empty, or `br-phy1` lacks a physical interface, traffic hits a dead end inside the server and cannot reach the datacenter switch.
* Logs to Check: `tail -f /var/log/openvswitch/ovs-vswitchd.log` on the Gateway Node. Look for errors related to adding physical ports or bridge initialization failures.
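
The bridge-mapping check can also be scripted. The sketch below assumes the value printed by `ovs-vsctl get` is a double-quoted, comma-separated list of `physnet:bridge` pairs (as in the `pun-lab:br-phy1` example); the function name `bridge_for_physnet` is hypothetical.

```shell
# Hypothetical helper: given the raw ovn-bridge-mappings value and a provider
# network name, print the mapped bridge. Returns non-zero if no mapping exists.
bridge_for_physnet() {
  printf '%s\n' "$1" | tr -d '" ' | tr ',' '\n' |
    awk -F: -v net="$2" '$1 == net { print $2; found = 1 } END { exit !found }'
}
```

For example, `bridge_for_physnet "$(sudo ovs-vsctl get open . external_ids:ovn-bridge-mappings)" pun-lab` would print `br-phy1` on a correctly mapped Gateway Node and fail otherwise.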

### 3. Logical Simulation (North-South Trace)

Goal: Verify if the OVN "Brain" processes the packet. Because a router has multiple ports, we must pick the Internal one for an outbound trace.

```shellscript
# 1. List all ports for this router to find the Internal LRP. Use the <DATAPATH_UUID> obtained from 1.6
$ ovn-sbctl --columns=logical_port,mac find port_binding datapath="<DATAPATH_UUID>"
# IDENTIFY: 
# - The "Internal LRP" has the VM's gateway IP (e.g., 192.168.2.1).
# - The "External LRP" has the public IP (e.g., 10.22.178.x).

# 2. Run the Trace (Use the Internal LRP as inport)
$ ovn-trace <DATAPATH_UUID> 'inport == "<INTERNAL_LRP_NAME>" && eth.src == <VM_MAC> && eth.dst == <INTERNAL_LRP_MAC> && ip4.src == <VM_IP> && ip4.dst == 8.8.8.8 && ip.ttl == 64 && icmp'
# SUCCESS: Trace ends with "output to localnet".
# FAILURE: Trace ends with "drop".
```

The output reveals the exact logical table where the packet was dropped. Use this matrix to interpret the result:

* ls\_out\_acl / ls\_in\_acl (Security Group Drop): A Neutron Security Group is explicitly denying the traffic.
* lr\_in\_ip\_routing (Routing Drop): The Logical Router has no route to the destination. Verify the External Gateway is set.
* lr\_in\_arp\_resolve (ARP Drop): The router cannot resolve the MAC of the next-hop (physical switch). Check upstream ARP.
* ls\_in\_port\_sec (Anti-Spoofing Drop): The VM is trying to use a MAC/IP that does not belong to its port.
* Success: Trace shows `ct_snat(FLOATING_IP)` and ends with `output to localnet`.
* Failure (Drop): Trace ends with `drop`. The logical configuration is blocking the packet (e.g., missing route, Security Group). Proceed to Step 6.
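
When traces are long, the matrix above can be applied mechanically. The following is a rough sketch under stated assumptions: a passing trace mentions `localnet`, a failing one contains `drop`, and stage names follow the `ls_in_*`, `ls_out_*`, `lr_in_*`, `lr_out_*` pattern used in the matrix. The function name `classify_trace` is hypothetical, and real `ovn-trace` output may vary between OVN versions.

```shell
# Hypothetical helper: read ovn-trace output on stdin and print a verdict.
# PASS            -> the trace mentions "localnet" (exit 0)
# DROP <stage>    -> first "drop" seen, with the last stage named before it (exit 1)
# UNKNOWN         -> neither marker found (exit 2)
classify_trace() {
  awk '
    /localnet/ { ok = 1 }
    # Remember the most recent logical stage name seen so far.
    match($0, /(ls|lr)_(in|out)_[a-z0-9_]+/) { stage = substr($0, RSTART, RLENGTH) }
    /drop/ && !dropped { dropped = 1; at = stage }
    END {
      if (ok)           { print "PASS"; exit 0 }
      else if (dropped) { print "DROP " at; exit 1 }
      else              { print "UNKNOWN"; exit 2 }
    }'
}
```

Typical use would be to pipe the trace from step 3.2 into it, for example `ovn-trace <DATAPATH_UUID> '<FLOW>' | classify_trace`, and then look up the reported stage in the matrix.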

### 4. Physical Validation (Gateway Node)

Log into the Gateway Node to prove the packet hits the wire.

{% hint style="info" %}
The Gateway Node can be identified using Step 1.
{% endhint %}

```shellscript
# 1. Check arrival from VM via Tunnel
$ sudo tcpdump -ni genev_sys_6081 host <VM_INTERNAL_IP>
# ANALYSIS: If empty, the packet is being dropped on the COMPUTE node.

# 2. Check exit to physical wire (NATed)
$ sudo tcpdump -ni br-phy1 host <FLOATING_IP>
# ANALYSIS SUCCESS: You see "FLOATING_IP > 8.8.8.8". NAT is working.
# ANALYSIS FAILURE: You see "INTERNAL_IP > 8.8.8.8" (NAT failed) or nothing (OVS Drop).
```

* Success: Packets arrive via tunnel (4.1) and exit `br-phy1` with the translated Floating IP (4.2). NAT is working perfectly.
* Failure (No Arrival): 4.1 is empty. Packets are dropping on the Compute Node (Check SG egress or MTU).
* Failure (No Exit): 4.1 shows traffic, but 4.2 is empty. OVS is dropping the packet internally on the Gateway. Proceed to Step 5.

### 5. Advanced Physical Trace (ofproto/trace)

If Step 4.2 shows no traffic, ask OVS why it is dropping the packet internally on the Gateway Node. To run a physical `ofproto/trace`, you need the Geneve tunnel metadata and the correct input port.

```shellscript
# 5.1 Get the Router VNI (tun_id)
$ ovn-sbctl --columns=tunnel_key list datapath_binding <DATAPATH_UUID_FROM_1.6>
# Convert result to HEX (e.g., 88 -> 0x58). This is your <VNI_HEX>.

# 5.2 Get the Source Port Key (Internal LRP)
$ ovn-sbctl --columns=tunnel_key find port_binding logical_port="<INTERNAL_LRP_NAME_FROM_3.1>"
# This is your <SRC_KEY> (e.g., 3).

# 5.3 Get the Destination Port Key (Chassisredirect/Gateway Port)
$ ovn-sbctl --columns=tunnel_key find port_binding logical_port="cr-lrp-<GW_PORT_ID_FROM_1.5>"
# This is your <DST_KEY> (e.g., 2).

# 5.4 Find the Input ofport (Run on Gateway Node)
# First, find the hostname of the compute node where the VM is running:
$ openstack server show <VM_ID> -c "OS-EXT-SRV-ATTR:host" -f value 
# Then look up that host's Geneve tunnel endpoint IP (<COMPUTE_NODE_IP>).
# Now, find which tunnel port on the Gateway points to that IP:
$ sudo ovs-vsctl find interface type=geneve options:remote_ip="<COMPUTE_NODE_IP>" | grep ofport


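# 5.5 Metadata Construction: tun_metadata0 = (<SRC_KEY> << 16) | <DST_KEY>, written in hex.
# When <DST_KEY> is below 16 and both keys are written in hex, this is simply
# 0x<SRC_KEY>000<DST_KEY> (e.g., SRC 3 and DST 2 -> 0x30002).

# 5.6 Finally run the trace
$ sudo ovs-appctl ofproto/trace br-int in_port=<OFPORT_FROM_5.4>,tun_id=<VNI_HEX_FROM_5.1>,tun_metadata0=<TUN_METADATA_FROM_5.5>,dl_src=<VM_MAC>,dl_dst=<INTERNAL_LRP_MAC_FROM_3.1>,ip,nw_src=<VM_IP>,nw_dst=8.8.8.8,nw_proto=1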
```

* Success: Datapath actions show `set_tunnel` or `output:<ID>` to a patch port leading to `br-phy1`.
* Failure (Drop): Action is `drop`. Note the `cookie=0x...` value in the trace output and proceed to Step 6.
* Failure (Flooded): Action outputs to local `tap` interfaces or other tunnels. OVN is confused about where the physical exit is (usually a bridge mapping mismatch).
* Logs to Check: If OVS actions don't match OVN intent, check `/var/log/pf9/ovn/ovn-controller.log` on the Gateway Node for flow programming errors (`ofctrl_put` errors).
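
The hex conversions in steps 5.1 and 5.5 are easy to get wrong by hand, especially once a port key reaches 16 or more. The sketch below shows the arithmetic; the function names `vni_hex` and `tun_metadata0` are hypothetical, and the inputs are assumed to be the decimal `tunnel_key` values printed by `ovn-sbctl`. It relies on OVN's Geneve option layout, in which the ingress port key sits in the upper 16 bits and the egress port key in the lower 16 bits.

```shell
# Hypothetical helper for step 5.1: decimal datapath tunnel_key -> hex VNI.
vni_hex() {
  printf '0x%x\n' "$1"
}

# Hypothetical helper for step 5.5: pack the source (ingress) port key into
# the upper 16 bits and the destination (egress) port key into the lower 16.
tun_metadata0() {
  printf '0x%x\n' $(( ($1 << 16) | $2 ))
}

vni_hex 88          # 0x58
tun_metadata0 3 2   # 0x30002
tun_metadata0 3 18  # 0x30012 (string concatenation would give a wrong value here)
```

Computing the value with a shift rather than by pasting digits together keeps the result correct for any key size.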

### 6. Cookie Decoding (Root Cause Analysis)

If the trace in Step 5.6 dropped the packet, map the `cookie=0x...` value from the trace output back to the logical rule.

```shellscript
# Run in OVN SB Pod. --uuid prints each logical flow's UUID prefix, which OVS uses as the flow cookie.
$ ovn-sbctl --uuid lflow-list | grep <COOKIE_WITHOUT_0X>
```

* ls\_out\_acl / ls\_in\_acl: A Security Group is explicitly denying the traffic.
* lr\_in\_ip\_routing: The Router has no route to the destination. Verify the External Gateway is set.
* lr\_in\_arp\_resolve: The Gateway cannot resolve the MAC of the physical switch. Check upstream ARP.
* ls\_in\_port\_sec: The VM is spoofing its MAC/IP.
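
A cookie copied from OVS output may have its leading zeros stripped, while the logical flow UUID prefix is always 8 hex digits, so a plain `grep` can miss the match. The sketch below normalizes the cookie first; the function name `cookie_to_uuid_prefix` is hypothetical, and it assumes the cookie is at most 32 bits wide.

```shell
# Hypothetical helper: normalize an OVS cookie (with or without "0x",
# possibly missing leading zeros) to the 8-hex-digit logical flow UUID prefix.
cookie_to_uuid_prefix() {
  local c="${1#0x}"          # drop an optional 0x prefix
  printf '%08x\n' "0x$c"     # left-pad with zeros to 8 hex digits
}

cookie_to_uuid_prefix 0x1a2b3c   # 001a2b3c
```

The result can then be used as the search string, for example `ovn-sbctl --uuid lflow-list | grep "$(cookie_to_uuid_prefix <COOKIE>)"`.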

### 7. Clear Stale FDB Entries (Ghost Traffic)

If the Gateway Node was recently migrated or the Chassis binding changed, traffic may be sent to a "ghost" location.

```shellscript
# 1. Search the FDB for the Floating IP or Gateway MAC
$ ovn-sbctl list fdb | grep <FLOATING_IP_OR_MAC>

# 2. If the entry points to the wrong chassis, delete it
$ ovn-sbctl destroy fdb <FDB_UUID>
```

* Success: Traffic immediately resumes once the stale entry is cleared.
* Logs to Check: On the Gateway Node, `tail -f /var/log/pf9/ovn/ovn-controller.log | grep pinctrl`. This tracks the ARP learning and FDB updates natively from the physical wire.

### 8. Physical Killers (MTU & Offloads)

If traces pass but traffic fails, the physical NIC may be corrupting large Geneve-encapsulated packets.

```shellscript
# 1. Check MTU (External Bridge and Physical NIC must be 1558+ to support VM 1500)
$ ip link show br-phy1
$ ip link show <PHYSICAL_NIC>

# 2. Disable NIC Offloads (Run on Gateway and Compute Nodes)
$ sudo ethtool -K <PHYSICAL_NIC> tso off gso off gro off tx off rx off
```

* Analysis: If MTU is 1500 on the physical NIC, Geneve packets (which add 58 bytes of overhead) will be fragmented or dropped by the physical switch.
* Logs to Check:
  * Run `dmesg -T | grep -i eth` or check `/var/log/syslog` for hardware-level drops, CRC errors, or driver crashes related to the physical NIC (`bond0`).
  * Check `/var/log/openvswitch/ovs-vswitchd.log` for `unreasonably large packet` or `fragmentation` warnings.
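
The MTU arithmetic can be made explicit with a few helpers. This is a minimal sketch using the 58-byte Geneve overhead quoted in this guide; the function names are hypothetical, and `parse_mtu` assumes the usual `ip link show` output format in which the value follows the word `mtu`.

```shell
GENEVE_OVERHEAD=58   # bytes, as stated in this guide

# Minimum physical MTU needed to carry a VM packet of the given MTU in Geneve.
required_underlay_mtu() { echo $(( $1 + GENEVE_OVERHEAD )); }

# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
max_ping_payload() { echo $(( $1 - 28 )); }

# Extract the MTU value from `ip link show <dev>` output read on stdin.
parse_mtu() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit } }'
}

required_underlay_mtu 1500   # 1558
max_ping_payload 1500        # 1472
```

For example, `ip link show <PHYSICAL_NIC> | parse_mtu` can be compared against `required_underlay_mtu 1500`, and on Linux a ping with fragmentation prohibited (`ping -M do -s "$(max_ping_payload 1500)" 8.8.8.8`) from inside the VM shows whether full-size packets survive the path.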

## Most Common Causes

* Missing External Gateway: The Logical Router was created but never assigned an external gateway port. NAT cannot occur without an "Exit Door."
* Asymmetric Routing (Gateway Migration): The Gateway Chassis migrated to a new node, but the physical datacenter switch is still sending return traffic to the old node's MAC address because it missed the Gratuitous ARP (GARP).
* SNAT/DNAT Rule Mismatch: The VM has a Floating IP, but the Logical Router’s NAT table is missing the corresponding entry. OVN will route the packet but will not translate the source IP, causing the physical firewall to drop it as "spoofed."
* Stale FDB (Ghost Traffic): OVN is still tunneling traffic to the previous Gateway Node after a failover event. This "blackholes" all external traffic until the FDB entry is cleared or times out.
* Provider Network Bridge Mapping: The physical bridge (e.g., `br-phy1`) on the Gateway Node is not mapped to the correct `physical_network` name in OVS `external_ids`.
* Physical MTU Mismatch: North-South traffic often fails for Large Packets (HTTP/Downloads) because the Geneve overhead (58 bytes) makes the packet exceed the 1500 MTU of the physical datacenter switches.
* Upstream MAC Filtering: The physical switch port connected to the Gateway Node is configured with "Port Security" or a "MAC Limit" that prevents it from learning the virtual MAC addresses of the Floating IPs.


