# Troubleshoot VM-to-VM (Same Network) connectivity failures

## Problem

This guide provides instructions for troubleshooting network connectivity failures between two Virtual Machines (VMs) residing on the same logical network (subnet). In an OVN-backed environment, troubleshooting differs significantly depending on whether the two VMs are running on the same physical compute node or different compute nodes.

## Environment

* Private Cloud Director Virtualization - v2025.4 and Higher
* Self-Hosted Private Cloud Director Virtualization - v2025.4 and Higher
* Component - Networking Service

## Deep Dive: Architecture & Packet Flow

To troubleshoot OVN effectively, you must understand the distinction between the "Brain," the "Muscle," and the "Wire," as well as exactly how packets traverse them.

* OVN (The Brain): Runs in the Management Plane. It translates your intent (Logical Switches, Security Groups) into raw instructions.
  * *Northbound DB (`ovn-ovsdb-nb-0`):* Stores intent (Ports, ACLs).
  * *Southbound DB (`ovn-ovsdb-sb-0`):* Stores reality (Chassis bindings, MAC locations).
* OVS (The Muscle): Runs on the compute node (`ovs-vswitchd`). It executes the actual forwarding of packets based on "Flow Rules" across the integration bridge (`br-int`).
* Geneve (The Wire): The UDP tunnel (Port 6081) that encapsulates VM packets for cross-host transport.

### How the Packet Flows (Same Network)

Depending on where the VMs live, the packet takes one of two distinct paths:

* Path A: Same Host (Intra-Host)

  `Source VM` → `Source Tap` → `br-int (Source Node)` → `OVS Connection Tracking (Security Groups)` → `br-int (Source Node)` → `Destination Tap` → `Destination VM`

  *(Note: Traffic never touches the physical network.)*
* Path B: Different Hosts (Inter-Host)

  `Source VM` → `Source Tap` → `br-int (Source Node)` → `Geneve Encapsulation` → `Physical NIC` → `Physical Network (UDP 6081)` → `Physical NIC` → `Geneve Decapsulation` → `br-int (Destination Node)` → `Destination Tap` → `Destination VM`
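
Which path applies follows directly from the chassis each port is bound to. The sketch below is illustrative only: `packet_path` is a hypothetical helper name, not part of any OVN or OpenStack tooling, and it assumes you already know the two chassis names.

```shellscript
# Hypothetical helper: classify the packet path from the two chassis names.
packet_path() {
  local src_chassis="$1" dst_chassis="$2"
  if [ -z "$src_chassis" ] || [ -z "$dst_chassis" ]; then
    echo "unbound"       # at least one port has no chassis, so no path exists
  elif [ "$src_chassis" = "$dst_chassis" ]; then
    echo "intra-host"    # Path A: traffic stays on br-int
  else
    echo "inter-host"    # Path B: Geneve encapsulation over UDP 6081
  fi
}
```

For example, `packet_path compute-1 compute-2` prints `inter-host`, which tells you the tunnel and physical-network steps of the procedure are in scope.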

## Prerequisites: Executing OVN Commands

Refer to the [different-networks troubleshooting guide](https://platform9.com/kb/pcd-ts/networking/troubleshoot-vm-to-vm-different-networks-connectivity-failures#prerequisites-executing-ovn-commands) for the SaaS alias setup or the Self-Hosted `kubectl exec` instructions needed to run `ovn-sbctl` and `ovn-nbctl` commands.

## Procedure

### 1. Variable Discovery & Pre-Flight Checks

You cannot troubleshoot flows without exact IDs. Gather these from the Management Plane before tracing.

```shellscript
# 1. Get VM UUIDs
$ openstack server list --all -c ID -c Name

# 2. Get Port IDs, IPs, and MACs
$ openstack port list --server <SOURCE_VM_ID>
$ openstack port list --server <DEST_VM_ID>
$ openstack port show <SOURCE_PORT_ID> -c fixed_ips -c mac_address -c status
$ openstack port show <DEST_PORT_ID> -c fixed_ips -c mac_address -c status

# 3. Check Security Groups (Ensure ICMP/TCP is allowed on Destination)
$ openstack port show <SOURCE_PORT_ID> -c security_group_ids
$ openstack port show <DEST_PORT_ID> -c security_group_ids
$ openstack security group rule list <SG_ID>

# 4. Find which Compute Nodes (Chassis) the VMs are bound to
$ export NS=<WORKLOAD_REGION_NAMESPACE>
$ ovn-sbctl find port_binding logical_port=<SOURCE_PORT_ID> 
$ ovn-sbctl find port_binding logical_port=<DEST_PORT_ID> 
```

* Analysis: If step 4 returns an empty chassis for either VM, the port is unbound and cannot send/receive traffic.
* Logs to Check: On the expected hypervisor, check `/var/log/pf9/ovn/ovn-controller.log`. Look for `claim failed` or `unrecognized port` indicating the `ovn-controller` is refusing to bind the VM to the host.
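
The chassis check in step 4 can be scripted so an unbound port is flagged explicitly. This is a minimal sketch, assuming the default `name : value` record layout that `ovn-sbctl find` prints; `bound_chassis` is a hypothetical helper name.

```shellscript
# Hypothetical helper: read `ovn-sbctl find port_binding ...` output on stdin
# and print the value of the `chassis` column, or UNBOUND when it is empty.
# Only the exact column name "chassis" is matched, so columns such as
# additional_chassis or requested_chassis are ignored.
bound_chassis() {
  awk -F' *: *' '
    $1 == "chassis" {
      found = 1
      value = ($2 == "[]" || $2 == "") ? "UNBOUND" : $2
      print value
    }
    END { if (!found) print "UNBOUND" }'
}
```

Usage: `ovn-sbctl find port_binding logical_port=<SOURCE_PORT_ID> | bound_chassis`. The value printed for a bound port is the chassis record reference as shown by `ovn-sbctl`.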

### 2. Identify the Tap Interface (Data Plane)

Log into the Compute Node hosting the Source VM to find the exact interface name dynamically.

Command (Executed on Source Compute Node):

```shellscript
$ sudo virsh domiflist <SOURCE_VM_ID>

# Note the target interface name from the output (e.g., tapXXXXXXX)
$ export TAP_IFACE="<TARGET_INTERFACE_NAME>"
```

Command (Executed on Destination Compute Node):

```shellscript
$ sudo virsh domiflist <DEST_VM_ID>

# Note the target interface name from the output
$ export DEST_TAP="<TARGET_INTERFACE_NAME>"
```
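
When a VM has more than one NIC, picking the tap by eye is error-prone. The sketch below selects it by MAC address instead. It assumes the tabular `virsh domiflist` layout with the MAC in the last column; `tap_for_mac` is a hypothetical helper name.

```shellscript
# Hypothetical helper: read `virsh domiflist` output on stdin and print the
# interface name of the row whose last column equals the MAC passed as $1.
# The comparison is case-insensitive.
tap_for_mac() {
  awk -v mac="$1" 'tolower($NF) == tolower(mac) { print $1 }'
}
```

Usage on the source node: `export TAP_IFACE="$(sudo virsh domiflist <SOURCE_VM_ID> | tap_for_mac <SOURCE_MAC>)"`, with the MAC taken from the `openstack port show` output in Step 1.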

### 3. Capture at the Source (The Tap)

Goal: Prove the packet is actually leaving the VM.

Command (Executed on Source Compute Node):

```shellscript
$ sudo tcpdump -ni $TAP_IFACE icmp
```

* Success: Packets seen (e.g., `IP <SOURCE_IP> > <DEST_IP>: ICMP echo request`). The Guest OS is fine. Proceed to Step 4.
* Failure: No packets seen. The issue is inside the VM's Guest OS (e.g., internal firewall or interface down).
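
On a busy tap, a plain `icmp` filter can show unrelated pings. A narrower capture filter is sketched below; `icmp_filter` is a hypothetical helper name, and the filter itself is standard pcap filter syntax.

```shellscript
# Hypothetical helper: build a pcap filter that matches only ICMP sent
# from the source VM IP ($1) to the destination VM IP ($2).
icmp_filter() {
  echo "icmp and src host $1 and dst host $2"
}
```

Usage: `sudo tcpdump -ni "$TAP_IFACE" -c 5 "$(icmp_filter <SOURCE_IP> <DEST_IP>)"`, where `-c 5` stops the capture after five matching packets.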

### 4. Physical Datapath Trace (ofproto/trace)

Goal: Ask OVS "Why are you dropping this packet?" based on its programmed flow rules.

Command (Executed on Source Compute Node):

```shellscript
# 1. Get the local OVS port number for the Source VM
$ ofport_a=$(sudo ovs-vsctl get interface $TAP_IFACE ofport)

# 2. Run the trace through the integration bridge
$ sudo ovs-appctl ofproto/trace br-int in_port=$ofport_a,dl_src=<SOURCE_MAC>,dl_dst=<DEST_MAC>,ip,nw_src=<SOURCE_IP>,nw_dst=<DEST_IP>,nw_proto=1
```

Analysis: Look at the `Datapath actions:` line at the bottom of the output. It summarises what OVS would finally do with the packet.

* Output: `Action: output:<OFPORT_B>` (Success! Delivered locally to the destination tap).
* Output: `Action: set_tunnel:0x<VNI>, output:<TUNNEL_PORT>` (Success! Sent to the Geneve tunnel for a remote host).
* Drop: `Action: drop`. Note the `cookie=0x...` value in the trace output and proceed to Step 5.
* Logs to Check: If OVS is dropping traffic that OpenStack says should pass, check `/var/log/ovn/ovn-controller.log` for OpenFlow programming errors (`ofctrl_put` errors).
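
Reading the last line of a long trace can be automated. The sketch below is a rough classifier only: `trace_verdict` is a hypothetical helper name, and it assumes that a tunnelled packet shows a tunnel-related action and that a trace stopping at connection tracking shows `recirc` on the `Datapath actions:` line. The exact action syntax varies between OVS versions, so treat the result as a hint and read the full trace when in doubt.

```shellscript
# Hypothetical helper: summarise an `ovs-appctl ofproto/trace` result read
# from stdin by inspecting its last "Datapath actions:" line.
trace_verdict() {
  local actions
  actions=$(grep -i '^Datapath actions:' | tail -n 1)
  case "$actions" in
    "")        echo "no-result" ;;      # no summary line found in the input
    *recirc*)  echo "recirculated" ;;   # trace paused at conntrack, not a final answer
    *drop*)    echo "drop" ;;           # continue with the cookie lookup
    *tunnel*)  echo "tunnel" ;;         # assumed marker for Geneve output
    *)         echo "local" ;;          # plain output to a local port
  esac
}
```

Usage: pipe the trace command from this step into the helper, for example `sudo ovs-appctl ofproto/trace br-int ... | trace_verdict`.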

### 5. Cookie Decoding (If Trace Dropped)

Goal: Translate the physical OVS drop back into an OVN logical rule.

Command (Executed from Management Plane):

```shellscript
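# Query the Southbound DB using the cookie (remove '0x' from the cookie ID).
# The cookie is the first 32 bits of the logical flow's _uuid, so the match lands
# on the _uuid line; -A 12 also prints the record fields that follow it
# (match, pipeline, and the stage name inside external_ids).
$ ovn-sbctl list logical_flow | grep -A 12 <COOKIE_WITHOUT_0X>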
```

* Analysis Matrix (Layer 2 Drops):
  * `ls_out_acl` / `ls_in_acl` (Security Group Drop): A Neutron Security Group is explicitly denying the traffic.
  * `ls_in_port_sec_l2` / `ls_in_port_sec_ip` (Anti-Spoofing Drop): The Source VM is trying to transmit using a MAC or IP address that does not legitimately belong to its port (e.g., nested virtualization or unapproved static IPs).
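
The cookie handling and the stage-to-cause mapping above can be sketched as two small helpers. Both names (`strip_cookie`, `drop_reason`) are hypothetical, and the mapping covers only the stage names listed in this guide; any other stage is reported as `other`.

```shellscript
# Hypothetical helper: remove a leading 0x (or 0X) so the cookie can be
# used directly as a search pattern.
strip_cookie() {
  local c="${1#0x}"
  echo "${c#0X}"
}

# Hypothetical helper: map an OVN logical stage name to the drop category
# described in the analysis matrix.
drop_reason() {
  case "$1" in
    ls_in_acl*|ls_out_acl*)  echo "security-group" ;;
    ls_in_port_sec_*)        echo "anti-spoofing" ;;
    *)                       echo "other" ;;
  esac
}
```

For example, `strip_cookie 0x1a2b3c4d` prints `1a2b3c4d`, and `drop_reason ls_in_port_sec_ip` prints `anti-spoofing`.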

### 6. Sniff the Tunnels (For Inter-Host Only)

If the VMs are on different hosts and the Step 4 trace ended with output to the tunnel port, verify that packets are physically crossing the wire.

Command (Executed on Source & Destination Compute Nodes):

```shellscript
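# Capture Geneve-encapsulated traffic (UDP port 6081) on the physical NIC that carries the tunnel.
# (genev_sys_6081 shows the inner, already decapsulated frames, so a
#  "udp port 6081" filter on that interface would match nothing.)
$ sudo tcpdump -ni <PHYSICAL_NIC> udp port 6081 -vv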
```

* Success: Packets leave Source Node and arrive at Destination Node.
* Failure: Packets leave Source Node but do not arrive. A physical firewall or switch ACL is blocking UDP 6081.

### 7. Capture at the Destination (The Tap)

Goal: Prove the OpenStack network successfully delivered the packet to the Destination VM's doorstep.

Command (Executed on Destination Compute Node):

```shellscript
$ sudo tcpdump -ni $DEST_TAP icmp
```

* Success: Packets are seen arriving at the tap. If pings still fail, the OpenStack network has done its job; the Destination VM's Guest OS firewall (iptables/Windows Defender) is most likely dropping the traffic.
* Failure: Traces passed and tunnels look good, but packets don't hit the destination tap. Proceed to Step 8.

### 8. Clear Stale FDB Entries (Ghost Traffic)

Goal: If *either* the Source or Destination VM was recently migrated, OVN may be tunneling traffic to the wrong compute node based on a stale Forwarding Database (FDB) entry. Because traffic is bidirectional, a stale entry for either MAC will break communication (forward or return path).

Command (Executed from Management Plane):

```shellscript
# 1. Search the Forwarding Database for the MAC of whichever VM recently migrated
#    (-B 2 -A 2 prints the neighbouring fields of the record, including its _uuid)
$ ovn-sbctl list fdb | grep -i -B 2 -A 2 <MIGRATED_VM_MAC>

# 2. If the entry points to the old chassis, delete it using the value of its _uuid field
$ ovn-sbctl destroy fdb <FDB_UUID>
# (OVN will immediately relearn the correct location upon the next packet transmission)
```

* Logs to Check: On both compute nodes, run `tail -f /var/log/ovn/ovn-controller.log | grep pinctrl`. This tracks the MAC learning process. If `pinctrl` is not seeing the Gratuitous ARP (GARP) from the VM, the FDB will not update.
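
Finding the right `_uuid` by hand is awkward when the FDB holds many entries. The sketch below extracts it from the listing. It assumes that `ovn-sbctl list fdb` prints records separated by blank lines with one `name : value` field per line; `fdb_uuids_for_mac` is a hypothetical helper name.

```shellscript
# Hypothetical helper: read `ovn-sbctl list fdb` output on stdin and print
# the _uuid of every record that mentions the MAC passed as $1.
# The MAC comparison is case-insensitive.
fdb_uuids_for_mac() {
  awk -v mac="$1" '
    BEGIN { RS = ""; FS = "\n"; mac = tolower(mac) }   # one record per paragraph
    index(tolower($0), mac) {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^_uuid/) { sub(/^_uuid *: */, "", $i); print $i }
    }'
}
```

Usage: `ovn-sbctl list fdb | fdb_uuids_for_mac <MIGRATED_VM_MAC>`. Review each printed UUID before passing it to `ovn-sbctl destroy fdb`.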

### 9. The Physical Killers (MTU & Offloads)

If traces pass and the tunnel shows traffic, but pings/SSH still fail, the fault is likely at the physical layer: an undersized MTU or NIC offloads mangling the Geneve-encapsulated packets.

Command (Executed on Compute Nodes):

```shellscript
# 1. Check MTU (Must be >= 1558 if VM is 1500 to account for Geneve overhead)
$ ip link show <PHYSICAL_NIC>

# 2. Disable NIC Offloads (To fix packet header corruption)
$ sudo ethtool -K <PHYSICAL_NIC> tso off gso off gro off tx off rx off
```

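* Analysis: If the physical NIC MTU is 1500 (or anything below 1558) while the VM MTU is 1500, full-size Geneve-encapsulated packets exceed the MTU and drop silently during inter-host transit. Small pings may still succeed while SSH sessions or bulk transfers hang.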
* Logs to Check: Run `dmesg -T | grep -i eth` or check `/var/log/syslog` for hardware-level drops or NIC driver errors. Check `/var/log/openvswitch/ovs-vswitchd.log` for `unreasonably large packet` fragmentation warnings.
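
The MTU arithmetic can be checked with a few lines of shell. This is a sketch using the 58-byte Geneve overhead figure from this guide; the helper names (`required_phys_mtu`, `link_mtu`, `mtu_ok`) are hypothetical.

```shellscript
GENEVE_OVERHEAD=58   # bytes of encapsulation overhead, the figure used in this guide

# Hypothetical helper: minimum physical MTU needed for a given VM MTU.
required_phys_mtu() {
  echo $(( $1 + GENEVE_OVERHEAD ))
}

# Hypothetical helper: extract the mtu value from `ip link show <nic>` output on stdin.
link_mtu() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit } }'
}

# mtu_ok PHYS_MTU VM_MTU: exit status 0 when the physical MTU is large enough.
mtu_ok() {
  [ "$1" -ge "$(required_phys_mtu "$2")" ]
}
```

Usage: `mtu_ok "$(ip link show <PHYSICAL_NIC> | link_mtu)" 1500 || echo "physical MTU too small for Geneve"`.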

## Most common causes

* Missing Security Group Rules: The destination VM's Security Group does not explicitly allow the inbound protocol (Ingress).
* Guest OS Firewall: `iptables`, `ufw`, or Windows Defender inside the destination VM is dropping the traffic even though the network delivered it.
* Stale FDB (Ghost Traffic): OVN is sending traffic to the wrong node after a VM migration. This can cause the initial request to drop (Dest migration) or the return reply to drop (Source migration).
* Physical MTU Mismatch: The physical network does not account for Geneve encapsulation overhead (\~58 bytes), causing packets to drop silently during inter-host transit.
* Hardware Offload Corruption: Physical NIC hardware offloading features are corrupting the Geneve tunnel headers.


