fix: segment GRO/GSO-coalesced packets in PCAP receive path#1780

Merged
midwan merged 1 commit into BlitterStudio:master from tbdye:fix/gro-tcp-segmentation on Feb 12, 2026

Conversation


@tbdye tbdye commented Feb 12, 2026

Summary

PCAP bridged networking delivers ~3-7 KB/s throughput on hosts with GRO, GSO, or virtio NICs — roughly 100x slower than real A2065 hardware. This PR fixes the root cause and restores expected throughput.

The Problem

Linux GRO (Generic Receive Offload), GSO, and virtio mergeable receive buffers coalesce multiple TCP segments into single oversized packets before delivering them to pcap sockets. A burst of 10 standard 1460-byte TCP segments arrives as a single 14,600-byte packet.

The PCAP backend had a hard-coded 1600-byte receive limit (MAX_PSIZE):

#define MAX_PSIZE 1600

static void uaenet_queue(struct uaenet_data *ud, const uae_u8 *data, int len)
{
    if (!ud || len <= 0 || len > MAX_PSIZE)
        return;  // silently dropped
    // ...
}

Every coalesced packet was silently dropped. The only packets that survived were single-segment retransmissions (1460 bytes) triggered after the TCP sender's retransmission timeout expired — typically 300ms on LAN, 500ms+ for remote servers. This forced TCP into a degenerate one-segment-per-RTO-cycle mode:

1460 bytes / 300ms = ~4.9 KB/s (LAN)
1460 bytes / 500ms = ~2.9 KB/s (remote)

Disabling GRO/GSO/TSO via ethtool does not help on virtualized hosts (Proxmox/KVM, Hyper-V, etc.) because the hypervisor coalesces packets in the virtual NIC before the guest kernel sees them.

The Fix

Replace the hard drop with TCP segmentation. When the PCAP backend receives a packet larger than a standard Ethernet frame (1514 bytes):

  1. Fast path: Packets <= 1514 bytes pass through directly (zero overhead for normal traffic)
  2. Oversized IPv4/TCP: Parse Ethernet, IP, and TCP headers, then segment the payload into MSS-sized chunks. For each segment, build a complete Ethernet frame with updated IP header (length, identification, checksum), TCP header (sequence number, flags, checksum), and payload
  3. Non-IPv4/non-TCP oversized: Log and drop (GRO only coalesces TCP in practice)
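
The chunking in step 2 can be sketched as follows. This is a simplified illustration with hypothetical names, reduced to tracking only the per-segment fields that change; the actual patch parses real headers and emits complete Ethernet frames:

```c
#include <stdint.h>

#define MSS 1460  /* standard Ethernet TCP payload size */

/* Hypothetical struct: the per-segment fields that change when a
 * coalesced packet is split (the real code writes these into headers). */
typedef struct {
    uint32_t seq;         /* TCP sequence number, advanced by payload offset */
    uint16_t ip_id;       /* IP identification, incremented per segment */
    uint16_t payload_len; /* bytes of TCP payload in this segment */
    int      fin_psh;     /* FIN/PSH kept only on the final segment */
} seg_fields;

/* Split total_len bytes of coalesced TCP payload into MSS-sized
 * segments; returns the number of segments written to out. */
static int split_segments(uint32_t base_seq, uint16_t base_id,
                          int total_len, int had_fin_psh,
                          seg_fields *out, int max_segs)
{
    int n = 0;
    for (int off = 0; off < total_len && n < max_segs; n++) {
        int chunk = total_len - off;
        if (chunk > MSS)
            chunk = MSS;
        out[n].seq = base_seq + (uint32_t)off;
        out[n].ip_id = (uint16_t)(base_id + (uint16_t)n);
        out[n].payload_len = (uint16_t)chunk;
        out[n].fin_psh = (off + chunk == total_len) ? had_fin_psh : 0;
        off += chunk;
    }
    return n;
}
```

For the 14,600-byte example above, this yields ten 1460-byte segments, with FIN/PSH surviving only on the tenth.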

Per-segment details:

  • IP identification: incremented per segment
  • TCP sequence number: advanced by payload offset
  • TCP flags: FIN and PSH preserved only on the final segment
  • Checksums: both IP header and TCP (with pseudo-header) fully recomputed
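
The checksum recomputation is the standard one's-complement Internet checksum (RFC 1071); for TCP it covers a pseudo-header (source/destination IP, protocol 6, TCP length) followed by the TCP header and payload with the checksum field zeroed. A minimal sketch, with hypothetical function names:

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over len bytes (RFC 1071), folded and inverted.
 * 'sum' seeds the accumulator so a pseudo-header can be pre-added. */
static uint16_t inet_csum(const uint8_t *data, size_t len, uint32_t sum)
{
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len)                  /* odd trailing byte, zero-padded */
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)         /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* TCP checksum: seed with the IPv4 pseudo-header, then sum the
 * segment (TCP header + payload, checksum field zeroed). */
static uint16_t tcp_csum(uint32_t src_ip, uint32_t dst_ip,
                         const uint8_t *tcp, size_t tcp_len)
{
    uint32_t sum = 0;
    sum += (src_ip >> 16) + (src_ip & 0xffff);
    sum += (dst_ip >> 16) + (dst_ip & 0xffff);
    sum += 6;                 /* IPPROTO_TCP */
    sum += (uint32_t)tcp_len;
    return inet_csum(tcp, tcp_len, sum);
}
```

A handy property for verifying rebuilt frames: summing a segment that already contains its correct checksum yields 0.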

The receive queue depth is also raised from 10 to 50. A single GRO-coalesced packet segments into 10+ frames, so the old limit would drop most segments from a single burst.

Testing

Tested on Debian 13 (kernel 6.12) running as a Proxmox VM with a virtio NIC. Emulated A4000/040 with A2065 NIC, Roadshow TCP/IP stack. GRO/GSO/TSO left enabled at default settings.

LAN FTP (200KB file):

  • Before: 204,800 bytes in 31 seconds (~6.6 KB/s)
  • After: 204,800 bytes in 0.226 seconds (~885 KB/s) — 134x improvement

Remote FTP (Aminet, transatlantic):

  • Before: 202,765 bytes in 63.4 seconds (~3.2 KB/s)
  • After: 202,765 bytes in 3.37 seconds (~60 KB/s) — 19x improvement

The LAN result (885 KB/s) exceeds real A4000/040 + A2065 benchmarks (400-600 KB/s), which is expected since the emulated 68040 runs at an effective clock speed higher than 25 MHz. The remote result (60 KB/s) is RTT-bound, consistent with Roadshow's 33KB TCP window over a ~170ms path.

Generated with Claude Code

Hosts with GRO, GSO, or virtio NICs deliver TCP segments coalesced
into packets far larger than the Ethernet MTU (up to 64KB). The PCAP
backend silently dropped these because of a 1600-byte hard limit,
forcing TCP into RTO-driven single-segment retransmission and reducing
throughput to ~3-7 KB/s — roughly 100x slower than real hardware.

Add TCP segmentation that splits oversized IPv4/TCP packets back into
MSS-sized Ethernet frames with correct IP/TCP headers and checksums
before queuing them for the A2065 emulation. Raise the receive queue
depth from 10 to 50 to accommodate segmentation bursts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tbdye tbdye requested a review from midwan as a code owner February 12, 2026 05:36
@midwan midwan merged commit 616be65 into BlitterStudio:master Feb 12, 2026
18 of 22 checks passed
@tbdye tbdye deleted the fix/gro-tcp-segmentation branch February 13, 2026 22:29