> I am aware that this may not be optimal as any IP packet from the server >1420 will need to be fragmented.
The core of the problem is that the server doesn't know such packets need to be fragmented.
> The typical TCP handshake (Num 1-3), then probably ClientHello (Num 4-5) and ServerHello (Num 6-7). No packet size comes close to the MTU, and there are no other ICMP messages that would indicate issues with fragmentation etc.
There are large packets – such as the TLS Certificate message from the server[4] – but your capture is not seeing them because they are larger than your MTU, so they never arrive at your end. That is literally the problem; if those packets reached your network interface (such that they became visible on a packet capture), then the connection wouldn't hang.
The capture needs to be done on the "upstream" end of the tunnel, specifically on the ingress interface that is one step before the 'low MTU' interface. So if the path is "internet → server eth0 ⇒ server wg0 → client wg-foo ⇒ client ether1", then the large packets will be visible on "server eth0" but won't fit into "server wg0". Capturing on wg0 would therefore give you nothing, but capturing on eth0 would likely show a series of:
you --> VPS --> API    TCP SYN
    <--     <--        TCP SYN/ACK
    -->     -->        small packet
    <--     <--        small packet
           X<--        big packet
            -->        ICMP frag needed
           X<--        big packet
            -->        ICMP frag needed
           X<--        big packet
            -->        ICMP frag needed
                       ...
(Note that hardware receive offload might give confusing results, as your Ethernet NIC might coalesce segments into one super-packet, e.g. when capturing on the end-host itself. If you see packets over 2 kB in size, you may need to run `ethtool -K eth0 gso off gro off` for the duration of the capture.)
> why would it help if I reduce the MTU on the client?
During the TCP handshake, the client (both peers really) declares a TCP MSS – maximum segment size – that it can receive. Since the client usually has infinite memory nowadays[1] and is not limited to tiny segments, it simply offers the largest MSS that fits within the MTU it knows about, in order to avoid the need for IP-level fragmentation.
For example, if your Ethernet interface's MTU is 1500 then your OS might offer an MSS of 1460, which exactly fits within the IP payload (assuming an IPv4 header of 20 bytes and a TCP header of another 20 bytes, in the simplest case).
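The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `mss_for_mtu` is made up here, and it assumes IPv4 with no IP or TCP options (real stacks may subtract more, e.g. for TCP timestamps).

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


# A plain Ethernet interface versus a 1420-byte tunnel interface.
for mtu in (1500, 1420):
    print(f"MTU {mtu} -> MSS {mss_for_mtu(mtu)}")
```

With these assumptions, a client whose interface MTU is 1420 would advertise an MSS of 1380 instead of 1460, so the server's packets already fit the tunnel.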
So reducing the MTU of the client's network interface will lead to it declaring a smaller acceptable TCP segment size upfront, which causes the server to always send smaller IP packets (i.e. staying below the limit at which fragmentation would become required), just as if you had reduced the server's MTU.
With the default 1500 MTU, meanwhile, the server will send large segments in large IP packets, until it receives an ICMP "Fragmentation needed" from your ISP's gateway (the one that has the low-MTU link towards you and is unable to forward those packets to you); then the server will note the new PMTU towards you and will start sending those segments fragmented at IP level.[5]
But if any firewall prevents[3] that ICMP error from reaching the server, this won't happen and the server will forever try sending that TCP segment in the same large IP packet. (Or, if the server is behind a certain type[2] of firewall which reassembles and re-fragments all IP packets going through it, then it might be fragmenting the packet correctly but the firewall could be undoing all its work.)
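To make the difference between working path MTU discovery and a "black hole" concrete, here is a toy simulation. It is not how any real TCP/IP stack is implemented; `send_segment`, its parameters and the retry limit are all invented for illustration, and it simply models a sender that shrinks its packets only when the ICMP error gets back to it.

```python
def send_segment(packet_size, link_mtus, icmp_delivered, max_tries=5):
    """Simulate a sender retrying one packet along a path.

    link_mtus: MTU of each hop, in order, from server to client.
    icmp_delivered: whether "Fragmentation needed" errors reach the sender.
    Returns (sizes_tried, delivered).
    """
    tried = []
    size = packet_size
    for _ in range(max_tries):
        tried.append(size)
        # First hop that cannot forward a packet of this size, if any.
        bottleneck = next((m for m in link_mtus if m < size), None)
        if bottleneck is None:
            return tried, True  # the packet fit through every hop
        if icmp_delivered:
            size = bottleneck  # sender learns the smaller path MTU
        # Otherwise the sender learns nothing and retries the same size.
    return tried, False


print(send_segment(1500, [1500, 1420], icmp_delivered=True))
print(send_segment(1500, [1500, 1420], icmp_delivered=False))
```

In the first call the sender drops to 1420 after one failure and the packet is delivered; in the second it keeps retrying 1500-byte packets until the simulation gives up, which is the hang described above.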
Gateways, such as Linux with nftables/iptables, often have a feature (commonly called "MSS clamping") to patch the advertised MSS of TCP handshakes going through them in order to fit the MTU that the gateway knows, e.g. when the client is on a 1500-byte MTU Ethernet but the gateway is about to forward the packet through a 1420-byte MTU PPPoE tunnel.
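What the gateway does to the handshake can be sketched as follows. This is a conceptual model only, not the actual nftables/iptables rule or kernel code; `clamp_mss` is a hypothetical name, and the 40-byte overhead again assumes IPv4 and TCP headers without options.

```python
HEADER_OVERHEAD = 40  # IPv4 (20) + TCP (20), assuming no options


def clamp_mss(advertised_mss: int, outgoing_mtu: int) -> int:
    """Lower the MSS in a forwarded SYN so segments fit the outgoing link.

    The value is only ever reduced: a peer that already advertises a
    smaller MSS is left alone.
    """
    return min(advertised_mss, outgoing_mtu - HEADER_OVERHEAD)


# Client on 1500-byte Ethernet, gateway forwarding into a 1420-byte tunnel.
print(clamp_mss(1460, 1420))
# Client that already advertises a small MSS is not touched.
print(clamp_mss(1200, 1420))
```

The effect is the same as lowering the client's MTU by hand, except that it is applied once on the gateway rather than on every client.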
> However, this is agnostic of any L4 protocol. SSL over TCP should not care about fragmentation and MTU.
If my understanding is correct, TCP has to care about MTU, because relying on IP fragmentation reduces the efficiency of TCP retransmissions – if even a single fragment is 'lost' then the entire IP packet is 'lost' and none of it gets delivered to the upper layer protocol.
For example, if TCP sent a 64k segment that was fragmented into 45 IP datagrams and one of them got lost, then all of them would need to be retransmitted after the ICMP "Reassembly time exceeded". (This is assuming fragmentation works at all, which as you see sometimes doesn't.[3])
Whereas with the same 64k of data divided into TCP segments that fit within the IP MTU, the other 44 IP packets would still be delivered to the recipient's TCP layer and SACK'ed and only the lost one would have to be retransmitted (which I think might even happen ~immediately after the server receives a SACK that indicates a hole, instead of a long reassembly timeout).
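The figure of 45 fragments, and the cost of losing one of them, can be checked with a rough calculation. The sketch below assumes a 1500-byte MTU, a 20-byte IPv4 header without options, and a maximum-size 65535-byte IP datagram; `fragment_count` is a name chosen for this example.

```python
import math

IPV4_HEADER = 20   # bytes, assuming no IP options
MTU = 1500         # assumed link MTU
DATAGRAM = 65535   # largest possible IPv4 datagram ("64k")


def fragment_count(datagram_size: int, mtu: int = MTU) -> int:
    """Number of IP fragments needed to carry one datagram over the link."""
    payload = datagram_size - IPV4_HEADER
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


print(fragment_count(DATAGRAM))  # 45

# Bytes that must be resent after losing a single packet on the wire:
resend_if_fragmented = DATAGRAM  # the whole datagram is discarded
resend_if_segmented = MTU        # only the one lost MTU-sized packet
print(resend_if_fragmented, resend_if_segmented)
```

Under these assumptions a single lost packet costs roughly 64 kB of retransmission with IP fragmentation, versus about 1.5 kB when TCP keeps its segments within the MTU.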
[1] Or so most developers assume.
[2] Such as Untangle in its default "brouting" mode.
[3] Also known as a "PMTUD black hole". See e.g. Cloudflare blog post #1 and #2 and #3 for one situation where it happens for reasons other than a sysadmin blanket-blocking ICMP.
[4] The 'Certificate' messages were visible in the clear with TLSv1.2, but are encrypted with TLSv1.3 so a capture will only see them as 'Application Data'.
[5] I don't actually know whether it fragments the same segments at IP level or whether it reduces its TCP MSS for that connection as well. It might actually be the latter.