In this post I would like to write about Path MTU discovery and IP fragmentation, and how the two relate to each other.
As per the topology above, suppose the host JUNOS1 is sending a packet to JUNOS3 (note: all of these devices are actually Linux hosts, not JunOS devices; I just couldn’t change the hostnames in GNS because of a problem). The packet has to travel along a path whose links have different MTU sizes. In the past I used to think that Path MTU discovery was something done before TCP communication starts: it would detect the lowest MTU on the path, and TCP segments would be sized accordingly. That isn’t how it works. Here is how it actually works:
Assume the packet leaving JUNOS1 has a total length of 1450 bytes. Because the link between JUNOS1 and JUNOS2 has a 1500-byte limit, there is no problem. However, once JUNOS2 receives the packet, it sees that the link it must use to forward it has a lower MTU than the size of the packet. Under normal circumstances, JUNOS2 sends an ICMP notification back to JUNOS1 and says: “Hey dude, I can’t forward this packet because the next link on the way has an 800-byte MTU, do something and lower your packet size.”
JUNOS1 receives this ICMP message and limits the size of its subsequent packets to 800 bytes, and then the packets flow through. OK, fair enough so far, but there is also the concept of IP fragmentation, so why doesn’t it occur here? After all, the documents say that if the next link’s MTU is lower than the size of the packet being forwarded, the packet is fragmented.
This is where Path MTU discovery comes in:
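To make the fragmentation alternative concrete, here is a minimal Python sketch of how a router could split the 1450-byte packet for the 800-byte link if fragmentation were allowed. The `fragment` function and the 20-byte header constant are my own illustrative assumptions (an IPv4 header without options), not anything taken from the lab devices.

```python
IP_HEADER_LEN = 20  # assumption: IPv4 header without options


def fragment(total_length, mtu):
    """Return a list of (offset_in_8_byte_units, total_length, more_fragments)."""
    payload = total_length - IP_HEADER_LEN
    # Every fragment except the last must carry a multiple of 8 payload bytes,
    # because the fragment offset field counts in 8-byte units.
    max_payload = (mtu - IP_HEADER_LEN) // 8 * 8
    fragments = []
    sent = 0
    while sent < payload:
        chunk = min(max_payload, payload - sent)
        more = sent + chunk < payload
        fragments.append((sent // 8, chunk + IP_HEADER_LEN, more))
        sent += chunk
    return fragments


if __name__ == "__main__":
    for offset, length, more in fragment(1450, 800):
        print(f"offset={offset} total_length={length} MF={int(more)}")
```

With these assumptions the 1450-byte packet becomes two fragments of 796 and 674 bytes; the first one cannot use the full 800 bytes because its payload has to be a multiple of 8.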
net.ipv4.ip_no_pmtu_disc = 0
This setting on JUNOS1 (0 means Path MTU discovery is enabled) causes every packet to be sent with the DF (Don’t Fragment) bit set to 1, which means no intermediate router is allowed to fragment the packet. Below is a screenshot showing how such a packet looks in a Wireshark capture. Can you see the Don’t Fragment bit?
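If you prefer bytes over screenshots, the DF bit can also be checked by hand. The following is a small Python sketch that builds a bare 20-byte IPv4 header and tests the flag; the helper names and the 10.0.0.x addresses are hypothetical, and the checksum is left at zero, so this is only for illustration and not something to put on the wire.

```python
import struct

DF_FLAG = 0x4000  # "Don't Fragment" bit in the 16-bit flags/fragment-offset field
MF_FLAG = 0x2000  # "More Fragments" bit, shown only for comparison


def build_ipv4_header(total_length, dont_fragment):
    """Build a minimal 20-byte IPv4 header (checksum left at 0 for brevity)."""
    flags_frag = DF_FLAG if dont_fragment else 0
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45,                  # version 4, header length 5 * 4 = 20 bytes
        0,                     # DSCP/ECN
        total_length,
        0,                     # identification
        flags_frag,            # flags + fragment offset
        64,                    # TTL
        6,                     # protocol: TCP
        0,                     # header checksum (not computed in this sketch)
        bytes([10, 0, 0, 1]),  # hypothetical source address
        bytes([10, 0, 0, 3]),  # hypothetical destination address
    )


def has_df(header):
    """Return True if the DF bit is set in a raw IPv4 header."""
    (flags_frag,) = struct.unpack_from("!H", header, 6)
    return bool(flags_frag & DF_FLAG)


if __name__ == "__main__":
    print(has_df(build_ipv4_header(1500, dont_fragment=True)))
```

The flags live in the top three bits of the 16-bit field at byte offset 6, which is the same field Wireshark decodes as "Flags".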
Because the sender doesn’t allow fragmentation, the first intermediate router whose outgoing link has a lower MTU sends an ICMP response back. Let’s see what kind of ICMP message JUNOS1 receives:
As you can see, the TCP three-way handshake completes properly, but once JUNOS1 tries to send a segment larger than 800 bytes (it isn’t visible in the output, but packets 289 and 290 each have a total length of 1500 bytes), it receives the ICMP response above (Destination unreachable, Fragmentation needed) from JUNOS2 and lowers the size of its subsequent packets to fit within the 800-byte limit.
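For reference, the ICMP message in question is type 3 (Destination unreachable), code 4 (Fragmentation needed), and per RFC 1191 it carries the next-hop MTU in the second half of the otherwise unused 32-bit field. Here is a rough Python sketch of reading that value; the function names are mine, and the checksum and the quoted original packet are ignored for simplicity.

```python
import struct

ICMP_DEST_UNREACH = 3  # ICMP type: Destination unreachable
ICMP_FRAG_NEEDED = 4   # ICMP code: Fragmentation needed and DF set


def build_frag_needed(next_hop_mtu, original=b""):
    """Build the 8-byte ICMP header (checksum left at 0) plus quoted data."""
    header = struct.pack(
        "!BBHHH", ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, 0, 0, next_hop_mtu
    )
    return header + original


def parse_frag_needed(icmp):
    """Return the next-hop MTU, or None if this is a different ICMP message."""
    icmp_type, code, _checksum, _unused, mtu = struct.unpack_from("!BBHHH", icmp, 0)
    if icmp_type != ICMP_DEST_UNREACH or code != ICMP_FRAG_NEEDED:
        return None
    return mtu


if __name__ == "__main__":
    print(parse_frag_needed(build_frag_needed(800)))
```

In the capture above this field would hold 800, which is how JUNOS1 learns exactly how small its packets have to be.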
Now the question: does the communication always work like this? I mean, is this process repeated every time a new TCP connection is needed? Not really. Linux caches the path MTU. Let’s see it:
Can you see it? Now Linux on JUNOS1 knows that it shouldn’t send any packet bigger than 800 bytes the next time it sends to this destination. As can be seen in the output, this cache entry expires in 596 seconds. I have noticed that even while packets are flowing in this direction, the expiry value keeps counting down to zero. So having an active connection doesn’t mean the timer is reset to its upper limit again and again.
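The caching behavior I observed can be modeled with a few lines of Python. This is only a toy model and not the kernel's implementation: the `PmtuCache` class is hypothetical, and the 600-second lifetime is my assumption, chosen because it roughly matches the 596 seconds shown in the output.

```python
class PmtuCache:
    """Toy per-destination path MTU cache with a fixed, non-refreshing lifetime."""

    def __init__(self, lifetime=600.0, default_mtu=1500):
        self.lifetime = lifetime        # assumption: seconds until an entry expires
        self.default_mtu = default_mtu  # MTU of the local link
        self._entries = {}              # destination -> (mtu, expires_at)

    def learn(self, destination, mtu, now):
        """Record an MTU reported by an ICMP "fragmentation needed" message."""
        self._entries[destination] = (mtu, now + self.lifetime)

    def lookup(self, destination, now):
        """Return the MTU to use; sending traffic does NOT extend the entry."""
        entry = self._entries.get(destination)
        if entry is None:
            return self.default_mtu
        mtu, expires_at = entry
        if now >= expires_at:
            del self._entries[destination]
            return self.default_mtu
        return mtu


if __name__ == "__main__":
    cache = PmtuCache()
    cache.learn("JUNOS3", 800, now=0)
    print(cache.lookup("JUNOS3", now=300), cache.lookup("JUNOS3", now=600))
```

Once the entry expires, the sender goes back to the local link MTU and has to learn the 800-byte limit again from a fresh ICMP message.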
During my troubleshooting, I asked myself what happens if I simply block every ICMP packet sent from JUNOS2. The answer is that communication halts! Because JUNOS2’s feedback about the next link’s MTU never arrives, JUNOS1 keeps sending its packets at 1500 bytes. Since the DF bit is set, fragmentation can’t happen either, and everything is stuck. This is a very bad thing indeed!
Then I asked another question: what can I do on the JUNOS3 side to prevent this from happening if I can’t reach the JUNOS1 admin? This is where MSS (Maximum Segment Size) comes in.
MSS isn’t actually a negotiated value: each side simply announces its own, so whatever JUNOS3 tells its peer during TCP connection establishment, JUNOS1 must obey. What I did was set the advertised MSS on the default route to 700 on JUNOS3:
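The difference between the working case and the stuck case can be shown with a tiny simulation. This is a sketch under my own assumptions: the list of link MTUs is hypothetical (only the 1500 and 800 values come from the lab), every packet has DF set, and the sender gives up after a fixed number of attempts.

```python
def send_with_pmtud(path_mtus, first_size, icmp_blocked, max_attempts=5):
    """Return the packet size that finally gets through, or None if stuck.

    Assumes every packet carries DF=1, so routers never fragment.
    """
    size = first_size
    for _ in range(max_attempts):
        # The first link that is too small for the packet drops it.
        bottleneck = next((mtu for mtu in path_mtus if mtu < size), None)
        if bottleneck is None:
            return size   # the packet fits every link on the path
        if icmp_blocked:
            continue      # no feedback arrives: the same size is retransmitted
        size = bottleneck  # ICMP "fragmentation needed" reported this MTU
    return None


if __name__ == "__main__":
    path = [1500, 800, 1500]  # hypothetical link MTUs from JUNOS1 to JUNOS3
    print(send_with_pmtud(path, 1500, icmp_blocked=False))
    print(send_with_pmtud(path, 1500, icmp_blocked=True))
```

With ICMP allowed the sender settles on 800 bytes after one round trip; with ICMP blocked it never learns anything and the function returns `None`, which is the stuck situation described above.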
JUNOS3# ip route change 0.0.0.0/0 dev eth0 advmss 700
After this, every TCP SYN and SYN-ACK that JUNOS3 sends advertises an MSS of 700, and because JUNOS1 obeys this and sizes its segments accordingly, the packet flow is not disrupted.
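The arithmetic behind choosing 700 is easy to check. Here is a short Python sketch, assuming plain 20-byte IPv4 and 20-byte TCP headers with no options; the function names are my own.

```python
IPV4_HEADER = 20  # assumption: no IP options
TCP_HEADER = 20   # assumption: no TCP options


def effective_mss(local_mtu, peer_advertised_mss):
    """Sender uses the smaller of its own limit and what the peer advertised."""
    return min(local_mtu - IPV4_HEADER - TCP_HEADER, peer_advertised_mss)


def largest_packet(mss):
    """Total IP packet length for a full-sized segment."""
    return mss + IPV4_HEADER + TCP_HEADER


if __name__ == "__main__":
    clamped = effective_mss(1500, 700)      # JUNOS3 advertises 700
    unclamped = effective_mss(1500, 1460)   # JUNOS3 advertises the usual 1460
    print(largest_packet(clamped), largest_packet(unclamped))
```

With an advertised MSS of 700, the largest packet JUNOS1 sends is 740 bytes, which fits the 800-byte link without any ICMP feedback. Keep in mind this only helps TCP; other protocols still depend on Path MTU discovery or fragmentation.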
I hope I haven’t made a mistake so far in this post. Let me know if you have any contributions or questions.
Path MTU discovery in JunOS:
If you want to enable/disable Path MTU discovery on an SRX, the following output should be enough, I think:
Note1: I was wondering how TCP keeps its per-connection variables. For example, the MSS is announced only during connection establishment and nowhere else, yet it is known for the entire lifetime of the connection. I think the Transmission Control Block, discussed in RFC 2140 (http://www.ietf.org/rfc/rfc2140.txt), is the key to this question.
Note2: One year after this post was published, I discovered some additional behavior. The net.ipv4.ip_no_pmtu_disc setting only takes effect if you are the TCP sender, at least in my test on Ubuntu. For example, if you are the web server (responder), the setting has no effect: no matter what you do, every IP packet carrying a TCP segment (reply) has the DF bit set to 1. I don’t know why this behavior can’t be changed. There may be a reason, or an option to change it, but I haven’t found it yet.