Recently I had to redo the design of the machine with our public websites, and after an earlier successful implementation of virtualisation with FreeBSD jails, I decided to put them all in their own private jail, with their own public IP address, too.
Since I'm a firm believer in "eat your own stuff" and my website was on the list of sites to be moved, I decided to do that one first. The IP range we have for it was 126.96.36.199/24, and since the first half of it was already in use by other services, I started to go down from 255.
To make life easier for us, we use a lot of dynamic routing in our network. Also with jails: They're defined on the loopback interfaces and the subnet masks are all /32's. The combination of these two should make it easy to move them around if necessary without having to worry about physical machines and subnets and DNS.
So, we have this new webserver (my webserver, so somehow important to me) on 188.8.131.52 and it seems to work fine. I can access it from inside the network, I can access it from outside the network, I see webbrowsers and spiders connecting to it. Life is good!
Except... I get reports from people saying that they can't get to my website, that there is some kind of DNS error: Cannot find server or DNS error is what Internet Explorer tells them. I ask them: "Can you ping the machine? "No that's not workin." "Can you telnet to it?" "No, it says Connect failed.". I don't see anything in the logs, I don't see anything on the network. No idea what goes wrong here...
Finally I get the same message from friends who have elite skillz in the ancient arts of ping, traceroute, telnet and tcpdump (Hi dvl, koitsu!). And we start trying: Yes, we can ping 184.108.40.206, so there is nothing wrong on the end-to-end network layer. No, we can't ping 220.127.116.11, but I saw their ICMP packets on the webserver. From inside the jail, I can connect to their hosts, so there is nothing wrong with TCP sessions. We advertise a /21 to the world, so it won't be a network boundary problem. One of them can connect to the webserver (He's running FreeBSD), and one of them cannot (He's running Windows), I see the packets of the first, but not the packets of the second (whose ICMP packets I saw). Then the one with FreeBSD tries it with his Windows machine and he can't suddenly anymore. I think we narrowed the problem down to one thing: Microsoft Windows (Ouch, it did it again).
We do more tests: On the Windows machine, we cannot ping 18.104.22.168 (but I see the ICMP packets. We cannot setup a TCP session to it (and I don't seen any TCP packets). We can ping 22.214.171.124, and we can setup a TCP session to it. Now put one and one together....
Historically, 126.96.36.199 is in a class C subnet, going from 188.8.131.52 to 184.108.40.206. These days, with Classless Inter-Domain Routing, that subnet can be split in many little subnets, or be part of a supernet. Somehow, Windows still thinks in classfull subnets (You can see it with the default subnetmask it suggests when you configure an IP address on a network interface). And it prohibits TCP traffic halfway in the IP stack traffic to that IP address. To test this, we tried the following on the Window machines:
Anyway, the webserver now runs on 220.127.116.11 and Windows machines are happy again.
See also the thread at DSL Reports.com.
Update: The problem is confirmed in Windows2000, Windows2003 and Windows XP. Vista handles the ICMP and TCP packets as expected.