Saturday, January 29, 2011

HTTP request does not reach the server occasionally. Why?

We host our web service on a dedicated server. Sometimes (I'd say 1 request out of 20) no response is received from the server, and the browser fails with a time-out error.

An important detail: the request is not logged by Apache in this case. The server is not loaded; there is plenty of free memory and CPU capacity.

I have captured the problem case with the tcpdump utility. These are the "good" and "bad" sessions traced by tcpdump. The request is the same in both experiments. Good: the server returns a response. Bad: no response, time-out error.

Can you see from this data why the problem happens? What should I do next to get closer to the source of the error?

I've replaced my real IP address with 123.45.67.890

---- Bad ----
12:23:36.366292 IP 123.45.67.890.61749 > myserver.superbservers.com.www: S 2125316338:2125316338(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:23:39.362394 IP 123.45.67.890.61749 > myserver.superbservers.com.www: S 2125316338:2125316338(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:23:45.365567 IP 123.45.67.890.61749 > myserver.superbservers.com.www: S 2125316338:2125316338(0) win 8192 <mss 1460,nop,nop,sackOK>
--------

---- Good ----
12:27:07.632229 IP 123.45.67.890.63914 > myserver.superbservers.com.www: S 3581365570:3581365570(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:27:10.620946 IP 123.45.67.890.63914 > myserver.superbservers.com.www: S 3581365570:3581365570(0) win 8192 <mss 1460,nop,wscale 2,nop,nop,sackOK>
12:27:10.620969 IP myserver.superbservers.com.www > 123.45.67.890.63914: S 2654770980:2654770980(0) ack 3581365571 win 5840 <mss 1460,nop,nop,sackOK,nop,wscale 6>
12:27:10.838747 IP 123.45.67.890.63914 > myserver.superbservers.com.www: . ack 1 win 4380
12:27:10.957143 IP 123.45.67.890.63914 > myserver.superbservers.com.www: P 1:213(212) ack 1 win 4380
12:27:10.957152 IP myserver.superbservers.com.www > 123.45.67.890.63914: . ack 213 win 108
12:27:10.965543 IP myserver.superbservers.com.www > 123.45.67.890.63914: P 1:630(629) ack 213 win 108
12:27:10.965621 IP myserver.superbservers.com.www > 123.45.67.890.63914: F 630:630(0) ack 213 win 108
12:27:11.183540 IP 123.45.67.890.63914 > myserver.superbservers.com.www: . ack 631 win 4222
12:27:11.185657 IP 123.45.67.890.63914 > myserver.superbservers.com.www: F 213:213(0) ack 631 win 4222
12:27:11.185663 IP myserver.superbservers.com.www > 123.45.67.890.63914: . ack 214 win 108
--------
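
One way to spot the pattern in longer captures is to look for SYN packets that never receive a SYN-ACK. In the "bad" trace above the same SYN is sent three times (retransmitted after about 3 and then 6 seconds) and is never answered; in the "good" trace the first SYN also goes unanswered and only the retransmission gets a SYN-ACK. The following is a minimal sketch in Python, assuming the capture is saved as text in the same tcpdump format as above; the function name `unanswered_syns` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 12:23:36.366292 IP 1.2.3.4.61749 > host.www: S 1:1(0) win 8192 <...>
LINE_RE = re.compile(r"^(\S+) IP (\S+) > (\S+): (\S+) (.*)$")


def unanswered_syns(lines):
    """Return {(client, server): syn_count} for handshakes with no SYN-ACK.

    client and server are the "address.port" strings printed by tcpdump.
    """
    syn_counts = defaultdict(int)
    answered = set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a packet line (separator, blank line, ...)
        _timestamp, src, dst, flags, rest = match.groups()
        if "S" not in flags:
            continue  # only handshake packets are of interest here
        if re.search(r"\back \d+", rest):
            # SYN with an ack number: the server's SYN-ACK to the client
            answered.add((dst, src))
        else:
            # plain SYN: a connection attempt (or a retransmission of one)
            syn_counts[(src, dst)] += 1
    return {pair: n for pair, n in syn_counts.items() if pair not in answered}
```

A result with several SYNs for the same client port and no answer means the packets reached the capturing interface but the handshake was never completed, which narrows the search to the server side rather than the route.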

Details on the service.

This is a weather reporting service. It is written in Perl, backed by MySQL. The script uses several modules (from CPAN and our own).

The code is relatively simple. The script downloads the weather from another server, converts the data format and returns an XML response. The weather is cached in a MyISAM table. There is also a world locations database (InnoDB) that can be queried via the script.

Hosting: SuperbHosting OS: Ubuntu

  • Try using tcpdump or wireshark to monitor the network traffic. That way at least you will know if there's a networking issue. I.e. check if the request hits the machine at all.

    Also, by default most browsers limit the number of simultaneous connections to the same server (2). If your page has some JavaScript objects which "forget" to close a connection, it might be that the browser never actually sends the request.

    par : Thanks, I will try tcpdump. I also test my requests with a Perl client program, so browser issues can be excluded.
    par : I have profiled server network traffic with tcpdump. See the update to the question. Can you tell from the dumps why the server does not respond?
    Sunny : From the dumps it appears that the request is actually received on the machine. Why Apache does not pick it up is another story, and I have no answer for this. Can you try to run Apache in the most verbose debug mode and look for something strange in the logs? What are you actually serving? Is it some script? Can you try to reproduce the problem with a very simple static HTML page (no pics, just some text)?
    par : Hi! I have added the details on our service. Yes, the error also occurs with a simple HTML file. The problem occurs from my home and from an internet cafe.
    From Sunny
  • Can you try your request using only IP addresses? If so, this may help narrow down the problem.

    Do all the requests that have the problem come from the same location? If so, try another location, perhaps a laptop in a Starbucks or something. If it happens from more than one location, using different browsers, on a very simple page without AJAX or complicated JavaScript, that is valuable information.

    If using the IP address works reliably, then it is likely DNS. Knowing the domain name in use may help narrow it down.

    par : Thank you for the valuable info. This is definitely not DNS. I have tested with a plain IP address and the response still goes missing occasionally.
    Michael Graff : Then it is either routing, the server (hosting company), or your client. I suspect Apache only logs the connection when it completes, as it typically reports status, bytes sent, etc. Perhaps if it gets a connection where the client sends no data or is otherwise broken, it simply never logs this. Try `tcpdump` or `wireshark` to see what is happening at the network level. Ideally, run this from the client as well, as it may be an IP fragmentation issue.
    par : I have profiled server network traffic with tcpdump. See the update to the question. Can you tell from the dumps why the server does not respond?
  • I'd go with Michael Graff and then put some money on the hosting company - these kinds of traffic problems very easily occur with failing patch panels, NICs, NIC driver issues or bad cabling, amongst a thousand other infrastructure things.

    I'm counting on you having tried this from different locations (or have reports from other places with the same problems) and gotten the same problem regardless so we can rule out a problem at your end, correct?

    I'm a hardware freak, so I tend to lean towards hardware failures as the cause for weird software and network issues and mass destruction in general.

    par : I still have not tested from any location besides my home. Thank you Oskar, I will look into the hardware.
  • The problem was a large number of open TCP connections; new connections were occasionally dropped because of this.

    From par
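
The last answer does not say how the number of open connections was measured or which limit was reached. As a rough way to check for a similar situation, the sketch below counts TCP sockets per state. It is written in Python and assumes a Linux server (the question mentions Ubuntu) that exposes IPv4 sockets in `/proc/net/tcp`; the function names `count_tcp_states` and `read_proc_tcp` are made up for illustration.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_lines):
    """Count sockets per TCP state from lines in /proc/net/tcp format."""
    counts = Counter()
    for line in proc_lines:
        fields = line.split()
        # Socket rows start with an index such as "0:"; the header does not.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_proc_tcp(path="/proc/net/tcp"):
    """Read the kernel's IPv4 TCP socket table (Linux only)."""
    with open(path) as table:
        return count_tcp_states(table)


if __name__ == "__main__":
    for state, number in read_proc_tcp().most_common():
        print(f"{state:12} {number}")
```

Running this periodically on the server and comparing the totals at the times when requests go unanswered would show whether the failures coincide with an unusually large number of sockets. IPv6 sockets are listed separately in `/proc/net/tcp6` and are not counted here.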
