Thursday, February 3, 2011

Linux server stopped responding

Hi,

One of our Red Hat Linux servers simply stopped responding for a few minutes. For that period there is absolutely no entry in the log files under /var/log/ (messages, etc.) or in the application log files. What else could I check?

During that period the users could not reach the application and I could not ssh to the server. I cannot recall whether I tried to ping it.

After that, everything started working as expected!

  • Do you have any sort of trending or monitoring running against this box? If not, it may be very difficult to diagnose. This behavior could be caused by any number of things. Here are a few ideas off the top of my head:

    • transient network glitch (broadcast storm, routing loop, spanning tree topology change, etc.)
    • I/O contention (did something consume all of the server's RAM, pushing it heavily into swap?)
    • did the server reboot?

    Going forward, I'd highly recommend getting something like Munin set up. With Munin, you'll be able to easily keep tabs on disk IO, memory usage, CPU usage, process count, network traffic, etc. Having this information makes it much easier to troubleshoot this sort of problem. Alternatively, you can install and set up sar, which gathers much of the same data, but logs it in text files, which you can inspect after the fact.

    Kevin : Agreed - it is difficult to troubleshoot without any real data! I will look into your suggestions. Thanks Erik!
    troyengel : quick install that's light and fast: "yum install sysstat; chkconfig sysstat on; service sysstat start" -- this will at least give you "sar" which is collecting stats on everything every 10 minutes and keeps them for 30 days.
    From ErikA
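The swap question raised in this answer can be checked directly on a live box. The following is a minimal sketch, assuming the usual Linux /proc/meminfo layout of "Name: value kB" lines; the function names parse_meminfo and swap_used_fraction are illustrative, not part of any tool mentioned in the thread.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_fraction(meminfo):
    """Return the fraction of swap in use, or 0.0 when no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - meminfo.get("SwapFree", 0)) / total
```

Passing the contents of /proc/meminfo to parse_meminfo and the result to swap_used_fraction gives a single number that could be logged on a schedule. It only shows the current state, so it does not replace the history that sar or Munin would keep.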
  • Given what you've described, the first place I'd look is dmesg ("dmesg | tail"). If a piece of hardware locked up, and nothing got put in /var/log/messages, nine times out of ten, it got put in dmesg.

    Did you happen to notice what the load averages were at when you logged back in?

    Kevin : Nothing in dmesg. Did not check uptime.
    Nathan Powell : Could you check uptime and tell us what it says?
    Kevin : Well, this incident occurred a few days back, so I don't think uptime will help now. I suspect a runaway process caused the CPU to spike, but that does not explain why it returned to normal on its own.
    Nathan Powell : So your uptime is so horrible that you can't be sure why you rebooted last? You have a lot more to be concerned about than this hiccup.
    From BMDan
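The uptime check requested in the comments above still answers the reboot question days later, as long as the box has not been restarted for some other reason since. A minimal sketch, assuming the first field of /proc/uptime is seconds since boot; the function names and any timestamps passed in are hypothetical:

```python
from datetime import datetime, timedelta


def boot_time(uptime_text, now):
    """Return the boot time implied by /proc/uptime-style text.

    The first field of /proc/uptime is the number of seconds since boot.
    """
    seconds_up = float(uptime_text.split()[0])
    return now - timedelta(seconds=seconds_up)


def rebooted_since(uptime_text, now, incident_start):
    """Return True if the machine booted at or after incident_start."""
    return boot_time(uptime_text, now) >= incident_start
```

Calling rebooted_since with the contents of /proc/uptime, datetime.now(), and the approximate start of the outage returns False when the machine has been up continuously since before the incident, which would rule out a reboot as the cause.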
  • You said "After that". How long did it take to recover? One minute? Two? A few seconds?

    Is any filesystem mounted over the network (NFS, AFS, etc.)? This reminds me of the case where you have a network filesystem mounted and the network suddenly goes down. You are then left with a filesystem waiting for a timeout.

    Also, was another machine connected to it? If so, do you log ARP traffic? You may be able to find out whether the server was disconnected from its neighbours.
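One way to answer the network filesystem question is to scan the mount table. This is a minimal sketch, assuming the /proc/mounts layout of "device mount_point fs_type options ..." per line; the set of filesystem types and the name network_mounts are illustrative choices, not an exhaustive list.

```python
# Filesystem types treated as network-backed (an assumption; extend as needed).
NETWORK_FS_TYPES = {"nfs", "nfs4", "afs", "cifs", "smbfs"}


def network_mounts(mounts_text):
    """Return (device, mount_point, fs_type) for each network filesystem."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in NETWORK_FS_TYPES:
            found.append((fields[0], fields[1], fields[2]))
    return found
```

Feeding it the contents of /proc/mounts lists any mounts that could stall while waiting for an unreachable server. An empty result would make the hung network mount explanation unlikely.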
