Last Friday we started to experience slow responses from our production web servers. CPU and memory usage on each server were not above normal. The application servers were responding as expected; only traffic directed to or through the web servers slowed. At first the report seemed to be a nuisance and nothing more, but as time went on the web servers went from a 30 second response time to four or five minutes.
In the web server logs I found the following message, sometimes several times a second:
[Fri Apr 06 07:54:07 2012] [warn] (OS 64)The specified network name is no longer available. : winnt_accept: Asynchronous AcceptEx failed.
The frequency of the messages increased, and on each server we were getting as many as 6 or 7 a second. The slowness got progressively worse until our entire collection of web servers had crashed. After the first web server crashed I tried to restart it from the console, but it would not start. Once the other two web servers crashed, production was no longer available, so we restarted the OS on the first web server that had crashed. It was already broken, so what more could happen? After the reboot the web server started back up normally and the error messages were no longer being generated.
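To put a number on how often the warning appears, the log can be tallied per second. This is only a minimal sketch: it assumes the error log uses the timestamp format shown in the message above, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches the timestamp at the start of an error-log line,
# e.g. "[Fri Apr 06 07:54:07 2012]" (format assumed from the sample above).
TIMESTAMP = re.compile(r"^\[(\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4})\]")

# Text that identifies the warning we care about.
MARKER = "winnt_accept: Asynchronous AcceptEx failed."


def count_acceptex_per_second(lines):
    """Return a Counter mapping each timestamp (one-second resolution)
    to the number of AcceptEx warnings logged in that second."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def busiest_seconds(counts, top=5):
    """Return the `top` seconds with the most warnings, highest first."""
    return counts.most_common(top)
```

Passing an open log file (any iterable of lines) to `count_acceptex_per_second` and then calling `busiest_seconds` on the result shows whether the rate is climbing toward the 6 or 7 a second we saw before the crash.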
We then logged a PMR with IBM for more information about this issue. Within an hour I received a response indicating that this is a known issue in a Windows environment where "other vendor's software may be installed which does not correctly implement AcceptEx or other Winsock functions" (publib.boulder.ibm.com/httpserv/ihsdiag/...).
We had read online that other vendors' software could include “anti-virus, firewall, virtualization, or vpn” (rob.brooks-bilson.com/index.cfm/2008/1/4...). After the outage we went back to each server and verified in Add/Remove Programs that no updates or new software had been installed in the last day. Antivirus updates had run, but several hours before the first record of the error in the log file. IBM did let us know that a fix exists for this error; however, the version of Apache that WebSphere 6.1 runs does not support the fix.
Is anyone else running into a similar issue? How are you dealing with it? Is this a reason to move to Linux, with a reimplementation a year and a half out?