Chris Brumme’s latest treatise contained the sentence “Servers must not page”. That’s because on a server, paging = death. I had occasion to meet somebody from another division who told me this little story: They had a server that went into thrashing death every 10 hours, like clockwork, and had to be rebooted. To mask the problem, the server was converted to a cluster, so what really happened was that the machines in the cluster took turns being rebooted. The clients never noticed anything, but the server administrators were really frustrated. (“Hey Clancy, looks like number 2 needs to be rebooted. She’s sucking mud.”) [Link repaired, 8am.] The reason for the server’s death? Paging. There was a four-bytes-per-request memory leak in one of the programs running on the server. Eventually, all the leakage filled available RAM and the server was forced to page. Paging means slower response, but of course the requests for service kept coming in at the normal rate. So the longer you take to turn a request around, the more requests pile up, and then it takes even longer to turn around the new requests, so even more pile up, and so on. The problem snowballed until the machine just plain keeled over. After much searching, the leak was identified and plugged. Now the servers chug along without a hitch.
(And since the reason for the cluster was to cover for the constant crashes, I suspect they reduced the size of the cluster and saved a lot of money.)
0 comments