MySQL Daemon Goes Belly Up — But We Know Why!!
We have been running MySQL server 5.0.45 in production here since December. Not a hitch until April the 4th. Then, while I was on vacation of course, one of our master servers seg-faulted and died. I looked into things, but really couldn’t tell much. The one odd thing was that the error log showed MyISAM tables marked as crashed BEFORE the actual crash. While I thought that was very odd, I didn’t put two and two together. Not to give things away, but the crashed MyISAM table was the big clue. The next week I left for Santa Clara and the MySQL Users Conference on the 8th. I was only in the office for two days before I was off to California, so the crash on the 4th probably didn’t get as much attention as it should have.
As I have blogged about, the conference was great. I returned home on Friday, the 18th. Turns out that while we were on the plane, the same server crashed again. This was twice in two weeks. Exact same (non) symptoms, i.e., nothing obvious. I had talked to a couple of people at the conference about the situation and the lack of any discernible evidence (other than the MyISAM tables). Both people I talked to said it would probably be hardware. By this point I already thought so myself, but it just reinforced my belief. I thought it was either hardware or my servers just miss me when I am not around! We had set up another server to be the slave of the newly promoted master that replaced the crashed server, so there was no real rush to get the hardware from the colo. Yesterday we went to the colo for some work and brought back the server that had crashed. I plugged it in on our workbench and started up memtest. This morning when I came in … eight memory errors.
So, if you have MyISAM tables and they get marked as crashed in your error log when the server didn’t actually crash, you should consider it very likely that you have some type of hardware problem. Memory just happens to be the most likely candidate.
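If you want to check your own error log for this warning sign, a minimal Python sketch along these lines might help. It assumes the message contains the wording "Table '...' is marked as crashed", which is how MySQL usually reports it, but verify that against your own log. The function name `crashed_tables` and the sample lines are made up for illustration and are not taken from the server described above.

```python
import re

# Assumed wording of the MyISAM message; adjust the pattern to match your log.
CRASHED_TABLE = re.compile(r"Table '([^']+)' is marked as crashed")


def crashed_tables(log_lines):
    """Return the distinct table paths reported as crashed, in first-seen order."""
    seen = []
    for line in log_lines:
        match = CRASHED_TABLE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    # Illustrative lines only, not copied from a real error log.
    sample = [
        "080404 10:15:32 [ERROR] /usr/sbin/mysqld: Table './shop/orders' "
        "is marked as crashed and should be repaired",
        "080404 10:15:40 [Note] /usr/sbin/mysqld: ready for connections.",
    ]
    for table in crashed_tables(sample):
        print(table)
```

If this turns up tables on a server that has not restarted recently, that is the situation described above, and a memtest run is probably worth scheduling.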
Hope this helps someone when they don’t know why the server crashed!!
5 Comments so far
I suggest that you replace your database memory with ECC. That machine could have been silently corrupting data for a while until the server finally noticed and crashed on a row it hadn’t touched in a while.
I am (almost) 100% certain it was ECC. I will double-check tomorrow. But, unless I post another comment about the RAM tomorrow, it is safe to assume it is ECC.
Hi,
I have seen a situation where, due to bad memory, the filesystem started to get corrupted. It was ext2 at that time.
If it is ECC, I would suggest you monitor /var/log/mcelog, or wherever your particular operating system stores “machine check event” information. You should see information about both corrected and uncorrected memory errors in the mcelog if you have ECC memory. This log also contains information about ECC correction in the CPU cache, so it will let you know when a CPU is flaky too.
Further, Linux can be configured to dump on a 2+ bit ECC memory error (afaik ECC can only correct one-bit errors) so that you don’t end up with corrupted data due to memory errors.
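As a rough sketch of this kind of monitoring, the Python below reads the corrected and uncorrected error counters that the Linux EDAC drivers expose under sysfs. Note that this is a different interface from the mcelog file mentioned above, and it assumes an EDAC driver for your memory controller is loaded; without one the directory will be missing and the function returns an empty result. The function name `read_edac_counts` is hypothetical.

```python
from pathlib import Path

# Assumption: EDAC drivers are loaded and expose counters at this sysfs path.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} for each mc* directory."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts  # no EDAC support visible on this machine
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers with missing or unreadable counters
        counts[mc.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    for name, (corrected, uncorrected) in read_edac_counts().items():
        print(f"{name}: corrected={corrected} uncorrected={uncorrected}")
```

Run from cron and compared against the previous reading, a rising corrected count would be an early hint that a DIMM is going bad, well before anything shows up as a crashed table.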
Keith,
You have the core dumps from the crash? If so, send them to me, I’ll take a look.
Regards,
Mark